The Big Data Bottleneck In The Consumer Web

Editor’s note: TechCrunch contributor Semil Shah is an entrepreneur interested in digital media, consumer Internet, and social networks. Shah is based in Palo Alto, and you can follow him on Twitter @semil.

Earlier in the year, I wrote an opinion column on TechCrunch arguing that big data “needs to think bigger.” At the time, I kept hearing the term “big data” over and over, and wondered how much of the emerging insights and techniques would be applied toward the Internet versus the larger problems society faces, such as detecting fraud in financial markets, finding new deposits of natural resources, or helping discover the next big pharma drug.

Yet in my experience monitoring the space since then, I’ve come to the conclusion, for now, that my March 2011 column meant well, but that reality is much further behind than we’d like to think. One would assume, for instance, that big drug companies would be aggressive in adopting new, external, cutting-edge techniques to analyze their own data for new insights, especially with a dangerous patent cliff looming in 2012. It turns out that drug companies often aren’t willing to share data with third parties, which is frequently necessary to take advantage of big data infrastructure. While I believe that eventually the best data science will emerge to help these industries grow in new ways, for now at least, the best opportunities lie in the one area I wanted to gloss over last time: the consumer and mobile web.

Investors see the wave coming. Over the past few months, the top-tier funds have begun to make their moves. Benchmark Capital brought in Craig Weissman from Salesforce as an EIR and invested in Josh James’ new company, Domo; Accel Partners recently announced the creation of a “Big Data Fund” by reallocating monies from existing funds, which will improve data dealflow; and of course, there’s Greylock Partners, which was one of the earliest investors in this space through numerous companies and, most recently, by recruiting DJ Patil to be their “Data Scientist in Residence.”

Since March, I’ve continued to hear the term “big data” uttered by so many, yet so few seem to grasp what it means for us and the web (yours truly included). We all know that the major social networks (like Facebook), broadcast engines (like Twitter), self-expression tools (like Tumblr and Pinterest), and services (like Dropbox) generate ridiculous amounts of data. Add to this the growing Quantified Self movement, where connected devices from companies like Fitbit, Runkeeper, and Jawbone let us track our offline movements and analyze them online.

What happens, then, when the companies holding these big buckets of data go to cash them in?

In the earlier stages of consumer web companies, data can be used to create new products in the hope of increasing engagement metrics. Then, as a company begins to mature, services built on that data can, ideally, generate revenue. In these companies today, data-driven engagement products are often baked into the earliest versions of the products, such as recommendation engines for whom to follow, where to go, or what to watch.

We should not take data as a given, however. To start with, the FTC has been warning technology executives to collect only the data that is core to their business. One might be shocked at just how many well-funded, recognizable startups haven’t been collecting good, structured data, and in some cases, they don’t collect any. Those that do get a handle on their data often lack the in-house talent to make sense of it, because the skills required to do so are rare.

The consumer web companies that do interesting things with data are the ones you’d expect: Google, Facebook, Amazon, LinkedIn, and Zynga, among a small group of others. Most web startups don’t have access to people with the mathematical and statistical backgrounds needed to extract value from the data. Some data scientists I’ve talked to will go so far as to say that consumer startups that start to grow fast need a data scientist as part of the core engineering team as soon as possible, because most engineers working in the consumer space don’t have the skills in statistics and/or machine learning required to make sense of the data. (A data scientist is someone sufficiently trained to ask the proper questions of the data in order to tease out insights that serve as the basis for building new products and that, in turn, generate income for the company.)

And herein lies the rub.

What I’m writing isn’t news. Everyone who watches the space knows it. The reality is that this talent is in short supply. To put it in terms we can understand, for every 100 great iPhone engineers, there may be one or two people who can, on their own, dig into consumer web data and discover and build new and engaging services from it.

It’s been my experience that the majority of those who do, in fact, possess these statistical, mathematical, and machine learning skills are currently busy, diligently applying their rare skills in other industries such as finance, life sciences, and the physical sciences. They often haven’t applied their techniques to data sets culled from the consumer web, nor are they interested in doing so. As a result, there are very, very, very few people like DJ Patil, Pete Skomoroch (of LinkedIn), or Jeff Hammerbacher (of Cloudera) who truly understand these techniques as they relate to the world wide web. Since we can’t clone them, the alternative has been to build data teams of data specialists and pair them with people who have extensive consumer web data experience.

So, the next time you hear someone talk about “big data” in the context of the consumer web, realize that, yes, valuable data, whether big or small, is being collected with every click we make. The big companies with resources are keenly aware of the opportunity, but most web startups don’t have data scientists as part of their early teams, and even if they wanted them, those folks are hard to find. Therefore, it’s my opinion that “big data” is a term we’ll hear for a very long time to come. Data generated by the web will produce some of the largest data sets ever known, if it hasn’t already, and somewhere within all those billions and billions of likes, retweets, upvotes, reblogs, and repins may reside truths that, yet again, change the way we live. But more data scientists will be needed to unlock them.

Photo Credit / Creative Commons by An&
