The Big Data Bottleneck In The Consumer Web

Editor’s note: TechCrunch contributor Semil Shah is an entrepreneur interested in digital media, consumer Internet, and social networks. Shah is based in Palo Alto, and you can follow him on Twitter @semil.

Earlier in the year, I wrote an opinion column on TechCrunch arguing that big data “needs to think bigger.” At the time, I kept hearing the term “big data” over and over, and wondered how much of the emerging insights and techniques would be applied toward the Internet versus the larger problems society faces, such as detecting fraud in financial markets, finding new deposits of natural resources, or helping discover the next big pharma drug.

Yet in monitoring the space since then, I’ve come to the conclusion, for now, that my March 2011 column meant well, but that reality is much further behind than we’d like to think. One would assume, for instance, that big drug companies would be aggressive in adopting new, external, cutting-edge techniques to analyze their own data for new insights, especially with a dangerous patent cliff looming in 2012. It turns out that drug companies are often unwilling to share data with third parties, a step that is frequently necessary to take advantage of big data infrastructure. While I believe that eventually the best data science will emerge to help these industries grow in new ways, for now at least, the best opportunities lie in the one area I wanted to gloss over last time: the consumer and mobile web.

Investors see the wave coming. Over the past few months, the top-tier funds have begun to make their moves. Benchmark Capital brought in Craig Weissman from Salesforce as an EIR and invested in Josh James’ new company, Domo; Accel Partners recently announced the creation of a “Big Data Fund” by reallocating monies from existing funds, which will improve data dealflow; and of course, there’s Greylock Partners, which was one of the earliest investors in this space through numerous companies and, most recently, by recruiting DJ Patil to be their “Data Scientist in Residence.”

Since March, I’ve continued to hear the term “big data” uttered by so many, yet so few seem to grasp what it means for us and the web (yours truly included). We all know that the major social networks (like Facebook), broadcast engines (like Twitter), self-expression tools (like Tumblr and Pinterest), and services (like Dropbox) generate ridiculous amounts of data. Add to this the growing Quantified Self movement, where connected devices from companies like Fitbit, Runkeeper, and Jawbone let us track our offline movements and analyze them online.

What happens, then, when the companies holding these big buckets of data go to cash them in?

In the earlier stages of consumer web companies, data can be used to create new products in the hope of increasing engagement metrics. Then, as a company begins to mature, services can be built on that data that, ideally, generate revenue. In these companies today, data-driven engagement products are oftentimes baked into the earliest versions of the products, such as recommendation engines for whom to follow, where to go, or what to watch.

We should not take data as a given, however. To start with, the FTC has been warning technology executives to collect only the data that is core to their business. One might be shocked at just how many well-funded, recognizable startups haven’t been collecting good, structured data, and in some cases, they don’t collect any. Those that do get a handle on their data oftentimes do not possess the in-house talent to make sense of it, because the skills required to do so are rare.

The consumer web companies that do interesting things with data are the ones you’d expect: Google, Facebook, Amazon, LinkedIn, and Zynga, among a small group of others. Most web startups don’t have access to people with the mathematical and statistical backgrounds needed to extract value from the data. Some data scientists I’ve talked to will go so far as to say that consumer startups that start to grow fast need a data scientist as part of the core engineering team as soon as possible, because most engineers working in the consumer space don’t have the skills in statistics and/or machine learning required to make sense of the data. (A data scientist is someone sufficiently trained to ask the proper questions of the data in order to tease out insights that serve as the basis for building new products and that, in turn, generate income for the company.)

And herein lies the rub.

What I’m writing isn’t news. Everyone who watches the space knows it. The reality is that this talent is in short supply. To put it in terms we can understand, for every 100 great iPhone engineers, there may be one or two people who can, on their own, dig into consumer web data and discover and build new and engaging services from it.

It’s been my experience that the majority of those who do, in fact, possess these statistical, mathematical, and machine learning skills are currently busy, diligently applying their rare skills in other industries such as finance, life sciences, and the physical sciences. They oftentimes haven’t applied their techniques to data sets culled from the consumer web, nor are they interested in doing so. As a result, there are very, very few people like DJ Patil, Pete Skomoroch (of LinkedIn), or Jeff Hammerbacher (of Cloudera) who truly understand these techniques as they relate to the World Wide Web. Since we can’t clone them, the alternative has been to build data teams that pair data specialists with people who have extensive consumer web data experience.

So, the next time you hear someone talk about “big data” in the context of the consumer web, realize that, yes, valuable data, whether big or small, is being collected with every click we make. The big companies with resources are keenly aware of the opportunity, but most web startups don’t have data scientists as part of their early teams, and even if they wanted them, those folks are hard to find. Therefore, it’s my opinion that “big data” is a term we’ll hear for a very long time to come. Data generated by the web will produce some of the largest data sets ever known, if it hasn’t already, and somewhere within all those billions and billions of likes, retweets, upvotes, reblogs, and repins may reside truths that, yet again, change the way we live. But more data scientists will be needed to unlock them.
