Big Data Is Less About Size, And More About Freedom
Guest Author
Mar 16, 2010

Editor’s note: Big Data has been around for a long time in credit card transactions, phone call records, and financial markets. Companies like AT&T, Visa, Bank of America, eBay, Google, Amazon and more have massive databases they mine for competitive advantage. But lately, Big Data is finding its way to the smallest startups. The Web and cloud computing bring Big Data everywhere. But what exactly is pushing Big Data forward?

To answer that we brought in an expert, Bradford Cross. Bradford is the Co-Founder and Head of Research at FlightCaster. FlightCaster is backed by Y Combinator, Tandem Entrepreneurs and Sherpalo Ventures. The company analyzes large data sets to predict flight delays. Bradford is chair of the Dealing with Big Data track at Cloud Connect this week.

We are in a Renaissance for computer science, engineering, and learning from data right now. The scale of data and computations is an important issue, but the data age is less about the raw size of your data, and more about the cool stuff you can do with it. Now that there is so much data, it is time to unlock its value. Really neat things are happening already, like the way the people of the world can educate themselves on all manner of issues and topics, or the way data and computing serve as leverage in other scientific and technical endeavors. There will be lots of amazing stuff on the web, but innovation will come in other domains as well.

The recent big data trend is about the democratization of large data more than its growth. In articles like the Economist’s recent piece on the data deluge, we hear about big data everywhere. We hear about what big data and the cloud mean for the enterprise, but they have had big data for a long time. eBay manages petabytes in its Teradata and Greenplum data warehouses. Sophisticated startups extracting value from big data is also nothing new—it has been happening at least since the days of Yahoo! and Google, and they have done it without the data warehousing folks.

Now, focused early-stage startups can get up and running faster than ever. Less technical analysts at companies like Facebook and Twitter can access massive amounts of data easily. Even individuals can undertake cool projects with big data, as Pete Skomoroch of Data Wrangling did with trending topics for Wikipedia.

Why Now?

We do not have to build all our own hardware and software infrastructure anymore.

Pioneers such as Amazon have given us the cloud, where we can run very large server clusters at a low startup cost. Pioneers like Google have paved the way for open source projects like Hadoop and HBase, which are backed by big-company contributors like Facebook.

The combination has paved the way for a new class of data-driven startup like Aardvark (just acquired by Google) and Factual. It has reduced both cost and time to market for these startups, as we showed with FlightCaster. And it has allowed startups that were not necessarily data-driven to become more analytical as they evolved, such as Facebook, LinkedIn, Twitter, and many others.

So we have big data, the cloud, and open source facilitating new data-driven startups. I like to break this trend down from the technical perspective into three chunks: storing data, processing data, and learning from data. I define “learning from data” to mean data mining, AI, machine learning, statistics, and so on.

Supersize my data. Oh wait, I’ll just have a Medium.

The first time I heard the “Medium Data” idea was from Christophe Bisciglia and Todd Lipcon at Cloudera. I think the concept is great. Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times. For instance, a GigaOm article about big data in the cloud states:

What is becoming increasingly clear is that Big Data is the future of IT. To that end, tackling Big Data will determine the winners and losers in the next wave of cloud computing innovation.

The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data.

The more important topics are the specifics of the storage and processing infrastructure and what approaches best suit each problem. How much data do you have and what are you trying to do with it? Do you need to do offline batch processing of huge amounts of data to compute statistics? Do you need all your data available online to back queries from a web application or a service API?

Once your data and its processing are large enough to require distributing the data and the work among machines across network boundaries, things get a lot harder. You have to deal with distributed computing and make tradeoffs like a real computer scientist.

Big Data & The Cloud: Viral Buzzwords 4.0!

The cloud and hosted services present very interesting opportunities. One of the greatest is that people can leverage the a la carte economics of elastic computing to do things that were prohibitively expensive when they had to build and maintain their own hardware infrastructure. The interesting parts about the current cloud are its low barrier to entry and elastic cost efficiency, the speed with which new entrants can set up, and the elastic capability to run a 100-machine cluster for 1 hour if that is what is needed.
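To make the a la carte economics concrete, here is a minimal back-of-the-envelope sketch in Python. The hourly rate and the purchase price are made-up placeholder numbers, not real prices from any provider, and the sketch ignores operations, power, and staffing costs.

```python
def elastic_cost(machines, hours, rate_per_machine_hour):
    """Cost of renting `machines` servers for `hours` hours."""
    return machines * hours * rate_per_machine_hour


def owned_cost(machines, purchase_price_per_machine):
    """Upfront cost of buying the same number of servers outright."""
    return machines * purchase_price_per_machine


# Hypothetical placeholder figures, chosen only for illustration.
HOURLY_RATE = 0.10        # assumed rental price per machine-hour
PURCHASE_PRICE = 2000.0   # assumed purchase price per machine

# One burst: a 100-machine cluster for a single hour.
burst = elastic_cost(100, 1, HOURLY_RATE)

# Buying 100 machines for the same one-hour job.
upfront = owned_cost(100, PURCHASE_PRICE)

# Hours of use per machine at which renting has cost as much as buying.
breakeven_hours = PURCHASE_PRICE / HOURLY_RATE

print(f"burst: {burst:.2f}, upfront: {upfront:.2f}, "
      f"break-even: {breakeven_hours:.0f} hours")
```

Under these assumed numbers the occasional burst is tiny next to the upfront purchase, while constant use eventually crosses the break-even point. The exact figures matter less than the shape of the tradeoff.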

We started FlightCaster almost a year ago, and it is a good example of how startups can leverage cloud compute and storage resources, mix some open source like Hadoop with some data mining, and create interesting new technologies with relatively low capital upfront.

The cloud is not cheaper in general. Once people scale to a certain point, they move off the cloud onto dedicated hardware—not the other way around. That may change, and better hosted services may play a role in the transition, but that will take a while. In the meantime, the interesting part of the cloud is the use of elastic resources and the ability to get up and going quickly. The interesting part is the freedom it gives startups to try things they would never otherwise do.

Another notable thing about the cloud is the new architectures emerging as a result of economic and resource tradeoffs.

Storing large amounts of data in the cloud is much cheaper with blob stores like Amazon S3 than with an always-up cluster for a distributed datastore. If you do mostly offline batch processing and you do not need bulk storage to be online, then it is an attractive setup.

Storage and NoSQL

Here is another excerpt from the piece on the future of big data in the cloud:

A Big Data stack…will also need to emerge before cloud computing will be broadly embraced by the enterprise. In many ways, this cloud stack has already been implemented, albeit in primitive form, at large-scale Internet data centers, which quickly encountered the scaling limitations of traditional SQL databases as the volume of data exploded. Instead, high-performance, scalable/distributed, object-orientated data stores are being developed internally and implemented at scale…large web properties have been building their own so-called “NoSQL” databases, also known as distributed, non-relational database systems (DNRDBMS).

There are several misguided points here. First, there is not going to be a big data or cloud stack. Distributed systems are about making tradeoffs, and the trend is toward problem-specific solutions rather than one-size-fits-all stacks. Second, enterprises already have their solution: expensive data warehousing and consulting support. Will open source projects like Hadoop, supported by people like Cloudera, take a chunk of the business? Sure. But as I mentioned earlier, the most interesting part about big data and the cloud is not cheaper alternatives for the enterprise; it is the opportunities they create for data-driven startups.

There is a lot of talk about the NoSQL movement. The big idea here is that distributed systems are hard, require tradeoffs, and sometimes we are better off with data storage and processing that are specific to what we are doing with the data. Sometimes even with a small amount of data on a single node, there are better alternatives to SQL queries and relational databases—time series data has long been a good example.

Processing and Hadoop: The Elephant In The Room

There is a broad range of needs for processing large amounts of data. These range from simple needs like calculations for log analysis that just need to occur at scale, to middle of the road needs like BI, to complex needs like scalable modern machine learning and retrieval systems.

There are different approaches one can use to service specific needs. Again, we see the pattern of moving away from one-size-fits-all stacks, and toward building for your needs. That said, there are very generic abstractions like Map-Reduce that work well for a lot of use cases. Distributed systems are hard to get right, so when something like Hadoop gets a lot of momentum, it retains that momentum until alternatives have the time to mature enough to solve the hard problems with fault tolerance, performance, and so forth. Not everyone is Leonardo da Vinci, so people should not attempt to create these systems on their own unless they really know what they are doing. In that sense, the cloud and big data are facilitators of open source.
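For readers who have not seen the Map-Reduce abstraction, here is a toy, single-process sketch in Python of its three phases (map, shuffle, reduce) applied to counting tokens in log lines. It is only an illustration of the programming model, not Hadoop's API. The function names and the sample log lines are invented for the example, and a real framework would run the map and reduce steps on many machines and handle failures.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (token, 1) pair for every whitespace-separated token."""
    for token in line.lower().split():
        yield (token, 1)


def shuffle(pairs):
    """Group all emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse the list of values for one key into a single count."""
    return (key, sum(values))


def token_count(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


# Made-up log lines, for illustration only.
logs = ["GET /index", "GET /about", "POST /index"]
counts = token_count(logs)
print(counts)
```

The map and reduce functions never see the whole dataset at once. That is what lets a framework spread them across machines without the programmer changing the logic.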

An important aspect of processing at scale is abstraction. Writing complex or even simple computations in raw Map-Reduce is verbose for programmers and intimidating for others who might want to play with the data. Abstractions over Map-Reduce like Pig and Hive make simple things easy, and abstractions like Cascading make hard things possible. The Map-Reduce paradigm, and Hadoop in particular, have been a big success. That said, Map-Reduce is not the only important piece of compute infrastructure. Message queues serve as the backbone of a lot of compute architectures – implementations of AMQP, such as RabbitMQ, are a prime example. You can accomplish a lot with producers, consumers, and a messaging system. Distributed storage and processing systems can also be very tricky to configure and deploy, requiring a pretty deep understanding of the system – hence the business case for folks like Cloudera.
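The producer and consumer pattern can be sketched in a few lines of Python. This example uses the standard library's in-process `queue.Queue` as a stand-in for a real message broker; it does not speak AMQP or use any RabbitMQ client, and the squaring "work" is just a placeholder task chosen for illustration.

```python
import queue
import threading


def producer(q, items):
    """Publish each item to the queue, then a sentinel to signal the end."""
    for item in items:
        q.put(item)
    q.put(None)  # sentinel: tells the consumer there is no more work


def consumer(q, results):
    """Pull messages until the sentinel arrives, doing placeholder work."""
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * item)  # placeholder for real processing


q = queue.Queue()
results = []

# One consumer thread drains the queue while the main thread produces.
worker = threading.Thread(target=consumer, args=(q, results))
worker.start()
producer(q, range(5))
worker.join()

print(results)
```

The producer and the consumer only share the queue, so either side can be scaled or replaced independently. With a networked broker in place of the in-process queue, the same shape spans machines.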

Learning from Big Data

Hal Varian, Google’s Chief Economist, recently said,

The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it

Unfortunately for those of us working on these problems in real life, it is not so simple. The archetypal data-renaissance man is mathematician, statistician, computer scientist, machine learner, and engineer all rolled into one. There are opportunities where you can lack some of these skills and work with a team that supplements your weak points—a startup is not one of those.

Now that we can store so much data, it is attractive to do previously unimaginable things with it. We are sure to see cool applications in fields from the internet to biotechnology to nanotechnology and fundamental materials science research. Almost all advances in every field of science and technology are now heavily dependent upon data and computing. Machine learning is serving a fantastic role as a bridge between mathematical and statistical models and the worlds of AI, computer science, and software engineering. We are exploring applications in learning from text, social networks, data from scientific experiments, and any other data sources we can get our hands on.

The data renaissance does present some difficult issues. There are not many places one can receive a good education on working on these problems at large scale. Scaling our modeling and optimization algorithms is hard. We need to figure out how to partition and parallelize, or sometimes trade speed and scale for approximately correct calculations. Another issue is that we are often using simplistic models, albeit with pretty good results in many cases. We would like to move toward a deeper approximation of real intelligence.
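As a small illustration of those two options, here is a hedged Python sketch that computes a mean in two ways: exactly, by partitioning the data and combining per-partition summaries, and approximately, from a random sample. The function names, the partition count, and the synthetic data are all assumptions made for the example. A real system would compute the partial summaries on separate machines.

```python
import random


def partial_stats(chunk):
    """Summary a single worker could compute on its own partition."""
    return (len(chunk), sum(chunk))


def combine(stats):
    """Merge per-partition (count, sum) summaries into a global mean."""
    total_n = sum(n for n, _ in stats)
    total_sum = sum(s for _, s in stats)
    return total_sum / total_n


def partitioned_mean(data, n_parts):
    """Exact mean computed from independent partitions."""
    size = (len(data) + n_parts - 1) // n_parts  # ceiling division
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    return combine([partial_stats(c) for c in chunks])


def sampled_mean(data, sample_size, seed=0):
    """Approximate mean from a random sample: cheaper, but not exact."""
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)
    return sum(sample) / sample_size


# Synthetic data, for illustration only.
data = list(range(1, 10001))
exact = partitioned_mean(data, 8)
approx = sampled_mean(data, 500)

print(f"exact: {exact}, approximate: {approx:.1f}")
```

The mean partitions cleanly because counts and sums can be added together. Many modeling and optimization algorithms do not decompose this neatly, which is where much of the difficulty lies.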

But the data renaissance is here. Be a part of it.

  • http://www.noizivy.org NZN

    Interesting choice of pic for article. Big Data is a big concept and big reality. Corporate structure takes big data and transforms everything into a language construct of automation. Good? Bad? Does it matter?

    That pic reminds us of the Matrix; an idea drawn out to the extreme wherein corporate structure, profit driven managerial efficiency mixed with global and one day universal scales of competitive advantage will take the leverage of Big Data to the extreme.

    At that extreme, humans cease to be human creatures bound by time to evolution, transcending limitations as bits of Big Data. We already see and feel the consequences of this when dealing with credit agencies, the IRS, Verizon and ATT, etc.

    What part of your human experience do you own? You arrive from a womb, for the time being, and what belongs to you? Your DNA is Big Data. Your citizen status is Big Data. You are a small part of Big Data. So where is the freedom the title of this article alludes to?

    No one seems to be worried about freedom anymore. It’s as if everyone thinks money will just buy it for them. Problem is, if you don’t own that which you actually should be regarded as the owner of before you are incorporated into Big Data… you never will. Not really. Not in any way that matters. That’s Big Data slavery, and we are already being exposed to it here and now in the Facebook era of our Big Data lives.

    You may think you have control… but you don’t.

  • http://www.linkedin.com/in/ssaikia Shankar Saikia

    BIG DATA PROBLEM SOLVED WITH SMALL INVESTMENT!

    Flightcaster is a great example of using cloud-based resources (at a relatively low investment) to solve a big business problem – time! Time is one of our most valuable resources, and flightcaster can predict timeliness of flights. There is the return on investment in Flightcaster – it helps you save time!!!

    This is a fantastic article – looking forward to reading more like this.

  • http://tjoozey.com/?p=3588 Bradford Cross: Big Data Is Less About Size, And More About Freedom

    [...] full post on Hacker News If you enjoyed this article, please consider sharing it! Tagged with: ABOUT • [...]

  • PeopleCollector.com

    I did not manage to find any prediction’s accuracy data on flightcaster web-site. Without these data, service proposition is questionable to say the least. But anyway, thanks for the article. Not everything is revolving around iPhone, Twitter and Facebook in Silicon Valley start up arena.

  • http://gondwanaland.com/mlog/ Mike Linksvayer

    If you search the article for ‘freedom’ you find:

    In the meantime, the interesting part of the cloud is the use of elastic resources and the ability to get up and going quickly. The interesting part is the freedom it gives startups to try things they would never otherwise do.

    Flexibility or similar may have been a better word given other relevant uses of the word freedom, but it’s pretty clear what Bradford Cross means in the post here.

  • John Smith

    I saw a preso by flightcaster recently and my understanding was that there was one specific piece of data available from airlines that they used (weighted much heavier than anything else) to determine (earlier than other services – because the other services hadn’t figured out that they could gain access to this data) when flights would be delayed. Maybe that’s inaccurate, but if it’s not, then this entire post – though interesting – is really not relevant to their business. And that, in and of itself, is interesting.

  • http://Flightccaster.com Bradford Cross

    Hi John,

    I am not sure about the presentation you speak of, and I am sorry if you got the wrong understanding, but we use data from various sources. Rest assured that we wish this magic input variable you speak of existed.

  • http://analyticbits.wordpress.com/2010/03/17/the-link-between-business-intelligence-and-big-data/ The Link Between Business Intelligence and Big Data « Analytic Bits

    [...] March 17, 2010 Kevin Paul Yapjoco Leave a comment Go to comments Bradford Cross has authored an article on TechCrunch on Big Data. He is the Head of Research at FlightCaster, a company that analyzes large data sets to predict [...]

  • http://analyticbits.wordpress.com/2010/03/17/business-intelligence-and-big-data/ Business Intelligence and Big Data « Analytic Bits

    [...] March 17, 2010 Kevin Paul Yapjoco Leave a comment Go to comments Bradford Cross has authored an article on TechCrunch on Big Data. He is the Head of Research at FlightCaster, a company that analyzes large data sets to predict [...]

  • http://nessence.wordpress.com/ Alex Leverington

    This is a great post which any engineer commanding a large userbase or dataset should read.

    “There are opportunities where you can lack some of these skills and work with a team that supplements your weak points—a startup is not one of those.”

    This is the only point I somewhat disagree with. What kind of startup with a “medium” dataset wouldn’t have some form of funding?

    The hard part of hiring these skills is convincing founders of their relevance.

  • tren

    I am waiting to see what the critics will say once cloud computing services have finally surfaced on a global scale.

  • http://jardenberg.se/b/jardenberg-kommenterar-2010-03-17/ jardenberg kommenterar – 2010-03-17 | jardenberg unedited

    [...] Big Data Is Less About Size, And More About Freedom [...]

  • http://mndoci.com/2010/03/17/data-democratized/ Data democratized

    [...] a brilliant piece entitled Big Data Is Less About Size, And More About Freedom, Bradford Cross talks about about the democratization of analyzing data at scale. As he so [...]

  • http://www.adrianscott.com/ Adrian Scott

    I was waiting to read about the freedom part also… I suggest updating to a headline more ‘correlated’ with the article itself…

  • http://www.ArticlePlayground.com/ Article Playground

    Hope everything works out :)

  • http://www.adogy.com John

    I love how they talk about “The Cloud” and how it’s going to help with everything. Although I don’t disagree with them… I think it would be nice if someone (like TechCrunch) stepped up and defined what “The Cloud” really is. There are a million people out there with cloud products but no standard definition of what it is… Most are just glorified VPS. Sorry, going off on a tangent!

  • http://bsiscovick.tumblr.com Ben

    Great post. This captures the M.O. of our fund, IA Ventures, and is the reason we exist.

    http://www.iaventurepartners.com

    Thanks for sharing.

    -b

  • http://www.viewsflow.com/w/2Gqc Big Data Is Less About Size, And More About Freedom – Viewsflow

    [...] Big Data Is Less About Size, And More About FreedomClose [...]

  • Marco Mascioli

    There has to be a better way to store data. I am of the opinion that we try to keep too much and then we compound it by repeating the information across multiple storage and query needs. The cloud only perpetuates this problem. If we can better define the data we need through data mining and predictive modeling we can drastically reduce the data we keep. There is a lot of merit to efficiency, and I think we can create a fully integrated solution: something that ties together data entry, data storage, and data mining in a closed loop. We need to build smarter databases that can intelligently throw away information. I spend most of my time in data mining just cutting out the fat; I could triple my productivity if I had less of what I don’t need.

  • http://www.pageonepr.com/blog/2010/03/16/cloudera-80/ Cloudera – Page One PR – Public Relations and Social Media in Silicon Valley

    [...] TechCrunch Big Data Is Less About Size, And More About Freedom [...]

  • http://www.voltdb.com Andy Ellicott

    Great article!

    After the DBMS innovation drought of the late 90′s, it’s great to see so much data management start-up activity (and adoption), along with the business/social computing models it’s helping to fuel.

    One of the busiest DBMS inventors of the last 30 years, Mike Stonebraker (inventor/co-founder of Ingres, Postgres, Illustra, et al), has been a big proponent of specialized databases (“One Size Fits All: An Idea Whose Time Has Come and Gone”).

    He’s also brought a number of specialized SQL DBMS to market over the last 5 years… Streambase (CEP/stream analytics), Vertica (data warehousing), SciDB (for massive scientific data sets).

    His latest DBMS, VoltDB is in beta now. It’s a scalable SQL OLTP rdbms alternative to traditional SQL DBMS and to noSQL stores. It keeps SQL and ACID, but chucks a whole bunch of other traditional OLTP RDBMS overhead in order to achieve scale-out performance. Here’s a paper he presented on why traditional SQL databases don’t scale.

  • Monkey’s Uncle

    With all due respect, Mr. Cross would have done well to incorporate mention of Algebraix’ game changing enterprise data management software. You can check it out here: http://www.algebraixdata.com

    Honesty requires that I disclose I own stock in this company, but Algebraix’ solution solves many, if not most of the big issues of big data. It can ETL large RDBS in a matter of moments and speeds up queries significantly.

    This is a truly revolutionary product that will likely become the standard for data management. Their forthcoming integrated XML-RDBMS product allows real-time data analysis between structured and unstructured data sources. While relational data experts will tell you this is impossible, it is not.

    As they say, software speaks louder than words; go see for yourself

  • http://dismalsci.wordpress.com/2010/03/17/links-for-2010-03-17/ links for 2010-03-17 « that dismal science

    [...] Big Data Is Less About Size, And More About Freedom (tags: big cloud cloudera google hadoop scalability s3 nosql machinelearning flightcaster yahoo techcrunch bradford-cross bigdata database data) [...]

  • http://austgate.co.uk/2010/03/mining-data-driving-the-web/ Mining data driving the web? « The Aust Gate

    [...] seen an article on Techcrunch by Bradford Cross of Flightcaster regarding the growth of data on the Web. He appears to argue that data and its uses will drive the Web soon, writing: the data [...]

  • http://austgate.co.uk/2010/03/growing-and-using-data/ Growing and using data « The Aust Gate

    [...] seen an article on Techcrunch by Bradford Cross of Flightcaster regarding the growth of data on the Web. He appears to argue that data and its uses will drive the Web soon, writing: the data [...]

  • Ilan Ben Menachem

    This is a fantastic article – looking forward to reading more like this.

  • lolol

    NZN… you might want to put the joint down and catch your breath before freaking out as if technological innovation in data storage is the end of “privacy as we know it” and the beginning of the Orwellian apocalypse. Shouldn’t you be reading some conspiracy theory blog?

  • http://drawntoscalehq.com Nick Dimiduk

    Alex: I believe there exists a whole class of company which is just barely coming into existence, one much like FlightCaster, in fact. This company needs new kinds of tools for easily dealing with their “medium to big” data. That is exactly what I’m building.

    I think you are making an assumption which is false: I believe the size of data is no longer tied directly to the size of the company. Look at all the facebook apps out there who gain millions of users in their first month. Small company, “big” data.

  • http://canagnos.wordpress.com/2010/03/18/big-data/ Big Data « canagnos's Blog

    [...] 18, 2010 http://techcrunch.com/2010/03/16/big-data-freedom/ Posted by canagnos Filed in Uncategorized Leave a Comment [...]

  • Scott E

    I agree that a real definition of what is considered to be “The Cloud” is much needed and requested. Lots of IT professionals have a vague idea about what the term entails, but the details are hard to find and most descriptions are overly vague.

    The Tech Terms site describes Cloud Computing (http://www.techterms.com/definition/cloudcomputing) as “applications and services offered over the Internet. These services are offered from data centers all over the world, which collectively are referred to as the ‘cloud.’ This metaphor represents the intangible, yet universal nature of the Internet.” Yet Microsoft will be selling Office 2010 in an “on-premises cloud” model when it is released, and there are other companies using the term “local cloud” as well.

    Since this is still an emerging technology and is in constant flux, it is no doubt difficult to say that any definition of “the cloud” today will still be relevant and accurate in the future. The way I think of it, however, is that any data storage or processing that is not done on a local computer and that what used to be called a server room or farm is now being included in the cloud terminology.

    Is there a better, more concrete definition that anyone can provide?

  • http://redmonk.com/sogrady/2010/03/18/the-problem-with-big-data/ tecosystems » The Problem with Big Data

    [...] “The first time I heard the “Medium Data” idea was from Christophe Bisciglia and Todd Lipcon at Cloudera. I think the concept is great. Companies do not have to be at Google scale to have data issues. Scalability issues occur with less than a terabyte of data. If a company works with relational databases and SQL, they can drown in complex data transformations and calculations that do not fit naturally into sequences of set operations. In that sense, the “big data” mantra is misguided at times…The big issue is not that everyone will suddenly operate at petabyte scale; a lot of folks do not have that much data.” – Bradford Cross, Big Data Is Less About Size, And More About Freedom [...]

  • http://www.iaventurepartners.com/index.php/the-emergence-of-abig-data-stack The Emergence Of A Big Data Stack « IA Ventures

    [...] Cross recently wrote a great article on Techcrunch about the Big Data Renaissance. If you haven’t read it yet, you should check it [...]

  • http://followthedata.wordpress.com/2010/03/22/data-hype/ Data hype! « Follow the Data

    [...] data is less about size, more about freedom. Quote: “[T]he data renaissance is here. Be a part of it.” [Bradford Cross for [...]

  • http://www.rainstor.com Ramon Chen

    Terrific overview Brad and great points about the opportunities for startups.
    I took the liberty of linking to your post through a related post on the topic of Retaining Big Data for compliance purposes. There is no doubt that Big Data is spawning new insights and opportunities for startups, but to quote Uncle Ben in Spiderman “With great power comes great responsibility” …. whether you like it or not in the case of government regulations to retain Big Data transactions. Would love your thoughts on the topic, please take a look at our post at http://bit.ly/cNavEK

  • https://www.hypios.com/thinking/2010/04/02/five-for-friday-five-good-reads/ Five for Friday: Five Good Reads « Hypios – Thinking

    [...] Big Data Freedom:  Or how the ’10’s will be all about Data “Now that we can store so much data, [...]

  • http://redmonk.com/sogrady/2010/04/07/why-im-taking-statistics/ tecosystems » Why I’m Taking Statistics

    [...] And the infrastructure required to run the tools against the data are available, as Bradford said, is economically accessible even to [...]

  • http://www.m2mmarketplace.com/blog/2010/04/datameer-raises-2-5-million-for-apache-hadoop-based-analytics-platform/ Datameer Raises $2.5 Million For Apache Hadoop-Based Analytics Platform | M2M Marketplace

    [...] a startup that offers a big data analytics solution built on Apache Hadoop, has raised $2.5 million in Series A funding from [...]

  • http://mndoci.com/2010/05/02/data-driven-research-products/ Data-driven research products

    [...] Big Data Is Less About Size, And More About Freedom (techcrunch.com) [...]

  • http://www.facebook.com/profile.php?id=847260071 Stephen A Cronin

    It is important to have different viewpoints of the problems of the "data renaissance." I was inspired to write the "dark side" to this post. Outside of Silicon Valley, the prospects of solving data problems as FlightCaster did change drastically as funding and talent drop on the exponential curve. http://www.skriptfoundry.com/wordpress/?p=354

    The question I want to ask is if there is enough critical mass of talent to keep pace with the growth of data drivers in enterprise worldwide?
