The Business Economics And Opportunity Of Open-Source Data Science

Editor’s note: David Smith leads the open source solutions group at Revolution Analytics (a subsidiary of Microsoft). He writes daily about applications of R and predictive analytics on the Revolutions blog and is the co-author of An Introduction to R. 

Mythology can be a useful tool for uniting cultures and countries. It’s somewhat less useful for understanding the rise of modern technologies. Here are two myths that are widely associated with the big data revolution:

Myth 1: It arose spontaneously, without precedent

Myth 2: It’s largely hype, with little practical business value

Dispelling Myth No. 1

Let’s take a look at the notion that the big data revolution was some kind of “overnight sensation” that magically appeared with no warning. In reality, the big data revolution began more than a decade ago. It was ignited by search companies like Google and Yahoo, whose business models required new frameworks and techniques for processing huge amounts of data very rapidly.

Existing database technologies didn’t fully address the issues facing these new business models. Moreover, the search companies had neither the time nor the appetite for the expensive hardware and software that a then-traditional IT solution would have required to solve the problem.

They came up with a cost-effective in-house solution. Using open source software running on inexpensive commodity hardware, they developed pioneering frameworks such as Hadoop and MapReduce for reliably handling big data.
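To make that pattern concrete, here is a toy, hypothetical sketch of the map/reduce idea written in plain R rather than Hadoop’s actual API: a tiny two-line “corpus,” a map step that counts words per line, and a reduce step that merges the partial counts.

    # Toy illustration of the map/reduce pattern: word counting in plain R.
    # The two-line "corpus" stands in for data far too large for one machine.
    lines <- c("big data big value", "open source data")

    # Map step: each line independently produces its own partial word counts
    mapped <- lapply(strsplit(lines, " "), table)

    # Reduce step: merge the partial counts into a single result
    merge_counts <- function(a, b) {
      words <- union(names(a), names(b))
      sapply(words, function(w) sum(a[w], b[w], na.rm = TRUE))
    }
    word_counts <- Reduce(merge_counts, mapped)
    word_counts   # big = 2, data = 2, value = 1, open = 1, source = 1

In a framework like Hadoop, the same shape of computation is spread across many machines, with the framework handling data placement, scheduling and failure recovery.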

Far from arising spontaneously, the big data revolution was driven by straightforward economics. The search companies simply could not afford to pay traditional vendors to develop the complex systems required to process big data. Neither was it economical to license proprietary software on these rapidly growing clusters under traditional pricing models. Instead, the search companies did it themselves, working with academics, startups and smaller vendors. They also leveraged the power of the global open source community, which gave them access to many of the world’s best and smartest software programmers.

Prior to the advent of frameworks like Hadoop, companies were forced into a painful decision-making process, continually deciding how much data to store and how much to jettison. Data storage wasn’t cheap, and it usually took months to update or modify the analytics provided by traditional software vendors.

The big data revolution changed all of that. The combination of open source software, affordable hardware and reliable high-bandwidth Internet services meant that data storage was no longer an ongoing financial dilemma. The advanced analytics that extracted the value from the data — developed using open source tools — could be updated and modified much more quickly and easily than proprietary software from traditional vendors.

The rise of big data was evolutionary, not magical. To be sure, it was a fairly rapid evolution, but it didn’t take place overnight. Many of the advances in big data analytics were implemented in R, a programming language devised in the 1990s by two academics in New Zealand. R was developed specifically for statistical analysis, and is consistently ranked the most popular language for data science. Today, thousands of companies and organizations use R for data science applications. Here are just a few examples:

  • Google uses R to calculate the ROI on advertising campaigns.
  • Ford uses R to improve the design of its vehicles.
  • Twitter uses R to monitor user experience.
  • The US National Weather Service uses R to predict severe flooding.
  • The Rockefeller Institute of Government uses R to develop models for simulating the finances of public pension funds.
  • The Human Rights Data Analysis Group uses R to quantify the impact of war.
  • The New York Times frequently uses R to create infographics and interactive data journalism applications.
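For readers who haven’t seen the language, here is a minimal, hypothetical sketch of the kind of statistical analysis R was built for, using only the built-in mtcars dataset and the standard lm() function; it isn’t drawn from any of the organizations listed above.

    # Fit a simple linear model on a built-in dataset: predict fuel economy (mpg)
    # from vehicle weight (wt) and horsepower (hp).
    fit <- lm(mpg ~ wt + hp, data = mtcars)

    summary(fit)   # coefficient estimates, standard errors and R-squared

    # Predict fuel economy for a hypothetical 3,000-lb, 150-hp car
    predict(fit, newdata = data.frame(wt = 3, hp = 150))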

Dispelling Myth No. 2

The second myth, that big data is hype with no clear economic benefits, is also easy to disprove. The fastest-growing sectors of the global economy are enabled by big data technologies. Mobile and social services would be impossible today without big data infrastructure built on open-source software. (Google’s search and advertising businesses were built on top of data science applications running on open-source software.)

New business models based on emerging disciplines, such as additive manufacturing (3D printing), rapid prototyping, predictive maintenance, driverless cars, geographic information systems (GIS) and the Internet of Things, are also highly dependent on big data analytics and low-cost storage capabilities. Indeed, the entire cloud industry wouldn’t be possible without open-source-enabled big data, as RedMonk analyst Stephen O’Grady writes:

“Unlike prior eras in which industry players lacking technical competencies effectively outsourced the job of software creation to third party commercial software organizations, companies like Amazon, Facebook and Google looked around and quickly determined that help was not coming from that direction – and even if it did, the economics of traditional software licensing would be a non-starter in scale-out environments.”

These new industries will generate more than $100 billion in new revenue in 2016 alone. They will also greatly accelerate the creation of new and even larger data sets – which means that big data will be getting even bigger.

Make no mistake: Big data is not a trend, fad or flash in the pan. Prominent investors in big data technologies include Microsoft, GE, IBM, Intel, Goldman Sachs, Greylock Partners, Sequoia Capital and Accel Partners. Clearly, they believe the revolution is still in its early stages, and they’re betting that big data will become synonymous with big profits.