On The Growth Of Apache Spark

Vaibhav Nivargi Contributor


Editor’s Note: Vaibhav Nivargi is the founder and chief architect of ClearStory Data, a data analytics service provider.

This week the fast-growing Apache Spark community is gathering in New York City to celebrate and collaborate on one of the most popular open source projects today.

Launched in U.C. Berkeley’s AMPLab in 2009, Apache Spark has begun to catch on like wildfire during the last year and a half. Spark had more than 465 contributors in 2014, making it the most active project in the Apache Software Foundation and among big data open source projects globally.

Early on, we bet on the cluster-computing platform ourselves, rather than building our own software from scratch.

Its in-memory, parallel processing can run programs up to 100X faster than Hadoop MapReduce in memory, or 10X faster on disk. This allows dozens of data sources to be blended and harmonized at once.

According to Gartner, 73 percent of organizations will invest in big data by 2016, yet for many, big data has so far fallen short of its promise. Spark is now widely adopted and was recently acknowledged in the 2014 Gray Sort Benchmark Daytona 100TB category, setting a new data sorting world record.

For dealing with big data, other benefits of working in Spark include its compatibility with Hadoop and an ability to make software code simpler to write through rich APIs across popular languages like Java, Python, Scala and SQL. It supports both structured and unstructured data, machine learning and data mining.

A fully integrated application of Spark opens the door for business leaders across sectors to run large, iterative workloads over their data sets in ways never imagined before. With this technology, we finally have the freedom to explore our data, even as the number of data islands in enterprises keeps growing.

Early Adopters By Sector

Early adopters of Spark by sector include consumer-packaged goods (CPG), insurance, media and entertainment, pharmaceuticals, retail and automotive: basically any industry where the focus is on the consumer.

Customer analytics in the CPG industry presents an ideal use case for Spark. Gaining insight into customers and their motivations is a top priority for CPG brand executives. Traditionally, most organizations are limited to siloed views of data from disparate sources that capture product and customer information. However, quickly understanding customer responses to in-store product placements, online and offline trends and location-based differences leads to a deeper understanding of the customer, and ultimately to higher sales.

Fast-cycle analysis and more rapid insights provide a near real-time view across the supply chain to maximize sales by location. A blend of disparate data sets from sources like ERP and supply chain systems, together with external data like Dun & Bradstreet, helps uncover a deeper understanding of consumers. With speedy access to, and convergence and analysis of, numerous private, public and premium data sources, brand managers gain an actionable, holistic view that lets them see daily insights immediately and make fast, collaborative decisions.

Similarly, in the data-driven healthcare and pharmaceuticals industry, faster and more holistic insights can speed the cycle from diagnosis to cure. Apache Spark lets users process large volumes of data without significant delay and correlate the data against systemic patterns to alert hospital caregivers to any diagnosis of a life-threatening condition. This early-warning system not only saves lives, but also reduces hospitals’ costs through savings on medication, lab tests and other expenses.

Although Spark is gaining a lot of attention, we must keep in mind that any open, distributed computing framework remains a complex beast. A purely Spark-based application requires a broad range of skills and a significant amount of detailed, hands-on work to create and maintain a complete solution to any particular set of problems.

Evolving the Spark project means that new innovations for enterprise data intelligence must focus on:

Digging Out Of The Data Hole

As we bring in more data from various sources, we create many silos, natural resting places for different types of information. There’s also an emerging reality of data lakes in the enterprise where “mounds” of data without context are dumped.

A purely Spark-based solution will not suffice in delivering on the promise of big data. Spark opens the door, but to truly deliver on the promise and speed of big data, companies must combine Spark on the back end with improved APIs, elastic scaling, job scheduling, workload management and similar capabilities.

By 2016, we anticipate that more enterprises across industry sectors will understand the value that Spark’s fast-cycle analysis delivers, as data-driven insights help transform for the better how we live and work as a society.

The companies and organizations that embrace the new capabilities enabled by data intelligence platforms built on Apache Spark will gain significant advantages: faster time to insight and the ability to compete more aggressively against their peers in the marketplace.