Spark fragmentation undermines community

The Hadoop distribution war comes down to a final battle between Cloudera’s CDH and Hortonworks’ HDP. That wasn’t always the case.

At the peak of the market’s fragmentation, many companies offered Hadoop distributions in one form or another. These included Amazon (AWS), Cloudera, Hortonworks, IBM, MapR, Pivotal, Teradata, Intel and Microsoft (Azure).

Competition is a natural part of business, and the tech industry is no exception. Indeed, it’s competition that puts the best possible products in end users’ hands. However, the picture gets a little muddy when it comes to open source.

Unlike proprietary products, which are expected to operate as their own little islands, open-source projects are supposed to play nicely together. One of the advantages cited by those trying to sell open-source products to the enterprise is the degree to which they are “plug and play”: there is no vendor lock-in. If you’re not happy with an open-source product’s performance, it’s supposed to be a relatively painless process to rip out the product and replace it with something else. The problem with that argument for the Hadoop ecosystem is that not everyone is referring to the same thing when they say “Hadoop.”

Hadoop isn’t a single uniform product; it’s a framework made up of a series of modules, such as HDFS and MapReduce. The term can also refer to the wider ecosystem of additional software packages that work alongside Hadoop, such as Apache Hive, Pig and Spark. For customers, that means what they think of as “Hadoop” is unlikely to be as easily replaceable as they assume.

HDP and CDH don’t cover exactly the same set of packages, and even where they overlap, they often ship different versions. In June of last year Derek Wood, a DevOps engineer at Cask, wrote a blog post showing which versions of various software packages were supported by which versions of HDP and CDH. Suffice it to say, it’s a lot to keep track of. At some level, this “versionitis” is a betrayal of what open source in general, and Hadoop in particular, are supposed to stand for.
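To give a flavor of what that bookkeeping looks like in practice, here is a minimal build.sbt sketch of a single codebase that must be compiled against each vendor’s own patched Spark artifacts. The version strings, repository URLs and the distro switch are illustrative assumptions, not taken from Wood’s post:

```scala
// build.sbt sketch (illustrative): one codebase, two distribution-specific
// builds. The vendor version strings below are assumptions that show the
// general format each vendor uses, not an authoritative mapping.
val distro = sys.props.getOrElse("distro", "cdh") // build with -Ddistro=cdh or -Ddistro=hdp

val sparkVersion = distro match {
  case "cdh" => "1.5.0-cdh5.5.1"     // Cloudera-patched Spark artifact
  case "hdp" => "1.4.1.2.3.2.0-2950" // Hortonworks-patched Spark artifact
  case other => sys.error(s"unknown distro: $other")
}

// Each vendor publishes its patched artifacts to its own repository.
resolvers ++= Seq(
  "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
  "Hortonworks" at "https://repo.hortonworks.com/content/repositories/releases/"
)

libraryDependencies += "org.apache.spark" %% "spark-core" % sparkVersion % "provided"
```

Every distribution-specific suffix in those version strings is a separate build that someone’s team has to track, test and upgrade.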

Over the course of the past year I’ve become increasingly concerned that the Apache Spark ecosystem will go the way of Hadoop before it. Although Apache Spark is just four years old, we’re already at the point where a few vendors are looking to sell Apache Spark to customers in divergent formats.

Despite their protestations to the contrary, these companies are essentially dividing the Apache Spark community by forcing customers onto their particular Apache Spark version and components. For example, while CDH 5.5.1 supported Apache Spark 1.5.0, the contemporaneous HDP 2.3.2.0 supported Apache Spark 1.4.1. Even worse, there is an increasing trend toward building features for particular distributions that are never committed back upstream.
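The practical cost of that skew is application code that has to defend itself against whichever version it lands on. Below is a minimal, hypothetical sketch of the kind of gating logic teams end up writing so one application can run on both of those clusters; sc.version is a real Spark API, but the branch contents are placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical sketch: gate newer-API code paths on the Spark version
// found at runtime, since the version depends on the distribution
// (e.g., 1.5.0 on CDH 5.5.1 vs. 1.4.1 on HDP 2.3.2.0).
object VersionGate {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("version-gate")
      .setIfMissing("spark.master", "local[*]") // lets the sketch run outside spark-submit

    val sc = new SparkContext(conf)

    // SparkContext.version reports the running Spark version, e.g. "1.5.0".
    val Array(major, minor) = sc.version.split("\\.").take(2).map(_.toInt)

    if (major > 1 || (major == 1 && minor >= 5)) {
      // Safe to use features introduced in Spark 1.5.
      println(s"Spark ${sc.version}: taking the 1.5+ code path")
    } else {
      // Fall back to an implementation that works on 1.4.x.
      println(s"Spark ${sc.version}: taking the 1.4-compatible fallback")
    }

    sc.stop()
  }
}
```

None of that logic has anything to do with the problem the application is solving; it exists purely to paper over the fragmentation the vendors have introduced.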

People can, and should, build companies on top of open-source products (I work for one that does). However, we have a responsibility to our community and our customers to make sure that we stay true to the promise of open source. When commercial distributions are offered to customers, they should stay as close to the core as possible, contribute everything back upstream and support the latest module versions.

Problems inevitably arise when open-source projects built in academic settings (UC Berkeley’s AMPLab in the case of Apache Spark) or for internal use (think Hadoop at Yahoo) become the domains of large publicly traded companies, or VC-funded companies looking to make their way onto public markets. Open-source ideals are frequently sacrificed on the altar of creating an easily packaged product that can be sold to generate short-term profits.

But in the long term, that’s not defensible. Software is becoming a commodity that will only get cheaper to produce as time goes on. Apache Spark itself is under attack from newer open-source technologies that threaten to eat into its value proposition: Apache Kafka offers a new way to handle streaming data, Apache Flink offers an alternative “big data” processing framework and Apache Apex unifies stream and batch processing.

Only a large, growing and dedicated community can keep Apache Spark relevant in the long term. Engineers will move on to something else if they find Apache Spark too much of a pain to use because of the difficulty inherent in navigating the balkanized environment that vendors have created. And it will be the vendors themselves that suffer. Let’s not let that happen.