IBM Pours Researchers And Resources Into Apache Spark Project

IBM today pledged it would devote 3500 researchers to the open source big data project, Apache Spark. It also announced that it was open sourcing its own IBM SystemML machine learning technology in a move designed to help push it to the forefront of big data and machine learning.

These two technologies are part of the IBM transformation strategy that includes cloud, big data, analytics and security as its pillars. As part of today’s announcement, IBM has pledged to build Spark into the core of its analytics products and will work with Databricks, the commercial entity created to support the open source Spark project.

Spark bills itself as a fast engine for processing big data projects.

“I like to think of Spark as the analytics operating system,” Rob Thomas, vice president of product development for IBM analytics, told TechCrunch.

“Our belief is anyone using data in the future is going to be leveraging Spark. It allows universal access to data,” he explained.

Thomas points out that Spark is the fastest growing open source project in history, and that got his company’s attention. They’ve been working with the Spark project for several years, but really began to pay more attention when it became a top level Apache project last year.

As for the Databricks component, IBM has been working with the company for a few months, Thomas reports. He says those conversations led IBM to commit its machine learning technology after it heard that machine learning was a weakness in the Apache project.

As IBM is wont to do in these situations, it’s going all in, not only committing the 3500 researchers, but also placing Spark as a service on its Bluemix Platform as a Service offering to allow its customers to build applications using Spark.

In addition, it’s working on Spark-related projects at more than a dozen labs, and plans to open a Spark Technology Center in San Francisco, designed to encourage the data science community to build applications on top of the Spark platform. (It really wants to spur Spark development.)

It wouldn’t be an IBM project without an education component, and this one is no exception: IBM announced it would partner with AMPLab, DataCamp, MetiStream, Galvanize and the Big Data University MOOC with the goal of training one million data scientists in Spark. That’s a worthy goal, but I’m wondering how many data scientists there are in the world.

IBM isn’t just giving all of these resources away out of largesse. It wants to be a part of this community because it sees these tools as the foundation for big data moving forward. If it can show itself to be a committed member of the open source project, it gains clout with companies that are working on big data and machine learning projects using open source tools, and that opens the door to consulting services and other business opportunities for Big Blue.

IBM has plenty of money, and it’s hoping that by committing its resources to open source projects like Spark and OpenStack, it can pave the way to new business in the future as it continues to try to redefine itself as a company.