Today, to kick off Spark Summit, Databricks announced a Serverless Platform for Apache Spark, welcome news for developers looking to reduce time spent on cluster management. Simplifying the developer experience is set to be a major theme of the event overall. In addition to Serverless, the company also introduced Deep Learning Pipelines, a library that makes it easy to mix deep learning frameworks with Spark.
If you haven’t been following the latest developments in cloud-based data processing, Databricks is the commercial manifestation of the open-source Apache Spark project. The company’s engineers spend their days building tools to support the Spark ecosystem, like those being announced today.
As data becomes a larger part of decision making within large enterprises, new users are facing the daunting challenge of dealing with data pipelines and cloud infrastructure. Counter to what you might think, serverless doesn’t literally mean that data manipulation occurs without servers. It merely means that users can accomplish tasks by drawing from one managed pool of computing resources, so individual users no longer have to perform low-level tinkering.
“SQL is stateless so it isn’t hard to work with, but making data science serverless is hard because it has states,” Databricks CEO Ali Ghodsi explained to me in an interview.
If Serverless is Databricks’ attempt at breadth, Deep Learning Pipelines is the company’s attempt at depth. I still wouldn’t say that TensorFlow and other comparable deep learning frameworks are “easy” to use, but they’re a heck of a lot easier to use than LISP. This has made deep learning part of an increasing number of workflows, even if its use isn’t quite commonplace yet.
“If you want to distribute TensorFlow, you have to construct graphs manually and direct what goes to what machine,” added Ghodsi. “That’s really hard if you want to run on 100 machines.”
Databricks’ new open-source library enables developers to convert deep learning models into SQL functions. Users can perform transfer learning with Spark MLlib Pipelines and reap the benefits of distributed computing with Spark.
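To make the idea of a model exposed as a SQL function a little more concrete, here is a minimal sketch that uses Python’s built-in sqlite3 module rather than Spark or the Deep Learning Pipelines library itself. Everything in it is an assumption for illustration: the `predict` function, its fixed weights, and the `samples` table are hypothetical stand-ins for a trained model and real data. The only point is the pattern, in which a scoring function is registered once and anyone who can write SQL can then call it.

```python
import sqlite3

# Hypothetical stand-in for a trained model: a fixed linear scorer.
# A real deep learning model would be loaded and invoked here instead.
WEIGHTS = (0.5, -0.25)
BIAS = 0.1


def predict(x1, x2):
    """Score a single row; the database calls this once per row."""
    return WEIGHTS[0] * x1 + WEIGHTS[1] * x2 + BIAS


conn = sqlite3.connect(":memory:")

# Register the Python function under the SQL name "predict" (2 arguments).
conn.create_function("predict", 2, predict)

# Hypothetical table of feature rows.
conn.execute("CREATE TABLE samples (id INTEGER, x1 REAL, x2 REAL)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?, ?)",
    [(1, 2.0, 4.0), (2, 0.0, 0.0), (3, 1.0, -2.0)],
)

# The "model" is now callable from plain SQL.
rows = conn.execute(
    "SELECT id, predict(x1, x2) AS score FROM samples ORDER BY id"
).fetchall()
print(rows)
```

In Spark the same pattern would run across a cluster instead of a single in-memory database, which is where the distribution benefits Ghodsi describes would come in.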
Lastly, Ghodsi noted that Databricks’ Structured Streaming is now generally available. The API supports the processing of sequential data streams. The company says that it prioritized minimizing latency during the Structured Streaming development process. Lower latency ultimately has both cost and speed implications for customers dealing with problems like anomaly detection.
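For readers who want a feel for what anomaly detection over a sequential stream involves, here is a rough, single-machine sketch in plain Python. It does not use the Structured Streaming API; the `detect_anomalies` function, its `threshold` and `warmup` parameters, and the sample readings are all assumptions made for illustration. It processes events one at a time and carries a small amount of state between them (a running mean and variance, computed with Welford’s online algorithm).

```python
import math


def detect_anomalies(stream, threshold=3.0, warmup=5):
    """Yield (index, value) for readings far from the running mean.

    State is updated incrementally with Welford's online algorithm,
    so each event is seen once and only three numbers are kept.
    """
    count, mean, m2 = 0, 0.0, 0.0
    for i, x in enumerate(stream):
        if count >= warmup:
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(x - mean) > threshold * std:
                yield i, x
                continue  # keep flagged outliers out of the baseline
        # Fold the new reading into the running statistics.
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)


# Hypothetical sensor readings with one obvious spike.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 42.0, 10.1, 9.9]
anomalies = list(detect_anomalies(readings))
print(anomalies)
```

The sooner each event is processed, the sooner a spike like the one above can be flagged, which is why latency matters for this class of problem.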