Google Launches Cloud Dataflow, A Managed Data Processing Service

Google expanded its Cloud Platform today with a new managed service called Cloud Dataflow that allows developers to create data pipelines to help them ingest, transform and, most importantly, analyze data. Developers can use the service both to work with streaming real-time data and to process batches of data they upload to the system.

For now, the service is in private beta and it’s unclear how Google will price Dataflow once it is launched to the public. At its core, Cloud Dataflow is Google’s successor to MapReduce, which has been an experimental App Engine feature for quite a while now.

Google says Dataflow is based on a number of technologies it has been using internally, including Flume and MillWheel. The first Cloud Dataflow SDK is for Java, and Google is also providing a dashboard for monitoring these pipelines right from the developer console.

The focus here, according to Google, is to help its users get “actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure.”

Because this is a private beta, Google isn’t publishing any throughput numbers just yet. In its streaming mode, the service will be able to ingest virtually any kind of data; in its batch mode, it will accept newline-delimited text files, BigQuery tables and similar data.
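
To make the ingest, transform and analyze steps concrete, here is a minimal sketch of a batch pipeline over newline-delimited text. It is written in plain Python and does not use the Cloud Dataflow SDK, which is Java-only and in private beta; the function names (`ingest`, `transform`, `analyze`, `run_pipeline`) and the sample data are assumptions made up for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator


def ingest(lines: Iterable[str]) -> Iterator[str]:
    """Ingest step: yield non-empty records from newline-delimited text."""
    for line in lines:
        record = line.strip()
        if record:
            yield record


def transform(records: Iterable[str]) -> Iterator[str]:
    """Transform step: split each record into lowercase words."""
    for record in records:
        for word in record.lower().split():
            yield word


def analyze(words: Iterable[str]) -> Counter:
    """Analyze step: count how often each word occurs."""
    return Counter(words)


def run_pipeline(lines: Iterable[str]) -> Counter:
    # ingest and transform are lazy generators that handle one record at a
    # time; analyze consumes the whole input, so this sketch only covers a
    # finite batch, not an unbounded stream.
    return analyze(transform(ingest(lines)))


# Hypothetical batch input; an open text file would work the same way.
sample_batch = ["goal goal", "", "Penalty goal"]
word_counts = run_pipeline(sample_batch)
```

A managed service would presumably take care of distributing these steps across machines and scaling them; the sketch only shows the shape of the pipeline, not how Dataflow executes it.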

With this service, Google closes a major hole in its Cloud Platform lineup. For quite a while now, Amazon has offered its own data pipeline service, and at its developer conference last November, it launched Kinesis, a service that specializes in real-time data processing.

Previously, Google’s focus in this area had mostly been on MapReduce and BigQuery. Google says BigQuery is complementary to Dataflow. Developers can use Dataflow as part of the data ingestion into BigQuery, for example, by preparing or filtering the data for BigQuery. Once the data is cleaned, it can be written to BigQuery, where it becomes immediately accessible. At the same time, though, Dataflow can be used to read from BigQuery in case you want to join data from your database with other data sources. And to complete the cycle, you can then write all of this back to BigQuery, too, of course.
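
The clean-then-join cycle described here can be sketched roughly as follows. This is plain Python with in-memory lists standing in for BigQuery tables; it calls neither the Dataflow SDK nor the BigQuery API, and the field names (`user_id`, `amount`, `country`) and helper functions are hypothetical.

```python
from typing import Dict, Iterable, List


def clean(rows: Iterable[dict]) -> List[dict]:
    """Keep only rows that have a non-empty user_id and a numeric amount."""
    return [
        row for row in rows
        if row.get("user_id") and isinstance(row.get("amount"), (int, float))
    ]


def join_on_user(rows: Iterable[dict], profiles: Dict[str, dict]) -> List[dict]:
    """Inner join: attach profile fields to each row with a matching user_id."""
    joined = []
    for row in rows:
        profile = profiles.get(row["user_id"])
        if profile is not None:
            joined.append({**row, **profile})
    return joined


# Hypothetical raw events, standing in for data headed into a warehouse table.
raw_events = [
    {"user_id": "u1", "amount": 12.5},
    {"user_id": "", "amount": 3.0},      # dropped: missing user_id
    {"user_id": "u2", "amount": "n/a"},  # dropped: non-numeric amount
    {"user_id": "u3", "amount": 7},
]
# Hypothetical second data source keyed by user_id.
profiles = {"u1": {"country": "BR"}, "u3": {"country": "DE"}}

cleaned = clean(raw_events)                  # the "prepare or filter" step
enriched = join_on_user(cleaned, profiles)   # the "join with other sources" step
```

In the flow Google describes, `cleaned` would be written to BigQuery, read back later, joined with another source, and the `enriched` result written to BigQuery again.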

In a demo during today’s keynote, Google showed how its engineers, with the help of Twitter, used this service to do sentiment analysis around the World Cup by looking at millions of tweets.