Google is launching a couple of updates to its cloud-based big data products at the Hadoop Summit in Brussels today. These include the launch of the open beta of Cloud Dataflow, Google’s new service for processing huge amounts of data, as well as an update to BigQuery, which makes the company’s big data database service available in Google’s European data centers and introduces row-level permissions.
Cloud Dataflow made its debut during Google’s annual developer conference last June. Until today, however, the service remained in private alpha. Now, any developer who is interested in giving the service a try can start using it, but because this is still a beta product, there is no SLA available yet.
As Google’s director of product management Tom Kershaw tells me, Google’s philosophy with regard to big data is to take away as much complexity as possible. “What the industry has suffered from in the past is that big data is very difficult to work with,” he noted. Businesses are starting to understand that there is a lot of value in all the data they produce, but Kershaw argues that developers are still having a hard time using the tools they need to work with all of this information. “This has to be democratized,” he said. “We’ve taken our big data portfolio and made it much easier to use.”
Cloud Dataflow, which can process data both as streams and in batches, automatically scales according to the developer’s needs, for example (though it’s worth noting that Google plans to implement some controls so cost can’t get out of hand when a developer pushes more data through the system than necessary). Developers write their Cloud Dataflow code once and then Google handles all the infrastructure for them.
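The “write once, run over batches or streams” idea can be illustrated with a short sketch. This is not the actual Dataflow SDK, just a conceptual Python example of the same pipeline logic consuming both a finite list and a generator that yields records over time (the function name and sample data are invented for illustration):

```python
# Conceptual sketch (not the Dataflow SDK): the same pipeline logic
# runs unchanged over a finite batch and over a stream of records.

def count_errors(records):
    """Pipeline step: count log lines that mention an error."""
    return sum(1 for line in records if "ERROR" in line)

# Batch mode: a fixed list of log lines, processed all at once.
batch = ["ERROR disk full", "INFO ok", "ERROR timeout"]
print(count_errors(batch))  # 2

# Streaming mode: a generator that yields lines as they arrive;
# the pipeline code itself is identical.
def stream():
    for line in ["INFO start", "ERROR crash"]:
        yield line

print(count_errors(stream()))  # 1
```

In the real service, Google’s infrastructure decides how to distribute and scale that logic; the developer only writes the transformation once.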
While Cloud Dataflow is new, BigQuery has been around since 2010. Starting today, however, its users can also host their data in Google’s European data centers. Kershaw tells me a lot of Google’s customers have been asking for this. Given the concerns around data sovereignty in Europe, it’s actually surprising that Google didn’t roll this feature out earlier.
The other update to BigQuery is that the database now supports row-level permissions. That may sound like a minor update, but Kershaw rightly argues that it’s actually a very important new feature.
In many companies, different departments now need to access the same data, but while the marketing department may need to work with some of the tables in your database, for example, you may not want to give it access to sensitive business data. Today, this typically means IT will make a copy of the data and share this copy with another department. Once you’ve made this copy, though, the different datasets are out of sync. “When we create datasets that are copied, we create outdated data — and wrong data,” Kershaw said. “You end up running analytics on this data and it’s wrong and old.” With row-level permissions, administrators can now ensure that different departments only get access to the information they need without the drawbacks of making copies.
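The difference between copying data and filtering it per role can be sketched in a few lines. This is a hypothetical illustration of the concept, not BigQuery’s API: each query runs against the single live table, and a per-role predicate (the policy names and sample rows are invented) decides which rows the caller may see, so no stale copies are created:

```python
# Hypothetical illustration of row-level permissions: queries are
# filtered by the caller's role against one live table, instead of
# handing each department its own (soon-outdated) copy.

ORDERS = [
    {"region": "EU", "revenue": 100, "cost": 60},
    {"region": "US", "revenue": 200, "cost": 120},
]

# Role -> predicate deciding which rows that role may read.
POLICIES = {
    "marketing": lambda row: row["region"] == "EU",  # EU rows only
    "finance":   lambda row: True,                   # all rows
}

def query(role, table=ORDERS):
    """Return only the rows the given role is permitted to read."""
    allowed = POLICIES[role]
    return [row for row in table if allowed(row)]

print(query("marketing"))       # only the EU row
print(len(query("finance")))    # 2
```

Because every department queries the same underlying table, an update to a row is immediately visible to everyone who is allowed to see it.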
With this update, BigQuery can now also ingest up to 100,000 rows per second per table. That’s a lot of information, but not an unusual number when you analyze huge log files — something that’s become a major use case for BigQuery, Kershaw tells me.
Google’s big data lineup currently consists of BigQuery, Cloud Dataflow and the messaging service Cloud Pub/Sub. Given Google’s interest and in-house expertise in this area, chances are we will see more updates and new tools for working with large amounts of data at Google I/O next month.