Google launches a 9 exaflop cluster of Cloud TPU v4 pods into public preview

At its I/O developer conference, Google today announced the public preview of a full cluster of Google Cloud’s new Cloud TPU v4 Pods.

Google’s fourth iteration of its Tensor Processing Units launched at last year’s I/O, and a single TPU v4 pod consists of 4,096 of these chips. Each chip has a peak performance of 275 teraflops, and each pod promises up to 1.1 exaflops of combined compute power. Google now operates a full cluster of eight of these pods in its Oklahoma data center, with up to 9 exaflops of peak aggregate performance. Google believes this makes it “the world’s largest publicly available ML hub in terms of cumulative computing power, while operating at 90% carbon-free energy.”
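Those figures are consistent with straightforward peak-FLOPS arithmetic. As a quick sanity check (an illustrative back-of-the-envelope calculation, not an official Google figure):

```python
# Back-of-the-envelope check of the headline peak-performance numbers.
chips_per_pod = 4096
peak_tflops_per_chip = 275

# Teraflops -> exaflops (1 exaflop = 1,000,000 teraflops).
pod_exaflops = chips_per_pod * peak_tflops_per_chip / 1_000_000
print(f"per pod: {pod_exaflops:.2f} exaflops")      # ~1.13, marketed as "up to 1.1"

cluster_exaflops = 8 * pod_exaflops
print(f"cluster: {cluster_exaflops:.2f} exaflops")  # ~9.01, the headline "9 exaflops"
```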

“We have done extensive research to compare ML clusters that are publicly disclosed and publicly available (meaning — running on Cloud and available for external users),” a Google spokesperson told me when I asked the company to clarify its benchmark. “Those clusters are powered by supercomputers that have ML capabilities (meaning that they are well suited for ML workloads such as NLP, recommendation models, etc.). The supercomputers are built using ML hardware — e.g., GPUs (graphics processing units) — as well as CPU and memory. With 9 exaflops, we believe we have the largest publicly available ML cluster.”

At I/O 2021, Google’s CEO Sundar Pichai said that the company would soon have “dozens of TPU v4 pods in our data centers, many of which will be operating at or near 90% carbon-free energy. And our TPU v4 pods will be available to our cloud customers later this year.” Clearly, that took a bit longer than planned, but we are in the middle of a global chip shortage and these are, after all, custom chips.

Ahead of today’s announcement, Google worked with researchers to give them access to these pods. “Researchers liked the performance and scalability that TPU v4 provides with its fast interconnect and optimized software stack, the ability to set up their own interactive development environment with our new TPU VM architecture and the flexibility to use their preferred frameworks, including JAX, PyTorch or TensorFlow,” Google writes in today’s announcement. No surprise there. Who doesn’t like faster machine learning hardware?
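For a sense of what that interactive development environment looks like in practice, here is a minimal sketch of how a researcher might confirm that a JAX session sees the TPU chips from a Cloud TPU VM (a hypothetical session; it assumes JAX with TPU support is installed, as it is on Cloud TPU VM images):

```python
import jax
import jax.numpy as jnp

# On a Cloud TPU VM, jax.devices() enumerates the attached TPU cores.
devices = jax.devices()
print(f"{len(devices)} devices of kind {devices[0].device_kind}")

# A trivial computation dispatched to the default TPU device to confirm
# the stack works end to end.
x = jnp.arange(8.0)
print(jnp.dot(x, x))
```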

Google says users will be able to slice and dice the new Cloud TPU v4 cluster and its pods to meet their needs, whether that’s access to four chips (the minimum for a TPU virtual machine) or thousands of them, though not too many, either, since there are only so many chips to go around.

As of now, these pods are only available in Oklahoma. “We have run an extensive analysis of various locations and determined that Oklahoma, with its exceptional carbon-free energy supply, is the best place to host such a cluster. Our customers can access it from almost anywhere,” a spokesperson explained.

"Read