AWS launches Trainium, its new custom ML training chip

At its annual re:Invent developer conference, AWS today announced the launch of AWS Trainium, the company’s next-gen custom chip dedicated to training machine learning models. The company promises that it can offer higher performance than any of its competitors in the cloud, with support for TensorFlow, PyTorch and MXNet.

It will be available as EC2 instances and inside Amazon SageMaker, the company’s machine learning platform.

New instances based on these custom chips will launch next year.

The main arguments for these custom chips are speed and cost. AWS promises 30% higher throughput and 45% lower cost-per-inference compared to the standard AWS GPU instances.

In addition, AWS is partnering with Intel to launch Habana Gaudi-based EC2 instances for machine learning training. Coming next year, these instances promise to offer up to 40% better price/performance compared to the current set of GPU-based EC2 instances for machine learning. These chips will support TensorFlow and PyTorch.

These new chips will make their debut in the AWS cloud in the first half of 2021.

Both of these new offerings complement AWS Inferentia, which the company launched at last year’s re:Invent. Inferentia is the inferencing counterpart to these machine learning pieces, which also uses a custom chip.

Trainium, it’s worth noting, will use the same SDK as Inferentia.

“While Inferentia addressed the cost of inference, which constitutes up to 90% of ML infrastructure costs, many development teams are also limited by fixed ML training budgets,” the AWS team writes. “This puts a cap on the scope and frequency of training needed to improve their models and applications. AWS Trainium addresses this challenge by providing the highest performance and lowest cost for ML training in the cloud. With both Trainium and Inferentia, customers will have an end-to-end flow of ML compute from scaling training workloads to deploying accelerated inference.”