Two months ago, Facebook’s AI Research Lab (FAIR) published some impressive training times for massively distributed visual recognition models. Today IBM is firing back with some numbers of its own. IBM’s research group says it was able to train ResNet-50 on 1,000 classes in 50 minutes across 256 GPUs — which is effectively just the polite way of saying “my model trains faster than your model.” Facebook noted that with Caffe2 it was able to train a similar ResNet-50 model in one hour on 256 GPUs using an 8k mini-batch approach.
This would be a natural moment to question why any of this matters in the first place. Distributed processing is a big sub-field of AI research, but it’s also quite arcane. Deep learning jobs are often so computationally heavy that they are handled most efficiently across a large number of GPUs instead of just a single one.
But as you add more GPUs, training time doesn’t naturally scale down in proportion. For example, you might assume that if it took two minutes to train with one GPU it would take one minute to train with two GPUs. In the real world it doesn’t work like this, because there is a communication cost to splitting up and recombining complex quantitative operations across devices.
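A toy model makes the intuition concrete. The sketch below (with entirely hypothetical numbers, not IBM’s or Facebook’s measurements) assumes compute divides evenly across GPUs while a fixed communication cost does not, which is why doubling the GPUs gets you less than double the speed:

```python
def training_time(single_gpu_minutes, n_gpus, comm_overhead_minutes):
    """Toy model: compute divides across GPUs; communication cost does not."""
    # Overhead only applies once the job is actually split across GPUs.
    return single_gpu_minutes / n_gpus + comm_overhead_minutes * (n_gpus > 1)

def scaling_efficiency(single_gpu_minutes, n_gpus, comm_overhead_minutes):
    """Actual speedup divided by the ideal (perfectly linear) speedup."""
    actual_speedup = single_gpu_minutes / training_time(
        single_gpu_minutes, n_gpus, comm_overhead_minutes
    )
    return actual_speedup / n_gpus

# Two minutes on one GPU plus a hypothetical 0.2 minutes of communication
# overhead: two GPUs finish in 1.2 minutes, not the "ideal" 1 minute.
print(training_time(2.0, 2, 0.2))       # 1.2
print(scaling_efficiency(2.0, 2, 0.2))  # ~0.83, i.e. 83% of ideal scaling
```

Claims like IBM’s are essentially about pushing that efficiency number as close to 1.0 as possible at very large GPU counts.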
What IBM is promising is the most efficient distributed deep learning library for breaking up a giant deep learning problem into hundreds of smaller deep learning problems. This all might seem petty in the context of a single compute job, but remember that companies like IBM and Facebook are training models all day, every day for millions of customers. Every major tech company has a stake in this, but it’s often tough to compare results companies promise because of the sheer number of variables in any research effort.
Now, you might question whether obsessing over incremental gains in distributed scaling efficiency will stay meaningful for long — and you’d be right to. Hillery Hunter, director of systems acceleration and memory at IBM Research, tells me that everyone is getting really close to optimal.
“You have gotten about as much as you can out of the system and so we believe we are close to optimal. The question is really the rate at which we keep seeing improvements and whether we are still going to see improvements in the overall learning times.”
IBM didn’t stop with just the ResNet-50 results. The company continued the work, testing distributed training on ResNet-101, a much larger and more complex visual recognition model. The team says it was able to train ResNet-101 on the ImageNet-22k data set with 256 GPUs in seven hours, a fairly impressive time for the challenge.
“This also benefits folks running on smaller systems,” Hunter added. “You don’t need 256 GPUs and 64 systems to get the benefits.”
The deep learning library plays well with the major open-source deep learning frameworks, including TensorFlow, Caffe and Torch. Everything will be available via PowerAI if you want to try things out for yourself.