Flexible expressions could lift 3D-generated faces out of the uncanny valley

3D-rendered faces are a big part of any major movie or game now, but the task of capturing and animating them in a natural way can be a tough one. Disney Research is working on ways to smooth out this process, among them a machine learning tool that makes it much easier to generate and manipulate 3D faces without dipping into the uncanny valley.

Of course this technology has come a long way from the wooden expressions and limited details of earlier days. High-resolution, convincing 3D faces can be animated quickly and well, but the subtleties of human expression are not just limitless in variety, they’re very easy to get wrong.

Think of how someone’s entire face changes when they smile — it’s different for everyone, but there are enough similarities that we fancy we can tell when someone is “really” smiling or just faking it. How can you achieve that level of detail in an artificial face?

Existing “linear” models simplify the subtlety of expression, making “happiness” or “anger” minutely adjustable, but at the cost of accuracy: they can’t express every possible face, yet they can easily produce impossible ones. Newer neural models learn complexity from watching the interconnectedness of expressions, but like other such models their workings are obscure and difficult to control, and they may not generalize beyond the faces they learned from. Either way, these approaches don’t give an artist working on a movie or game the level of control they need, or they produce faces that are just off somehow (humans are remarkably good at detecting this).

A team at Disney Research proposes a new model with the best of both worlds — what it calls a “semantic deep face model.” Without getting into the exact technical execution, the basic improvement is that it’s a neural model that learns how a facial expression affects the whole face, but is not specific to a single face — and moreover is nonlinear, allowing flexibility in how expressions interact with a face’s geometry and each other.

Think of it this way: A linear model lets you take an expression (a smile or a kiss, say) from 0 to 100 on any 3D face, but the results may be unrealistic. A neural model lets you take a learned expression from 0 to 100 realistically, but only on the face it learned it from. This model can take an expression from 0 to 100 smoothly on any 3D face. That’s something of an oversimplification, but you get the idea.
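To make the "linear" half of that comparison concrete, here is a minimal sketch of one common form of linear face model, the blendshape: a neutral mesh plus weighted per-vertex offsets. This is a generic illustration and not Disney Research's method; the `blend_linear` function, the two-vertex "mouth," and the `smile` and `kiss` offsets are all hypothetical toy values.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]


def blend_linear(
    neutral: List[Vertex],
    deltas: Dict[str, List[Vertex]],
    weights: Dict[str, float],
) -> List[Vertex]:
    """Return neutral + sum of weight * delta for each named expression."""
    result = [list(v) for v in neutral]
    for name, weight in weights.items():
        for i, offset in enumerate(deltas[name]):
            for axis in range(3):
                result[i][axis] += weight * offset[axis]
    return [tuple(v) for v in result]


# Hypothetical toy "mouth": just the left and right corner vertices.
neutral = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

# Hypothetical per-vertex offsets at full strength (weight 1.0).
deltas = {
    "smile": [(-0.5, 0.5, 0.0), (0.5, 0.5, 0.0)],   # corners out and up
    "kiss": [(0.75, 0.0, 0.25), (-0.75, 0.0, 0.25)],  # corners in and forward
}

# Dialing a single expression is smooth and predictable.
half_smile = blend_linear(neutral, deltas, {"smile": 0.5})

# Combining expressions simply sums the offsets; nothing in the model
# checks whether a full smile and a full kiss can coexist on a real face.
both = blend_linear(neutral, deltas, {"smile": 1.0, "kiss": 1.0})

print(half_smile)
print(both)
```

Because each expression contributes independently, the dial is easy to control, but combinations can land on shapes no real face would make. That is the trade-off a nonlinear model is meant to address.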

Computer-generated faces all assume similar expressions in a row.

Image Credits: Disney Research

The results are powerful: You could generate a thousand faces with different shapes and tones, and then animate all of them with the same expressions without any extra work. Think how that could result in diverse CG crowds you can summon with a couple of clicks, or characters in games that have realistic facial expressions regardless of whether they were hand-crafted.

It’s not a silver bullet, and it’s only part of a huge set of improvements artists and engineers are making in the various industries where this technology is employed — markerless face tracking, better skin deformation, realistic eye movements and dozens more areas of interest are also important parts of this process.

The Disney Research paper was presented at the International Conference on 3D Vision; you can read the full thing here.
