Deep Science: Tech giants unveil breakthroughs at computer vision summit

Computer vision summit CVPR has just (virtually) taken place, and like other CV-focused conferences, it produced quite a few interesting papers. More than I could possibly write up individually, in fact, so I’ve collected the most promising ones from major companies here.

Facebook, Google, Amazon and Microsoft all shared papers at the conference — and others too, I’m sure — but I’m sticking to the big hitters for this column. (If you’re interested in the papers deemed most meritorious by attendees and judges, the nominees and awards are listed here.)

Microsoft

Redmond has the most interesting papers this year, in my opinion, because they cover several nonobvious real-life needs.

One is documenting that shoebox we or perhaps our parents filled with old 3x5s and other film photos. Of course there are services that help with this already, but if photos are creased, torn, or otherwise damaged, you generally just get a high-resolution scan of that damage. Microsoft has created a system to automatically repair such photos, and the results look mighty good.

Image Credits: Microsoft

The problem is as much identifying the types of degradation a photo suffers from as it is fixing them. The solution is simple, write the authors: “We propose a novel triplet domain translation network by leveraging real photos along with massive synthetic image pairs.” Amazing no one tried it before!
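
Jargon aside, the “massive synthetic image pairs” part is easy to picture: take clean photos and programmatically damage them so a network can learn to undo the damage. Below is a minimal sketch of that idea in Python; it’s my own illustration, with made-up damage parameters and a hypothetical file name, not Microsoft’s actual pipeline.

```python
# Illustrative sketch only: creating synthetic (damaged, clean) training pairs
# by programmatically degrading clean photos. Not Microsoft's actual pipeline;
# the damage parameters and file name are made up.
import numpy as np
from PIL import Image, ImageDraw, ImageEnhance

def degrade(photo: Image.Image, rng: np.random.Generator) -> Image.Image:
    """Apply film-photo-style damage: faded colors, grain and crack-like lines."""
    # Fade colors and contrast to mimic an aged print.
    faded = ImageEnhance.Color(photo).enhance(0.5)
    faded = ImageEnhance.Contrast(faded).enhance(0.8)

    # Add grain (Gaussian noise).
    arr = np.asarray(faded).astype(np.float32)
    arr += rng.normal(0, 12, arr.shape)
    damaged = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

    # Draw a few random pale lines to imitate cracks and creases.
    draw = ImageDraw.Draw(damaged)
    w, h = damaged.size
    for _ in range(int(rng.integers(2, 6))):
        start = (int(rng.integers(0, w)), int(rng.integers(0, h)))
        end = (int(rng.integers(0, w)), int(rng.integers(0, h)))
        draw.line([start, end], fill=(235, 235, 225), width=int(rng.integers(1, 4)))
    return damaged

# Each clean scan then yields a (damaged, clean) pair a restoration model can learn from.
rng = np.random.default_rng(0)
clean = Image.open("clean_photo.jpg").convert("RGB")  # hypothetical input file
training_pair = (degrade(clean, rng), clean)
```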

The methods may be Greek to anyone not in the field, but the quality of the results is obvious. Cracks, grain, faded colors and other problems disappear like magic. Millions of people would love to put this to use. If Microsoft doesn’t make this into a product by the end of the year and give the authors a bonus, there’s no hope left for the company. You can learn more about the project here.

Face swapping is a fun way to apply computer vision to photos, but as anyone who’s used one of the apps or filters can tell you, the results are often hilarious because of their flaws, not their fidelity. One of the limitations is that bangs, glasses and other face-obscuring items tend to freak the algorithms out and produce weird half-merged imagery.

Image Credits: Microsoft Research Asia

Microsoft’s next piece of work (PDF) creates a much-improved face-swapping system that better understands how to extract and apply relevant features without disrupting the target image’s lighting, hairstyle or other items that tend to confuse such algorithms.

Techniques like this are fun to play with, of course, but digital face swapping is also used frequently in TV and film, so the more work that can be automated, the less pixel-peeping the effects team has to do.

The third bit of work I want to highlight from Microsoft’s CVPR papers is an improvement on the (currently still rather rudimentary) machine translation of sign language. This is a very difficult problem for several reasons, only one of which is that the language takes place in three dimensions, making accurate detection a technical problem of its own.

But meaning in sign language doesn’t map 1:1 onto English words and grammar. That makes translating directly from signs to text difficult — there may not be an English word for a given sign, or individual signs may each indicate one word but as a gestalt convey a different meaning. Translators “gloss” sign language into written form, but the result isn’t grammatical English: you get something like “I walk grocery and fast see bike friend” when what was actually expressed was, “When I was walking to the store, I saw my friend ride by quickly on his bike.”

The gloss captures the signs’ immediate meaning but not how they should ultimately be understood. Microsoft’s new translation engine (PDF) recognizes the gloss of the sign language in a video and feeds that gloss into the translation engine as an additional layer. Not being technically versed in this field, I don’t think I can accurately convey exactly how, but the results are much better than existing efforts and there’s a new database for interested parties to test on as well. I just feel that any advance in this area is worth noting — one of these days we’re going to have a proper sign language translation utility, and it’s going to change a lot of lives for the better.
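
For the curious, here’s how I picture the two-stage idea, as a toy sketch rather than the paper’s actual architecture: one head recognizes a gloss sequence from video features, and the translation stage consumes both the gloss and the visual features. The dimensions, vocabulary sizes and fusion scheme below are my own placeholder assumptions.

```python
# Rough illustration of a gloss-as-intermediate-layer pipeline, assuming
# precomputed video features. Dimensions, vocabularies and the fusion scheme
# are hypothetical, not taken from Microsoft's paper.
import torch
import torch.nn as nn

class GlossAwareTranslator(nn.Module):
    def __init__(self, feat_dim=512, gloss_vocab=1000, text_vocab=8000, hidden=256):
        super().__init__()
        # Stage 1: predict a gloss token at each video time step.
        self.gloss_head = nn.Linear(feat_dim, gloss_vocab)
        self.gloss_embed = nn.Embedding(gloss_vocab, hidden)
        # Stage 2: translate, conditioning on both visual features and gloss.
        self.visual_proj = nn.Linear(feat_dim, hidden)
        self.encoder = nn.GRU(hidden * 2, hidden, batch_first=True)
        self.text_head = nn.Linear(hidden, text_vocab)

    def forward(self, video_feats):               # (batch, time, feat_dim)
        gloss_logits = self.gloss_head(video_feats)
        gloss_ids = gloss_logits.argmax(dim=-1)    # recognized gloss sequence
        fused = torch.cat([self.visual_proj(video_feats),
                           self.gloss_embed(gloss_ids)], dim=-1)
        enc, _ = self.encoder(fused)
        return gloss_logits, self.text_head(enc)   # gloss + text predictions

# Usage with dummy features for a 2-clip batch of 20 frames each:
model = GlossAwareTranslator()
gloss_logits, text_logits = model(torch.randn(2, 20, 512))
```

In a real system the gloss predictions would be supervised with labeled glosses rather than taken from an argmax on faith, but the sketch shows roughly where the extra layer slots in.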

Facebook

Automating fashion and retail is a big area of interest for companies like Facebook, which make their money from advertising. But fashion imagery, we can all agree, hardly represents the average shopper — so if any system wants to make reasonable recommendations to users, it needs to accommodate a wide range of body types and other factors. That’s what this Facebook paper is about.

Image Credits: Facebook

The VIsual Body-aware Embedding system (the acronym is a stretch) analyzes an ordinary photo of a person to estimate their body shape. This it compares with shapes it has analyzed from fashion catalogs, which are assumed to have clothing that flatters that body shape. The resulting output is tailored to the user and therefore (theoretically) won’t recommend things that are obviously a bad match.
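
Stripped to its essence (and this is my own simplification, not the paper’s method), the recommendation step is a nearest-neighbor lookup: embed the shopper’s estimated body shape and the catalog looks in the same space, then rank by similarity. The embeddings and names below are placeholders.

```python
# Simplified sketch of body-aware retrieval: rank catalog items by how close
# their body-shape embeddings are to the shopper's. Embeddings and names are
# placeholders, not outputs of Facebook's actual model.
import numpy as np

def recommend(user_embedding, catalog, top_k=3):
    """Return catalog item names ranked by cosine similarity to the user."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cosine(user_embedding, emb) for name, emb in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Usage with toy 4-dimensional embeddings:
rng = np.random.default_rng(1)
user = rng.normal(size=4)
catalog = {f"look_{i}": rng.normal(size=4) for i in range(10)}
print(recommend(user, catalog))
```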

Obviously this is hardly an objective measure, but it’s an interesting field to slip AI into rather than rely solely on secondary factors like “Others bought these shoes with this dress.” Making sure the system is body-aware rather than body-“agnostic” should help mitigate the body shape biases inherent to the fashion world.

The second paper covers an idea that makes a lot of sense. If you were shown a picture of a person shooting a basketball and then heard a “swisshhh,” you would know they made the shot. A computer that recognizes video and images of basketball, baskets, sports, people and other concepts may never make that connection.

This Facebook paper trains a machine learning system to make exactly that kind of connection, recognizing actions from a single still image and a short clip of audio. The results are great, suggesting this could be used as a lightweight alternative to the computation-heavy option of analyzing every frame in a video, tracking objects and so on.
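
The underlying recipe is easy to sketch: encode the still image, encode the audio clip, fuse the two and classify. The toy model below is my own illustration under assumed feature sizes and a simple concatenation fusion, not the paper’s architecture.

```python
# Toy sketch of fusing a still image and a short audio clip to classify an
# action. Encoder sizes and fusion-by-concatenation are assumptions for
# illustration, not the paper's design.
import torch
import torch.nn as nn

class ImageAudioActionClassifier(nn.Module):
    def __init__(self, img_dim=2048, audio_dim=128, hidden=256, num_actions=50):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)      # e.g. pooled CNN features
        self.audio_proj = nn.Linear(audio_dim, hidden)  # e.g. spectrogram features
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(hidden * 2, num_actions),
        )

    def forward(self, img_feat, audio_feat):
        fused = torch.cat([self.img_proj(img_feat), self.audio_proj(audio_feat)], dim=-1)
        return self.classifier(fused)   # logits over actions like "made the shot"

# Usage with dummy precomputed features for a batch of 8 examples:
model = ImageAudioActionClassifier()
logits = model(torch.randn(8, 2048), torch.randn(8, 128))
```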

Image Credits: Facebook

Facebook’s last interesting project is a new approach to turning a single 2D image into a plausible, detailed 3D image of the person in it. To be quite frank, the advances in this one are quite technical, so I’ll let anyone interested go straight to the paper to learn the details. But the results are hard to argue with. This sort of thing isn’t necessarily useful in direct consumer applications — no one wants a 3D model of themselves — but knowing the 3D aspects of a 2D image is immensely useful for a lot of other reasons that have to do with reusing and enriching that media.

Amazon

That Amazon gets a lot of business from its product recommendations is no secret, but basing those recommendations on customer and seller data can be risky. Better to have a real understanding of a product category.

One such attempt is this paper documenting a method for filling in a missing, matching piece of an outfit. If you’re buying a navy blue top and shoes, it’s unlikely that a recommendation for an orange vinyl skirt will go over well. Not only that, but it helps to know what other items are likely to go with an outfit — if there are sunglasses, then probably not a raincoat.

Image Credits: Amazon

This recommendation retrieval system understands clothing categories and weighs the factors that matter within each one when filling in a missing category. That category awareness also helps the system scale, since it doesn’t waste time analyzing the patterns of sweaters when the “missing” item is extremely unlikely to be a sweater.
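
A crude way to picture that logic (my sketch, not Amazon’s system): only score candidates in the missing category, weighting their compatibility against each item already in the outfit. Everything below, from the embeddings to the category weights, is an illustrative placeholder.

```python
# Crude sketch of category-aware outfit completion: only candidates in the
# missing category are scored, and compatibility with each existing item is
# weighted. Embeddings, categories and weights are illustrative placeholders.
import numpy as np

def complete_outfit(outfit, candidates, missing_category, category_weights):
    """outfit/candidates: lists of (category, embedding); returns best candidate index."""
    def compat(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    best, best_score = None, -np.inf
    for i, (cat, emb) in enumerate(candidates):
        if cat != missing_category:
            continue  # skip sweaters when we're looking for a skirt
        score = sum(category_weights.get(c, 1.0) * compat(emb, e) for c, e in outfit)
        if score > best_score:
            best, best_score = i, score
    return best

# Usage with toy data: the outfit has a top and shoes, and we want a matching skirt.
rng = np.random.default_rng(2)
outfit = [("top", rng.normal(size=8)), ("shoes", rng.normal(size=8))]
candidates = [("skirt", rng.normal(size=8)) for _ in range(5)] + [("sweater", rng.normal(size=8))]
print(complete_outfit(outfit, candidates, "skirt", {"top": 1.5, "shoes": 1.0}))
```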

Virtual try-on will soon be commonplace, and Amazon is working to make that simple and scalable as well. The Outfit-VITON system uses only ordinary 2D images (no special 3D models or matching pairs required) to compose multi-garment outfits that fit the wearer’s body shape and pose. Using less data to create a good-enough result is probably more important for Amazon if it wants to roll out such a feature more or less universally.