Deep learning takes on GIFs, fashion, doodles and more at ACM Multimedia


This week is the Association for Computing Machinery’s Multimedia conference in Amsterdam, and the theme this year beyond a doubt is machine learning. More than 250 papers and posters will be presented and browsed at this huge gathering of researchers, and although many in some way or another leverage neural networks and deep learning, the purposes to which they put these increasingly useful tools are by turns fascinating, practical and whimsical.

I looked through all of them and found lots worth highlighting. Click on for the latest (and weirdest) in machine learning and artificial intelligence.


Predicting GIF Interestingness

Everyone loves GIFs. But not all GIFs are created equal…ly interesting. This paper describes a model for rating GIFs that uses image recognition to tell what’s in the animation and what it’s doing.

As you can see from the results, the bottom of the barrel is lined with smiling bloggers, while the GIFs the model predicted as interesting speak for themselves. You don’t even have to watch them to know it’s correct: “Truck jumping over a race car” and “Toddler gets kicked by a breakdancer” definitely set the bar high.

Key quote: “We show that GIFs of pets are considered more interesting than GIFs of people.”


An automated, wearable museum guide

So you’re at the museum and want to know what you’re looking at, but you don’t want to hunt around for the placard describing it. Fortunately, you’re wearing a computer equipped with a camera and a neural network trained to identify whatever you’re looking at.

That’s what researchers from Florence envision; they hope that these theoretical wearables would not only help people engage with the art, but also give the museum itself insight into who’s looking at what, why and for how long.

Key quote: “8 masterpieces of Donatello have been selected as artworks to be recognized.”

[Il Marzocco is one of them, an emblem of Firenze.]


Magic Mirror, a virtual fashion consultant

If no friends are around to help you put together an outfit, why not ask an AI? Tech-savvy fashionistas at Tsinghua University have put machine learning to work identifying features specific to how you choose your clothing: patterns, colors, formality, sleeve length, season and so on.

The deep learning model attempts to match what you’re wearing with these various features, then makes suggestions based on the occasion, what’s trendy right now and whether things match or not. You see the results instantly on a big display in front of you, and manipulate options using gestures detected by a Kinect.
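The suggestion step can be pictured as attribute lookup against an occasion profile. Here is a minimal sketch of that idea; the attribute names, occasion profiles and scoring rule are all my inventions, while the real system extracts these attributes with a deep model rather than taking them as input:

```python
# Hypothetical sketch of attribute-based outfit scoring (names and
# profiles are mine, not the paper's). The real Magic Mirror extracts
# the attributes with a deep learning model; here they are given.

OCCASION_PROFILES = {
    # Desired attribute values per occasion (illustrative only).
    "office": {"formality": "formal", "pattern": "solid"},
    "beach":  {"formality": "casual", "season": "summer"},
}

def outfit_score(attributes, occasion):
    """Fraction of the occasion's desired attributes the outfit matches."""
    profile = OCCASION_PROFILES[occasion]
    hits = sum(attributes.get(k) == v for k, v in profile.items())
    return hits / len(profile)

detected = {"formality": "formal", "pattern": "solid", "season": "winter"}
print(outfit_score(detected, "office"))  # 1.0
print(outfit_score(detected, "beach"))   # 0.0
```

A real version would rank many candidate garments by this kind of score and fold in trendiness, which is exactly the part that needs the learned model.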

Key quote: “A practical appreciation system for automatic aesthetics-oriented clothing analysis.”


Magic Mirror again

OK, this is the same paper as the last slide. You just needed to see this chart, because it really shows they’re not messing around. The keyword cloud is solid gold.



Detecting loiterers across multiple cameras and days

This is less a brand new field of research than an application of modern facial recognition techniques to an existing one. NEC’s researchers propose a new method, AntiLoiter, for tracking people on multiple cameras and at many times by simplifying the recognition task.

Their “Luigi method,” which really is named after Mario’s brother, identifies people who appear multiple times but with slightly different appearances, something that can confuse other systems. The team also claims speed improvements of orders of magnitude over other approaches.
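Once sightings are matched to identities, the loitering decision itself is a counting problem. This toy sketch assumes the hard part (face matching across appearance changes) is already done and the thresholds are my own; AntiLoiter’s actual matching and indexing are far more involved:

```python
from collections import defaultdict

# Toy sketch of a loitering heuristic: someone matched often enough,
# on enough cameras, gets flagged. Thresholds and the pre-matched
# identity labels are my assumptions, not the paper's.

def find_loiterers(sightings, min_appearances=3, min_cameras=2):
    """sightings: list of (identity, camera, day) tuples from a matcher."""
    per_id = defaultdict(set)
    for identity, camera, day in sightings:
        per_id[identity].add((camera, day))
    return sorted(
        identity for identity, seen in per_id.items()
        if len(seen) >= min_appearances
        and len({cam for cam, _ in seen}) >= min_cameras
    )

sightings = [
    ("A", 1, "mon"), ("A", 2, "mon"), ("A", 1, "tue"),
    ("B", 1, "mon"),
]
print(find_loiterers(sightings))  # ['A']
```

Note how blunt this is: a regular commuter would trip the same thresholds, which is exactly the open problem the key quote below admits to.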

Key quote: “It still faces some issues to be solved. The most critical one is how to differ loiterers from normal people who just happen to appear frequently.”



Recognizing objects from rough sketches

Sketch a picture of a cat, or a castle, or a crab and most people will get what you’re trying to convey, but unless you have a little talent, the drawing probably doesn’t look much like the real thing. That’s not a problem for a system created by Belgian computer scientists, which can recognize toddler-level sketches of objects in 250 categories.

This has been done a couple of times before, but one interesting aspect of this approach is that the machine learning system is exposed to the drawing as it’s created, seeing it at various fractions of completeness. Turns out that can help identify the object; after all, have you ever seen anyone draw the chimney on a house first? Me neither. So with just the first 20 percent of the image, they can get the right category 62 percent of the time.
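The “various fractions of completeness” idea is easy to illustrate: take the stroke sequence, cut it off at each fraction and classify the prefix. Everything below is a made-up stand-in (the stroke names and the trivial “classifier” are mine; the paper uses a learned model on real sketch data):

```python
# Illustrative sketch of classifying a drawing at partial completeness.
# The stroke labels and the rule-based "classifier" are invented; the
# real system works on pixel renderings with a trained model.

def partial_views(strokes, fractions=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Yield the first k strokes for each completeness fraction."""
    for f in fractions:
        k = max(1, round(f * len(strokes)))
        yield f, strokes[:k]

def toy_classifier(strokes):
    # Stand-in for a trained model: guesses "house" once a roof stroke
    # appears, otherwise stays undecided.
    return "house" if "roof" in strokes else "unknown"

drawing = ["wall", "wall", "roof", "door", "chimney"]
for fraction, view in partial_views(drawing):
    print(fraction, toy_classifier(view))
```

Training on these prefixes is what lets the model commit early: typical drawing orders (walls before chimney) become a learnable signal.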

Key quote: “Freehand sketches are a simple and powerful tool commonly used by humans.”


Towards a true Pictionary-playing AI

Here’s another along the sketch-identification lines. If you ever played the Nintendo game Anticipation, you’ll know the heartbreak of having the computer beat you at recognizing simple drawings. This study notes the same fact as the previous one: the sequence of lines in a freehand drawing is helpful in learning what object is being illustrated.

Their system focuses on identification, not visual similarity, though, and consequently might be a serious competitor on game night.

Key quote: “Our framework can enable interesting applications such as camera-equipped robots playing the popular party game Pictionary with human players and generating sparsified yet recognizable sketches of objects.”


Pictionary errors

Another figure from the Pictionary-learning system paper. The ground truth is in blue; what the system guessed is in pink. I just thought these were hilarious.

Key quote: “Most of the misclassifications are reasonable errors.”


Creating 3D imagery from a hybrid slow-mo camera system

Many cameras shoot in high speed these days, capturing slow-motion footage. Depth-sensing cameras like the Kinect, however, can’t do that nearly as easily. This paper shows how you can combine imagery from a high-speed ordinary camera with the much more sparse info from a depth-sensing one and synthesize a slow-motion 3D scene.

Check out this higher-res version; see how the basic method in the middle has lots of glitches? The bottom one smooths it out by intelligently bringing in data from the 2D color footage. It makes for a much more detailed image than other techniques, also shown in illustrations in the paper. This one didn’t use much machine learning, though, if I’m honest. It’s just really cool.
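The core timing mismatch is easy to see in code: depth arrives far less often than color, so each high-speed RGB frame needs a synthesized depth frame. This minimal sketch assumes 30 fps depth against 240 fps RGB and uses plain linear interpolation, standing in for the paper’s much smarter color-guided synthesis:

```python
# Minimal sketch of the slow-mo depth problem: with depth at 30 fps
# and RGB at 240 fps, 8 RGB frames fall between each pair of depth
# frames. Linear interpolation here is a crude stand-in for the
# paper's RGB-guided synthesis, which uses the color footage to fix
# glitches; depth "frames" are scalars for brevity.

def interpolate_depth(depth_frames, depth_fps=30, rgb_fps=240):
    step = rgb_fps // depth_fps          # RGB frames per depth interval
    out = []
    for a, b in zip(depth_frames, depth_frames[1:]):
        for i in range(step):
            t = i / step                 # position between keyframes
            out.append(a + t * (b - a))
    out.append(depth_frames[-1])
    return out

depths = [1.0, 2.0]                      # two consecutive depth frames
print(len(interpolate_depth(depths)))    # 9
```

Pure interpolation is roughly the glitchy middle row of the figure; the contribution is using the dense 2D footage to decide how depth actually moved between keyframes.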

Key quote: “Our high-speed and high-quality RGB-D sequence can be used in many areas where the motion is fast, such as sport event, gait analysis, etc.”


Playlists that adapt to your commute

You know that good feeling when the perfect song comes on at the perfect time, at the perfect place? These Brits think they can simulate that by actually keying tracks and parts of tracks to different locations, actions and other factors that come into your walk or ride to work.

The idea is simply to automatically play songs that fit well with the user’s current situation: something relaxing for a nice view, a pump-up jam for trucking up a hill and what have you. While the tracks were manually selected, they were trimmed, crossfaded and otherwise changed dynamically on the go. It’s a proof of concept with very little AI behind it, but future versions could have much more autonomy.
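Since the tracks were hand-picked, the "AI" here is mostly a mapping from situation to music. A toy version of that keying, with segment names and track choices invented by me, is just a lookup with a fallback:

```python
# Toy version of location-keyed playback. Segment names and track
# choices are invented; the study's tracks were hand-curated and then
# trimmed and crossfaded dynamically as the listener moved.

SEGMENT_TRACKS = {
    "tree-lined avenue": "something relaxing",
    "hill climb":        "pump-up jam",
}

def track_for(segment, default="neutral filler"):
    return SEGMENT_TRACKS.get(segment, default)

route = ["tree-lined avenue", "hill climb", "car park"]
print([track_for(s) for s in route])
```

A more autonomous future version would learn this table instead of having researchers write it, which is the gap the authors acknowledge.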

Key quote: “Two participants highlighted the incongruence between the fixed playlist, which starts with two high intensity tracks, and that part of the walk from the stately home along a tree-lined avenue.”


Using facial features for "acceptable" low-bandwidth video calls

If you want a decent image for a video call, say 720p over Skype, you need to make sure both ends can handle a megabit or two — and sometimes that’s just not possible. This collaboration between several Chinese institutions packs a decent image into a data stream two orders of magnitude smaller.

By dynamically identifying facial features and applying those movements to the previous frame instead of sending fresh full-resolution imagery, this method gets a basic stream through in under 30 kilobits. That’s less than many an audio call! You won’t want to print the resulting imagery, but it should be good enough for quick calls on mobile.
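The budget works out on the back of an envelope. Every number below is my assumption, not a figure from the paper: 68 facial landmarks, two coordinates each, 8-bit quantized deltas, 15 updates a second:

```python
# Back-of-envelope check that landmark motion fits in ~30 kbps.
# All parameters are assumptions for illustration, not the paper's.

landmarks = 68          # e.g. a standard 68-point face model
coords = 2              # x and y per landmark
bits_per_delta = 8      # quantized motion per coordinate
updates_per_sec = 15

bitrate = landmarks * coords * bits_per_delta * updates_per_sec
print(bitrate)           # 16320 bits/s
print(bitrate < 30_000)  # True: comfortably under 30 kilobits
```

Compare that with raw 720p video, which needs a thousand times more; the trick is that only the motion is sent, and the receiver warps the previous frame to match.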

Key quote: “We emphasize that our goal is very aggressive”


Detecting sarcasm that plays off images

Linguists and computer scientists have been trying for years to reliably detect sarcasm, and they’ve had great success, just astounding. /s

Turns out it’s quite difficult, especially if you have to see and understand an accompanying image to get the joke. That particular case is what this study looked at. The model the researchers created had to get some idea of what the image depicted, then contrast that with the text, which may or may not be treating the subject with levity. It worked best on Instagram, where the text and image are most closely related.
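The contrast step can be caricatured in a few lines: if the text reads upbeat but the image shows something clearly unpleasant, flag it as a sarcasm candidate. The word lists and image labels below are invented stand-ins for the paper’s learned sentiment and image models:

```python
# Crude illustration of the image-text contrast idea. The word list
# and image labels are made up; the actual study uses learned models
# for both sides rather than lookups.

POSITIVE_WORDS = {"love", "great", "wonderful", "perfect"}
NEGATIVE_IMAGE_LABELS = {"traffic jam", "rain", "broken phone"}

def sarcasm_candidate(image_label, caption):
    """Flag posts where upbeat text accompanies a clearly bad scene."""
    text_positive = any(w in caption.lower().split() for w in POSITIVE_WORDS)
    image_negative = image_label in NEGATIVE_IMAGE_LABELS
    return text_positive and image_negative

print(sarcasm_candidate("traffic jam", "Wonderful start to the day"))  # True
print(sarcasm_candidate("sunset", "Wonderful start to the day"))       # False
```

The hard part, of course, is getting a reliable image label and text sentiment in the first place, which is where the deep learning comes in.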

Key quote: “In modern online platforms, hashtags and emojis are common mechanisms to reveal the speaker’s true sentiment.”


All the rest

This is only a tiny fraction of what’s being presented and demoed this week; if you’re curious about one topic or another, go ahead and scroll through this immense list of papers and presentations.