Dispatches from the cutting edge of computer vision


The International Conference on Computer Vision just wrapped up, bringing together the finest minds in computer vision and machine learning to compare notes, give presentations, and probably just marvel at how far we’ve come in the last few years. Assistive AI and self-driving cars need computer vision to advance by leaps and bounds, and the researchers of the world are happy to oblige.

Here are a handful of the most interesting projects, along with some extremely simplified explanations of why they’re so cool.


Raising smartphone photos to DSLR quality

Don’t let the fundamental inferiority of your phone’s smaller sensor and lens get in the way of photographic greatness. This paper looked at photos of the same exact scenes taken on several platforms and modeled the differences between them. The result is an algorithm that does more than resize a low-quality photo — it converts it on a deeper level, intelligently refining details and colors. It can’t create what isn’t there, but it may help improve photos beyond just tweaking the curves and contrast.
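The core idea, learning a mapping from paired phone/DSLR shots of the same scene, can be caricatured with a per-channel affine fit. The paper uses a deep network that models far richer, spatially varying differences; everything below is an illustrative assumption:

```python
import numpy as np

def fit_channel_map(phone, dslr):
    """Fit a per-channel affine map (gain, bias) from paired photos of
    the exact same scene, shot on a phone and a DSLR. A drastically
    simplified stand-in for the paper's learned model."""
    gains, biases = [], []
    for c in range(phone.shape[-1]):
        g, b = np.polyfit(phone[..., c].ravel(), dslr[..., c].ravel(), 1)
        gains.append(g)
        biases.append(b)
    return np.array(gains), np.array(biases)

def enhance(phone_image, gains, biases):
    """Apply the fitted map to a new phone photo (values in [0, 1])."""
    return np.clip(phone_image * gains + biases, 0.0, 1.0)
```

Even this crude version captures the "convert, don't just resize" idea: the output is pushed toward the color response of the better camera, not merely scaled up.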


Improving dual-lens smartphone portraits

Adding faux background blur is all the rage in dual-camera smartphones, but it’s not as simple as using a magic wand and selecting the person, then blurring the rest. And visually complicated scenes with complex hair or clothing tend to confound the algorithms that decide what’s part of a person and what isn’t. This work from Tencent and Hong Kong researchers puts two more basic computer vision tools together to form a single robust one.

The system uses simple optical flow to pick out obvious boundaries in the image on one hand, and an object recognition system to segment the image into meaningful parts on the other. By combining the data from these two analyses, errors where the system might mistake a post for an arm, or the like, are reduced, and a much more accurate map of the image is created. Now the blur can be added!
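A toy version of that fusion step, with numpy arrays standing in for both branches. The masks, thresholds, and box blur here are all illustrative assumptions, not the paper's actual method:

```python
import numpy as np

def refine_person_mask(seg_prob, boundary, threshold=0.5):
    """Fuse a coarse segmentation confidence map (the object-recognition
    branch) with an edge-strength map (the boundary branch, e.g. from
    optical flow discontinuities). Where the segmentation is unsure, a
    strong boundary stops the person mask from leaking past it."""
    mask = seg_prob > threshold
    ambiguous = np.abs(seg_prob - threshold) < 0.15
    return np.where(ambiguous & (boundary > 0.5), False, mask)

def fake_portrait_blur(image, mask, strength=3):
    """Blur everything outside the mask with a crude box filter."""
    blurred = image.copy().astype(float)
    for _ in range(strength):
        blurred = (np.roll(blurred, 1, 0) + np.roll(blurred, -1, 0) +
                   np.roll(blurred, 1, 1) + np.roll(blurred, -1, 1)) / 4
    return np.where(mask[..., None], image, blurred.astype(image.dtype))
```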


Creating photorealistic images from scratch on demand

Imagine a house, but it’s upside-down, and made of meat, and someone’s pouring mustard all over it. Not the most pleasant image, but you didn’t have any trouble picturing it in your mind, right? Having computers do the same thing would be a powerful tool and also is just an interesting challenge on its own.

It’s actually been done before, but the results aren’t pretty. In this paper, however, the researchers essentially have the computer make a first attempt based on its knowledge of words and images; a separate algorithm then evaluates the resulting picture and makes suggestions, and the picture is refined. It’s a bit like making a rough sketch of what you’re thinking of, then looking at it and fixing it for the next iteration. The pictures are still pretty crude, but they’re recognizable, and that’s what matters.
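The draft-and-critique loop can be sketched numerically. Here the "critic" simply measures distance to a target and suggests corrections; real systems use adversarial or learned critics, and nothing below comes from the paper itself:

```python
import numpy as np

def critic(picture, target):
    """Score a candidate picture (lower is better) and suggest fixes."""
    error = picture - target
    return float(np.abs(error).mean()), -error  # (score, suggested change)

def refine(target, steps=20, step_size=0.5, seed=0):
    """Draft-and-fix loop: propose a rough first attempt, get a critique,
    apply a fraction of the suggested change, and repeat."""
    rng = np.random.default_rng(seed)
    picture = rng.random(target.shape)  # the rough first attempt
    for _ in range(steps):
        score, suggestion = critic(picture, target)
        picture = picture + step_size * suggestion  # redraw with fixes
    return picture
```

Each pass halves the remaining error, which is the "sketch, look, fix, repeat" intuition in miniature.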


Creating photorealistic images from scratch on demand, but different

Okay, this one is similar but different. Imagine you wanted to create a scene with the people here, the trees here, and the mountains here. You give that information to this AI system and it searches through its database of imagery, finding pieces that fit the shape and size you require and intelligently pasting them together.

The resulting images are remarkably high quality, about as good as the mock-ups of buildings where people and benches are pasted in: obviously not real, but plausible. You could mock up a home, a street scene, or a park with no more effort than it takes to throw together a sketch in MS Paint.


The last one, but backwards

One of the hardest parts of training self-driving cars is giving them footage that’s adequately labeled: here’s a cyclist, here’s a parked car, here’s a pylon, etc. If that labeling can be done automatically and reliably, you can annotate hours of video in seconds, giving the computer vision systems that watch the road lots of extra information to work with. That’s the goal of this paper, which documents a new method that adds a bit of depth perception to the mix to make identifying objects much easier. It gives the vision system a bit of common sense with which to tell, under trying circumstances, that no, that truck doesn’t smoothly transition into a nearby trolley with similar colors and motion: they’re two distinct objects. The result is more confident labeling of objects and regions in an image.
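That "common sense" depth cue can be illustrated with a one-line rule over a strip of pixels: matching colors are only grouped into one object if their depths also agree. This is a toy example, not the paper's method:

```python
def label_objects(pixels, depth_gap=1.0):
    """Assign an object label to each (colour, depth) pixel along a strip.
    A colour match alone is not enough to merge pixels into one object;
    a jump in depth starts a new label."""
    labels = []
    current = 0
    for i, (colour, depth) in enumerate(pixels):
        if i and (colour != pixels[i - 1][0]
                  or abs(depth - pixels[i - 1][1]) > depth_gap):
            current += 1
        labels.append(current)
    return labels
```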


Real-time arbitrary style transfer

Style transfer neural networks are the things you’ve probably seen that make your video look like an impressionist painting or some other look that would take forever to do manually. They’re cool, but they’re generally limited to a pre-trained set of looks that take a while for the system to get straight.

This paper describes a new style transfer network that not only works in real time, but can take any scene or painting as input and immediately apply it. Don’t like the palette of Starry Night? Find a copy of The Scream and see if that’s more your style. There are even intensity controls and all that jazz. Expect an app (or a spinoff sale to Snap or Facebook) in short order.
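One well-known recipe for real-time, arbitrary style transfer is adaptive instance normalization: match each channel's statistics in the content image to those of the style image. Below is a toy version applied to raw pixels; real systems apply it to deep network features, and this sketch is not necessarily the method of the paper described here:

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: re-scale the content image so each
    channel's mean and standard deviation match the style image's."""
    c_mean, c_std = content.mean((0, 1)), content.std((0, 1))
    s_mean, s_std = style.mean((0, 1)), style.std((0, 1))
    normalized = (content - c_mean) / (c_std + eps)
    return normalized * s_std + s_mean
```

Because this is a single closed-form statistics swap rather than an optimization loop, it runs in one pass, which is where the real-time, any-style-as-input property comes from.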


Captioning complex overlapping events in video

Getting a computer to describe what’s happening in a video is difficult enough, but scenes are often more complex than a single sentence like “the child walks across the room.” That may be the main event, but what about the dog that barks at her halfway through? What about the parents cheering at the end? Videos often include many events, related and unrelated, and any viewer could easily describe all of them. So why not a machine learning system?

That’s what this paper describes: a system that can describe overlapping and perhaps related events with varying lengths and starting points. You can imagine how useful this would be in finding the correct part of a long video on YouTube — you could just skip to “the part where the gorilla shows up.”
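The output of such a system is naturally a set of overlapping (start, end, caption) events, which makes the skip-to-the-moment query a simple search. The data layout below is an assumption for illustration, not the paper's representation:

```python
def captions_at(events, t):
    """All events in progress at time t; overlapping events are fine."""
    return [cap for start, end, cap in events if start <= t < end]

def skip_to(events, keyword):
    """Start time of the earliest event whose caption mentions keyword."""
    matches = sorted((start, cap) for start, end, cap in events
                     if keyword in cap)
    return matches[0][0] if matches else None
```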


Describing images with natural language

Say you saw the image at left. Of the two following captions, which seems like the better, more human description? “A cow standing in a field with houses” or “Grey cow walking in a large green field in front of a house”? The latter, probably. But computers don’t have any natural understanding of what makes a description sound human — unless they’re taught to make their own descriptions resemble those written by humans.

In this paper, one neural network creates descriptions of a scene while another compares each description to human-created ones, rating more highly those that better resemble our own style of speech. This could lead to less stilted captioning of images and video: less “baby walks to car” and more “a little girl walks towards a beige minivan.”
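As a crude stand-in for that second, judging network, you can score candidate captions by how much vocabulary they share with human-written references and keep the winner. The scoring function here is a plain Jaccard overlap, an illustrative assumption rather than the paper's learned discriminator:

```python
def humanlikeness(candidate, human_captions):
    """Toy stand-in for the judging network: score a caption by its best
    word overlap (Jaccard similarity) with human-written captions."""
    cand = set(candidate.lower().split())
    best = 0.0
    for ref in human_captions:
        ref_words = set(ref.lower().split())
        best = max(best, len(cand & ref_words) / len(cand | ref_words))
    return best

def pick_caption(candidates, human_captions):
    """Keep the candidate rated most human-like."""
    return max(candidates, key=lambda c: humanlikeness(c, human_captions))
```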


Expecting unexpected relationships between objects – weakly supervised

One weakness of machine learning systems is essentially that their “vocabulary” of actions and items is often limited. It may understand that people ride horses, but not dogs. Therefore, if someone is riding something, it must not be a “dog” — or if someone is on top of a dog, they must not be “riding” it. But unusual combinations of objects and actions happen all the time – in fact, they’re usually the most worthy of documenting!

This system was trained to recognize objects and the relationships between them based on spatial cues, regardless of what type of objects were pictured. So although the system may never have seen a pig frying a pancake, it will be able to recognize it when it sees it — because it has a general idea of what a pig looks like, what a pancake looks like, and what frying looks like, and it puts them all together.
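Because the relation is judged from geometry alone, it generalizes to object pairs never seen together in training. A toy version with bounding boxes given as (x, y, width, height), where everything, including the relation rules, is an illustrative assumption:

```python
def relation(box_a, box_b):
    """Classify a spatial relation purely from box geometry, regardless of
    what the objects are. Boxes are (x, y, w, h); y grows downward, as in
    image coordinates."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    if ay + ah <= by:   # a's bottom edge is above b's top edge
        return "on top of"
    if ax + aw <= bx:   # a ends before b begins horizontally
        return "left of"
    return "near"

def describe(name_a, box_a, name_b, box_b):
    """Compose object recognition with the geometric relation."""
    return f"{name_a} {relation(box_a, box_b)} {name_b}"
```

The relation classifier never needs to have seen a person on a dog: it applies the same rule it learned from people on horses.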


Answering complex questions

When humans ask questions about an image or situation, they don’t always use the most precise language. For example, instead of saying “is there a person behind the blue car?” you might ask, “is anyone behind the car?” Unless the system knows already what “anyone” is and what car you’re referring to, it might choke. These researchers are working on a method for machine learning systems to essentially reason on the fly, making a best guess for what you mean and then putting together a short program that attempts to find an answer.
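The "short program" idea can be mimicked with a keyword-driven dispatcher over a toy scene description. Only two templates are hard-coded here; a real system learns to compose such steps, and the scene format is an assumption:

```python
def answer(question, scene):
    """Compile a question into a tiny program of filter/count/exists steps.
    scene is a list of dicts like {"type": "car", "depth": 1.0}."""
    words = question.lower().rstrip("?").split()
    if words[:2] == ["how", "many"]:
        wanted = words[2].rstrip("s")            # crude singularization
        return sum(obj["type"] == wanted for obj in scene)
    if words[0] == "is" and "behind" in words:
        anchor_type = words[-1]                  # e.g. "car"
        anchors = [o for o in scene if o["type"] == anchor_type]
        # "anyone" resolves to any person; "behind" compares depths
        return any(o["type"] == "person" and a["depth"] < o["depth"]
                   for o in scene for a in anchors)
    return None
```

Note how the vague "anyone" and "the car" are resolved by a best guess (any person, any car), which is exactly the kind of on-the-fly interpretation the researchers are after.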

It’s a matter of figuring out what problems need to be solved (how many of this are there, how do you describe this thing) and how those things might relate — which, for a computer, is pretty hard. But this paper puts together a pretty effective system nevertheless.


Seeing around corners without looking

You know how sometimes you can sort of tell if, for example, a TV is on around the corner because you can see its light reflecting on the shiny floor? If you paid really close attention, you might actually be able to figure out much more about the scene from those subtle variations of light. And that’s what this system does.

By looking VERY closely at the light that’s visible at different angles from a corner (but without going around it), this system puts together a “1-D video” showing basic features like colors and spatial relationships. It can’t tell much, but seeing anything at all just by studying the ground near a corner is pretty impressive.
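An idealized, noise-free version of the reconstruction: each angular wedge of floor near the corner is lit by everything up to its angle, so the observation is a running sum over the hidden scene, and differencing adjacent wedges recovers a 1-D image. This is a deliberate simplification for illustration, not the system's actual pipeline:

```python
import numpy as np

def observe_floor(hidden_scene):
    """Each angular wedge of floor sees all of the hidden scene up to its
    angle, so the observation is a cumulative sum (idealized, no noise)."""
    return np.cumsum(hidden_scene, dtype=float)

def recover_scene(floor_wedges):
    """Differencing adjacent wedges undoes the cumulative sum, yielding a
    1-D 'image' of what is around the corner."""
    return np.diff(floor_wedges, prepend=0.0)
```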


Counting heads

Knowing how many people are at an event is critical for planners and venue managers, but unless you’re carefully tracking everyone who enters and leaves, it can be easy to lose count. A human can ballpark it and say there are “about 250” people in a room, but fire marshals tend to like exact numbers. This system aims to count the people in an image quickly and accurately, and does so with better success than any other method out there.
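A common recipe for this task is to predict a density map whose integral equals the head count. Building such a map from annotated head positions shows why summing it works; this is the general technique, not necessarily this paper's exact method:

```python
import numpy as np

def density_map(head_positions, shape, sigma=1.5):
    """Place a unit-mass Gaussian blob at each annotated head position.
    The map's integral then equals the head count, which is the target
    that counting networks are trained to reproduce."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    dmap = np.zeros(shape)
    for hy, hx in head_positions:
        blob = np.exp(-((ys - hy) ** 2 + (xs - hx) ** 2) / (2 * sigma ** 2))
        dmap += blob / blob.sum()  # normalize so each head adds exactly 1
    return dmap
```

At inference time the network predicts the map directly and the count is just `dmap.sum()`, which copes with dense crowds far better than detecting each head individually.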


Inferring road layouts from aerial imagery

Automatically figuring out how the roads go just from aerial images is something people have been trying to do for years — but it’s hard! It’s only now that machine learning systems can do the image analysis piece and reason well about parts they can’t actually see. This one was trained on a large part of Toronto, then set loose on a different portion of the city. The results are pretty solid.

The green lines are where it got things right; red is false positives, and blue is where it failed to label a road. It’s still not perfect, and up close the lines get a little wiggly, but for an entirely automated first pass it isn’t bad. Human workers or a different system could handle the next bit.