Everything you know about computer vision may soon be wrong

Ubicept wants half of the world's cameras to see things differently

Computer vision could be a lot faster and better if we skipped the concept of still frames and instead directly analyzed the data stream from a camera. At least, that’s the theory that Ubicept, the newest brainchild to spin out of the MIT Media Lab, is operating under.

Most computer vision applications work the same way: A camera takes an image (or a rapid series of images, in the case of video). These still frames are passed to a computer, which then does the analysis to figure out what is in the image. Sounds simple enough.

But there’s a problem: That paradigm assumes that creating still frames is a good idea. As humans who are used to seeing photography and video, that might seem reasonable. Computers don’t care, however, and Ubicept believes it can make computer vision far better and more reliable by ignoring the idea of frames.

The company is a collaboration between its two co-founders. Sebastian Bauer, the CEO, is a postdoc at the University of Wisconsin, where he was working on lidar systems. Tristan Swedish, now Ubicept’s CTO, spent the previous eight years at the MIT Media Lab as a research assistant and a master’s and Ph.D. student.

“There are 45 billion cameras in the world, and most of them are creating images and video that aren’t really being looked at by a human,” Bauer explained. “These cameras are mostly for perception, for systems to make decisions based on that perception. Think about autonomous driving, for example, as a system where it is about pedestrian recognition. There are all these studies coming out that show that pedestrian detection works great in bright daylight but particularly badly in low light. Other examples are cameras for industrial sorting, inspection and quality assurance. All these cameras are being used for automated decision-making. In sufficiently lit rooms or in daylight, they work well. But in low light, especially in connection with fast motion, problems come up.”

The company’s solution is to bypass the “still frame” as the source of truth for computer vision and instead measure the individual photons that hit an imaging sensor directly. That can be done with a single-photon avalanche diode array (or SPAD array, among friends). This raw stream of data can then be fed into a field-programmable gate array (FPGA, a type of super-specialized processor) and further analyzed by computer vision algorithms.

The newly founded company demonstrated its tech at CES in Las Vegas in January, and it has some pretty bold plans for the future of computer vision.

“Our vision is to have technology on at least 10% of cameras in the next five years, and in at least 50% of cameras in the next 10 years,” Bauer projected. “When you detect each individual photon with a very high time resolution, you’re doing the best that nature allows you to do. And you see the benefits, like the high-quality videos on our webpage, which are just blowing everything else out of the water.”

TechCrunch saw the technology in action at a recent demonstration in Boston and wanted to explore how the tech works and what the implications are for computer vision and AI applications.

A new form of seeing

Digital cameras generally work by grabbing a single-frame exposure, “counting” the number of photons that hit each of the sensor pixels over a certain period of time. At the end of that period, the photon counts for every pixel are added up, and you have a still photograph. If nothing in the image moves, that works great, but the “if nothing moves” part is a pretty big caveat, especially when it comes to computer vision. It turns out that when you are trying to use cameras to make decisions, everything moves all the time.
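To make the frame-based model concrete, here’s a minimal Python sketch (the numbers and the `conventional_exposure` helper are entirely hypothetical, not anyone’s real pipeline) of how a conventional exposure collapses time: every photon that arrives during the window lands in the same frame, so a moving subject smears into a streak.

```python
import numpy as np

def conventional_exposure(photon_counts):
    """Collapse per-pixel photon arrivals over the whole exposure window.

    photon_counts: array of shape (T, H, W), photon arrivals per pixel in
    each of T short time slices. Summing over the time axis is what a
    conventional frame does -- anything that moves during the window blurs.
    """
    return photon_counts.sum(axis=0)

# Hypothetical scene: a bright spot drifting one pixel per time slice.
T, H, W = 8, 16, 16
counts = np.zeros((T, H, W), dtype=np.int32)
for t in range(T):
    counts[t, 8, 4 + t] = 100            # the moving subject

frame = conventional_exposure(counts)    # the spot becomes an 8-pixel streak
```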

Of course, with the raw data, the company is still able to combine the stream of photons into frames, which creates beautifully crisp video without motion blur. Perhaps more excitingly, dispensing with the idea of frames means that the Ubicept team was able to take the raw data and analyze it directly. Here’s a sample video of the dramatic difference that can make in practice:

Number plate recognition using Ubicept’s technology

“SPAD sensors are manufactured using a CMOS process,” Swedish said, referring to the tech that’s used in a lot of existing digital cameras. “That means they are very scalable in terms of making them.”

Instead of creating image frames, however, the SPADs are able to detect individual photons, time-stamping each one very accurately. The technology is usually used in lidar, where you send out a pulse of light, wait for it to return and measure the difference in time. Ubicept is removing the need for the active component (i.e., the light pulse) and just looking at the raw data the sensors receive.
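For context, the lidar use of that time-stamping boils down to simple round-trip arithmetic. Here’s a back-of-the-envelope sketch (not Ubicept’s code, just the standard time-of-flight relation) of why nanosecond-scale timing matters:

```python
C = 299_792_458.0  # speed of light in m/s

def lidar_range(round_trip_seconds):
    """Convert a round-trip time of flight into distance: out and back, so divide by two."""
    return C * round_trip_seconds / 2.0

# A photon time-stamped ~66.7 nanoseconds after the pulse left implies a target ~10 m away.
print(round(lidar_range(66.7e-9), 2), "meters")
```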

“You can use that data and some very clever computation to get images that are far better than what a conventional camera can produce. We are capturing the data at a more fundamental level,” Swedish said. “Conceptually, we’re digitizing light directly. This allows us to do a lot more in software, where we would previously have required analog hardware.”

As a deep camera nerd, I found the approach melted my brain a little (in a good way). It means that the worst-case scenario is that the computer vision systems Ubicept has created are merely as good as conventional cameras. In other words: In perfect lighting with nonmoving targets, the quality of the images and the amount of data that can be garnered from those images should be the same. As soon as the scene shifts toward less-than-ideal capture conditions (low light, fast-moving targets), the advantage starts shifting toward Ubicept’s tech.

Of course, it isn’t entirely without its drawbacks: SPAD sensors are a bit more expensive than conventional CMOS sensors, and the vast amount of data streamed from the sensors needs to be processed and stored in a way that is useful to the end application.

An inherent and curious advantage of using SPAD sensors is that they produce far less of what photographers know as “digital noise.”

“Every time you read data from a [conventional] sensor chip, the sensor itself adds noise. That is one major source of noise when you’re taking an image in low light. The interesting thing is that the SPAD sensors are fundamentally digital: There is zero read noise,” Bauer explained. “That means that you can capture as many of these frames per second as you like; you don’t pay a toll. You can do that 100,000 times per second, or a million times per second.”

The output of this is what the company refers to as a “photon cube” — essentially a three-dimensional timeline of when each photon hit the imaging sensor. Ubicept’s product is the signal processing and computer vision algorithms that interpret this single-photon data stream.
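As a rough illustration of what a photon cube implies (this is a toy sketch, not Ubicept’s actual format; the shapes and helper names are hypothetical), individual time-stamped detections can be binned into a (time, row, column) volume, and frames can then be re-formed at any rate after the fact:

```python
import numpy as np

def build_photon_cube(events, shape, n_slices, t_max):
    """Bin time-stamped photon detections into a (time, row, col) volume.

    events: iterable of (t, row, col) tuples, one per detected photon.
    Keeping the time axis means frames can be formed later, at any rate,
    or skipped entirely in favor of analyzing the raw stream.
    """
    cube = np.zeros((n_slices,) + shape, dtype=np.uint16)
    for t, row, col in events:
        idx = min(int(t / t_max * n_slices), n_slices - 1)
        cube[idx, row, col] += 1
    return cube

def frames_from_cube(cube, slices_per_frame):
    """Collapse groups of time slices into viewable frames after capture."""
    n = cube.shape[0] // slices_per_frame
    trimmed = cube[: n * slices_per_frame]
    return trimmed.reshape(n, slices_per_frame, *cube.shape[1:]).sum(axis=1)

# Hypothetical stream: three photons landing on one pixel at different times.
cube = build_photon_cube([(0.1, 2, 3), (0.5, 2, 3), (0.9, 2, 3)],
                         shape=(8, 8), n_slices=10, t_max=1.0)
frames = frames_from_cube(cube, slices_per_frame=5)  # two frames from ten slices
```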

“What’s interesting about our approach is that you’re constantly streaming data, so we inherently don’t have this issue of, ‘Did I miss the thing that only happens in a brief period of time?’” Swedish explained. “That has a direct impact on being able to improve downstream perception, like pedestrian detection and tracking, and other kinds of high-level vision applications. This is a shift in thinking.”

In addition to the change of approach, the company has a few new challenges to resolve: Capturing every photon as a raw stream produces a very high-bandwidth firehose of data. The biggest challenge Ubicept faces, then, is figuring out which data it can safely discard and which it needs to keep.
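To get a feel for the scale of that firehose, here’s some back-of-the-envelope arithmetic using assumed figures (a hypothetical 1-megapixel SPAD array read out as 1-bit binary frames at the 100,000-per-second rate Bauer mentions):

```python
# Rough data-rate estimate for a raw single-photon stream (assumed figures).
pixels = 1_000_000            # hypothetical 1 MP SPAD array
binary_frames_per_second = 100_000
bits_per_second = pixels * binary_frames_per_second   # 1 bit per pixel per binary frame

print(bits_per_second / 1e9, "Gbit/s")   # 100.0 Gbit/s, before any on-FPGA reduction
```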

Implications for computer vision

The company has published a number of proofs of concept that show off what this technology can actually do compared to other computer vision solutions.

The demonstration that caught my eye originally, as a photographer, was the video quality of footage shot out of a moving car:

Of course, the truly impressive part comes when you run that same video through a computer vision object recognition engine:

The implications for what this does to the field of computer vision may be nontrivial. In a nutshell, it enables industrial, commercial and automation technologies to be an order of magnitude better in low-light and high-speed environments. The company demonstrated what its technology could do in near darkness at 200 mph, with deeply impressive results.

“Vision is so fundamental to seeing and understanding the world for robots and computers. If you’re building a system that moves around, especially near people, it’s really critical that you have very reliable and robust perception. It’s really important that you understand and see the world,” Swedish said. “We’re not building a consumer-facing product; we’re building a technology stack that can be integrated to solve end-user cases: robotics, autonomous vehicles, monitoring systems, etc.”

Coming out of stealth at CES, the company has started to see strong demand for its technology from roboticists who operate in hard-to-control environments.

“But what that means is planes, helicopters, drones, cars, trucks, off-road vehicles, specialty vehicles and robots,” Bauer ticked off, broadening the use cases for the type of tech they are developing. Automated guided vehicles (AGVs) in particular (such as pick-and-pack robots) may prove to be a beachhead audience.

“AGVs usually operate in a warehouse where you can control the lighting. But sometimes you have to move between warehouses. At night, lighting conditions might not be ideal: You don’t have spotlights and illumination from all directions, and you have to deal with fog and all kinds of other environments. And that’s where we really shine. These uncontrollable environments have low light, or extremely bright light. There is motion, which leads to artifacts. We got a lot of really good customer interest from those industries. This is a paradigm change in imaging.”

The company is still at an early stage. It just released an evaluation kit that developers can use to experiment with new use cases.

Ubicept’s ultimate goal is to make perception systems work more like the human eye.

“People don’t realize that our retina actually is part of the brain. The retina is the photosensitive part in the back of the eye. It has nerve cells and is technically part of the brain. It is computing things at that layer, before sending it down the optic nerve,” Swedish explained. “The optic nerve then goes into the GPU and deep learning accelerator that is our visual system. Our technology isn’t inspired by that in a literal sense, but mathematically, we’re having to solve the same problem: We have to reduce that high-bandwidth information to send it down the optic nerve. We do that in two stages, via the FPGAs in our evaluation kit.”

The difference between a 60FPS action cam and Ubicept’s camera setup is dramatic. 

From the raw data, the company can reconstruct viewable images, but the team suggests that your brain isn’t literally storing the whole field of vision in front of you just to watch television or read a book. If you’re focusing on something, the rest of the world kind of falls away; your brain discards the information and lets you settle into the thing that’s important.

In time, that’s what Ubicept hopes to be able to do. In different words: It doesn’t matter if the sun is setting or whether the car on your right is blue or orange. If there’s a pedestrian stepping out into the road, your car needs to know that right away and hit the brakes.

“[What we keep and what we discard] is super application-dependent. If you want to have a more general purpose solution, then really, it’s about frame reconstruction,” Bauer explained, waving his hands at the demonstration videos we embedded above. For more specialist use cases, however, the tech can get both smarter and faster than current perception systems.

The first photograph was taken in 1827. In the 196 years since then, photography has focused heavily on the frame and everything that happens in a frame. Ubicept may not ship this in time for the 200th anniversary of the invention of photography, but we may not have to wait much longer before the tech makes it to our pockets.

“In five to 10 years, I think this will be on smartphones,” Bauer concluded, hinting at the vast market the company has ahead of it and the true revolution in photography that might be coming sooner rather than later.