For a robot to navigate the real world, it needs to perceive the 3D structure of the scene while in motion and continuously estimate the depth of its surroundings. Humans do this effortlessly with stereoscopic vision — the brain’s ability to register a sense of 3D shape and form from visual inputs.
The brain uses the disparity between the views observed by each eye to figure out how far away something is. This requires knowing how each physical point in the scene looks from different perspectives. For robots, this is difficult to determine from raw pixel data obtained with cameras.
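To make the role of disparity concrete, here is a minimal sketch of the textbook relation between disparity and depth for two side-by-side pinhole cameras, Z = f · B / d. It assumes an idealized, rectified stereo pair and a point already matched between both views; the function name and the numbers are hypothetical, and this is the classical formula rather than the method described in this article.

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth Z = f * B / d for an idealized, rectified pinhole stereo pair.

    focal_px     -- focal length in pixels (assumed identical for both cameras)
    baseline_m   -- distance between the two camera centers, in meters
    disparity_px -- horizontal shift of the same point between the two views
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_px * baseline_m / disparity_px


# Hypothetical values, chosen only for illustration.
near = depth_from_disparity(focal_px=500.0, baseline_m=0.1, disparity_px=25.0)
far = depth_from_disparity(focal_px=500.0, baseline_m=0.1, disparity_px=5.0)
print(f"large disparity -> {near:.1f} m, small disparity -> {far:.1f} m")
```

A large shift between the two views means the point is close, and a small shift means it is far away. The hard part in practice is finding which pixel in one view corresponds to which pixel in the other.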
To address this problem of depth estimation in robots, researchers Suman Ghosh and Guillermo Gallego from the Technical University of Berlin (TU Berlin) in Germany fuse data from multiple moving cameras in order to generate a 3D map.
The cameras used for this are bio-inspired sensors called “event cameras”, which are designed to leverage motion information. Often dubbed “silicon retinas”, event cameras mimic the human visual system: like the cells in our retina, each pixel in an event camera produces precisely timed, asynchronous outputs called events, rather than the sequence of image frames that traditional cameras generate.
“Thus, the cameras naturally respond to the moving parts of the scene and to changes in illumination,” said Gallego, head of the Robotic Interactive Perception laboratory at TU Berlin. “This endows event cameras with certain advantages over their traditional, frame-based counterparts, such as a very high dynamic range, resolution in the order of microseconds, low power consumption, and data redundancy suppression.”
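The behavior of a single event-camera pixel can be sketched with the commonly used idealized model, in which a pixel fires whenever its log-brightness has changed by a fixed contrast threshold since its last event. The code below is a simplified simulation under that assumption: it works on sampled brightness values, whereas a real sensor is asynchronous, and the function name, threshold, and brightness values are hypothetical.

```python
import math


def events_for_pixel(samples, contrast_threshold=0.2):
    """Simulate one pixel. `samples` is a list of (timestamp_s, brightness).

    Returns a list of (timestamp_s, polarity) events, where polarity is +1
    for a brightness increase and -1 for a decrease.
    """
    events = []
    _, first_brightness = samples[0]
    reference = math.log(first_brightness)  # log-brightness at the last event
    for timestamp, brightness in samples[1:]:
        delta = math.log(brightness) - reference
        # Emit one event per threshold crossing since the last event.
        while abs(delta) >= contrast_threshold:
            polarity = 1 if delta > 0 else -1
            events.append((timestamp, polarity))
            reference += polarity * contrast_threshold
            delta = math.log(brightness) - reference
    return events


# A pixel looking at an unchanging scene versus one that is getting brighter.
static_pixel = [(0.000, 100.0), (0.001, 100.0), (0.002, 100.0)]
changing_pixel = [(0.000, 100.0), (0.001, 130.0), (0.002, 170.0)]

static_events = events_for_pixel(static_pixel)
changing_events = events_for_pixel(changing_pixel)
print(static_events)
print(changing_events)
```

The static pixel stays silent while the changing pixel emits timestamped events, which illustrates the data redundancy suppression mentioned in the quote above.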
In a recently published study in the journal Advanced Intelligent Systems, the researchers reported a means of bringing together spike-based data from multiple moving cameras to generate a coherent 3D map. “Every time the pixel of an event camera generates data, we can use the known camera motion to trace the path of the light ray that triggered it,” said Gallego. “Since the events are generated from the apparent motion of the same 3D structure in the scene, the points where the rays meet give us cues about the location of the 3D points in space.”
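The ray-tracing idea in the quote can be illustrated with a toy example. The sketch below assumes a flat 2D world (lateral position x, depth z), a camera with known positions sliding along x, and noise-free events from a single scene point. Each event casts a ray through a set of candidate depths and votes for the grid cells it crosses; the cell with the most votes is where the rays meet. The voting grid, function name, and all numbers are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

CELL_SIZE = 0.1  # lateral size of a grid cell in meters (hypothetical)


def cast_rays(events, depths):
    """Vote for the cells crossed by each event's viewing ray.

    events -- list of (camera_x, u): camera position and the event's
              normalized image coordinate (focal length of 1 assumed)
    depths -- candidate depth values to test along each ray
    """
    votes = Counter()
    for camera_x, u in events:
        for depth_index, z in enumerate(depths):
            x = camera_x + u * z  # point on the ray at depth z
            votes[(round(x / CELL_SIZE), depth_index)] += 1
    return votes


# Synthetic scene: one point observed from five known camera positions.
point_x, point_z = 0.5, 2.0
camera_positions = [0.0, 0.2, 0.4, 0.6, 0.8]
events = [(cx, (point_x - cx) / point_z) for cx in camera_positions]

depths = [1.0, 1.5, 2.0, 2.5, 3.0]
votes = cast_rays(events, depths)
(best_x_index, best_depth_index), count = votes.most_common(1)[0]
estimate = (best_x_index * CELL_SIZE, depths[best_depth_index])
print(f"rays meet near x={estimate[0]:.1f} m, z={estimate[1]:.1f} m ({count} votes)")
```

All five rays pass through the same cell only at the true depth; at every other candidate depth they spread out over neighboring cells, so the vote count peaks at the location of the scene point.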
With this work, the scientists have extended their idea of “ray fusion” to general, multi-camera setups. “The concept of casting rays to estimate 3D structure has been previously proposed for a single camera, but we wanted to extend it to multiple cameras, like in a stereo setup,” explained Gallego.
“The challenge was figuring out how to efficiently fuse the rays cast from two different cameras in space,” he continued. “We investigated many mathematical functions that could be used for fusing such ray densities. Fusion functions that encourage the ray intersections to be consistent across all cameras produced the best results.”
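To see why a fusion function that rewards agreement between cameras helps, consider two small grids of ray counts, one per camera. The sketch below compares an arithmetic mean with a harmonic mean. These two functions are illustrative choices made here, since the article does not name the functions that were studied, and the grids hold made-up values.

```python
def fuse(grid_a, grid_b, func):
    """Combine two equally sized ray-density grids cell by cell."""
    return [
        [func(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(grid_a, grid_b)
    ]


def arithmetic_mean(a, b):
    return (a + b) / 2


def harmonic_mean(a, b):
    # Drops to zero as soon as either camera has no rays in the cell.
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


# Hypothetical ray counts (rows: depth levels, columns: lateral positions).
left_camera = [[0, 9, 0],
               [0, 8, 0]]
right_camera = [[0, 0, 0],
                [0, 7, 1]]

fused_arithmetic = fuse(left_camera, right_camera, arithmetic_mean)
fused_harmonic = fuse(left_camera, right_camera, harmonic_mean)
print(fused_arithmetic)
print(fused_harmonic)
```

In this toy case the strong peak that only the left camera sees survives the arithmetic mean but vanishes under the harmonic mean, while the cell supported by both cameras stays high. That is the sense in which such a function encourages ray intersections to be consistent across cameras.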
Helping robots navigate the real world
Ghosh and Gallego tested their stereo fusion algorithm in simulations and on several real-world scenes acquired with handheld cameras, cameras on drones and cars, and cameras mounted on the heads of people walking, running, skating, and biking. With a comprehensive experimental analysis on diverse indoor and outdoor scenes, they showed that their stereo fusion method outperforms state-of-the-art algorithms, and that the difference is most noticeable at higher resolution.
The advantages of fusing data from multiple cameras are particularly clear in forward-moving scenes, such as autonomous driving scenarios, where there is very little change in viewing perspective from a single camera. In such cases, the baseline from the additional camera is crucial in producing cleaner depth maps.
“The key is to combine data from two event cameras at an early stage,” said Ph.D. student Suman Ghosh. “A naive approach would be to combine the final 3D point clouds generated from each camera separately. However, that generates noise and duplicate 3D points in the final output.” In other words, early fusion generates more accurate 3D maps that do not require further post-processing.
Beyond fusing data across multiple cameras, the researchers further extended this idea to fuse data from different time intervals of the same event camera to estimate depth. They showed that such fusion across time can make depth estimation more accurate with less data.
With their approach, they hope to introduce a new way of thinking about the problem beyond pairwise stereo matching. According to Ghosh: “Previous works in event-based stereo depth estimation rely on the precise timing of the data to match them across two cameras. We show that this explicit stereo matching is not needed. Most surprisingly, the data generated from different time intervals can be used to estimate depth maps of similar high quality.”
Minimizing data requirements
Through experiments, the researchers showed that this fusion-based approach scales efficiently to systems with two or more cameras. This is a significant advantage: extra cameras can be added for robustness without a heavy cost in computational resources.
In the next phase of research, Gallego and Ghosh plan to use the 3D map obtained from this fusion method to find out where and how the camera is moving within the scene. With the complementary 3D mapping and localization systems working simultaneously, a robot can autonomously navigate unknown environments in remote and challenging conditions.
“In the day and age of power-hungry, large-scale artificial intelligence models that are trained on massive amounts of data, it is important to consider computation and environmental costs. We need efficient methods that can run in real-time on mobile robots,” said Ghosh.
With their high efficiency and robustness in fast-moving environments and challenging lighting conditions, event cameras indeed show great promise in this area.
Reference: Suman Ghosh and Guillermo Gallego, Multi-Event-Camera Depth Estimation and Outlier Rejection by Refocused Events Fusion, Advanced Intelligent Systems (2022). DOI:10.1002/aisy.202200221
Feature image: A robot using stereoscopic vision at the Science of Intelligence Excellence Cluster. Credit: Guillermo Gallego