Computer Vision is a field of Artificial Intelligence that enables computers to represent the visual world. Deep Learning has revolutionized this field thanks to neural networks that learn from data how to make accurate predictions. Recent progress promises to make cars safer, expand mobility through automated vehicles, and eventually provide robotic assistance for people with disabilities and for our rapidly aging global population.
However, there is a catch. Beyond privacy and other ethical issues that must be carefully considered when designing machine learning systems, all state-of-the-art models in computer vision rely on millions of labels (or more!) to reach the high level of accuracy required for safety-critical applications in the real world. Manual labeling is expensive and time-consuming: it can take hours and cost tens of dollars per image. And, sometimes, it is impossible altogether.
This is the case for monocular depth estimation, where the goal is to predict, for each pixel of a single image, how far the corresponding scene element is from the camera.
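To make the task concrete, here is a minimal sketch of the input/output contract only: one RGB image goes in, and one distance value per pixel comes out. The function name `predict_depth`, the choice of meters as the unit, and the 80 m to 2 m range are all assumptions made for illustration. The body is a toy placeholder (rows lower in the image are treated as closer), not a learned model.

```python
import numpy as np


def predict_depth(image: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for a depth network (illustration only).

    Input:  RGB image of shape (H, W, 3).
    Output: depth map of shape (H, W), one distance (assumed meters) per pixel.
    """
    if image.ndim != 3 or image.shape[2] != 3:
        raise ValueError("expected an RGB image of shape (H, W, 3)")
    h, w, _ = image.shape
    # Toy prior, NOT a real model: the top row is far (80 m) and the
    # bottom row is near (2 m), as on a flat road seen from a car.
    per_row = np.linspace(80.0, 2.0, h, dtype=np.float32)
    # Copy each row's distance across all columns to get an (H, W) map.
    return np.repeat(per_row[:, None], w, axis=1)


image = np.zeros((4, 6, 3), dtype=np.uint8)  # dummy 4x6 RGB image
depth = predict_depth(image)
print(depth.shape)  # one depth value per pixel
```

A real system would replace the toy prior with a neural network, but the shape of the problem stays the same: every pixel needs its own distance, which is exactly the kind of label a human annotator cannot provide by hand.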