
Depth map prediction from a single image

This project is a reimplementation of the paper “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network” by Eigen et al., 2014 (arXiv:1406.2283). The work explores how deep convolutional neural networks can infer dense depth maps from single RGB images, an inherently ill-posed problem due to the lack of explicit depth information.






Overview

The goal of this project is to reproduce the key results and methodology from the paper, focusing on the multi-scale CNN architecture and the scale-invariant loss proposed by the authors. The model estimates the depth for each pixel, capturing both global scene layout and local geometric details through a two-stage learning process.


Model Architecture

1. Coarse (Global-Level) Network

The coarse network captures the global scene structure from the input image. Its convolutional layers are initialized with ImageNet-pretrained weights for better feature extraction.

Layers:

  • Conv (11×11, stride 4, 96 filters) → ReLU
  • MaxPool (2×2)
  • Conv (5×5, 256 filters) → ReLU
  • MaxPool (2×2)
  • Conv (3×3, 384 filters) → ReLU
  • Conv (3×3, 384 filters) → ReLU
  • Conv (3×3, 256 filters) → ReLU
  • Fully Connected (4096 units) → ReLU
  • Dropout
  • Fully Connected (linear, one unit per pixel of the coarse depth map; 74×55 for NYU Depth v2)

2. Fine (Local-Level) Network

The fine network refines the output of the coarse model by incorporating local image details, such as object and wall edges, to produce sharper, edge-aligned predictions.

Layers:

  • Conv (9×9, stride 2, 63 filters) → ReLU
  • MaxPool (2×2)
  • Concatenate with coarse network output
  • Conv (5×5, 64 filters) → ReLU
  • Conv (5×5, 1 filter) → Linear

Training Strategy: The coarse network is trained first; the fine network is trained afterward, with the fixed coarse outputs fed in as an extra input channel. A sketch of both networks follows.
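To make the two-stage design concrete, here is a minimal PyTorch sketch of both networks. This is illustrative rather than the repository's actual code: the paddings, the 74×55 coarse output size (NYU-style), the use of `nn.LazyLinear` to avoid hard-coding the flattened feature size, and the bilinear resize of the coarse map before concatenation are all assumptions, and the ImageNet initialization is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseNet(nn.Module):
    """Global-level network: AlexNet-style features plus fully connected
    layers that regress a coarse depth map (74x55 here, an assumption)."""
    def __init__(self, out_h=55, out_w=74):
        super().__init__()
        self.out_h, self.out_w = out_h, out_w
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True),  # LazyLinear infers the flattened size
            nn.Dropout(0.5),
            nn.Linear(4096, out_h * out_w),  # linear output: one unit per coarse pixel
        )

    def forward(self, x):
        return self.fc(self.features(x)).view(-1, 1, self.out_h, self.out_w)

class FineNet(nn.Module):
    """Local-level network: refines the coarse prediction with image details."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 63, 9, stride=2, padding=4)
        self.pool = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(64, 64, 5, padding=2)  # 63 feature maps + 1 coarse channel
        self.conv3 = nn.Conv2d(64, 1, 5, padding=2)   # linear output

    def forward(self, x, coarse_depth):
        f = self.pool(F.relu(self.conv1(x)))
        # resize the coarse map to the fine feature resolution before concatenating
        c = F.interpolate(coarse_depth, size=f.shape[-2:], mode='bilinear',
                          align_corners=False)
        f = F.relu(self.conv2(torch.cat([f, c], dim=1)))
        return self.conv3(f)
```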


Loss Function

A key challenge in monocular depth estimation is the ambiguity of absolute scale. Since the model receives only a single RGB image, it has no direct measurement of depth, so infinitely many depth maps with different global scales can explain the same image. In other words, the network can predict depth maps with correct relative structure but arbitrary global scale, as illustrated in the example below.

(figure: example of the same scene explained by depth maps with different global scales)

Training Loss and Evaluation Metric

Evaluation metric:
$\frac{1}{n}\sum_{i=1}^n d_i^2 - \frac{1}{n^2}(\sum_{i=1}^n d_i)^2$


Training Loss:
$\frac{1}{n}\sum_{i=1}^n d_i^2 - \frac{\lambda}{n^2}(\sum_{i=1}^n d_i)^2$

where:

  • $y$ is the predicted depth map

  • $y^*$ is the ground truth depth map

  • $d_i = \log(y_i) - \log(y_i^*)$

  • $n$ is the number of pixels in each depth map (prediction and ground truth have the same size)

  • $\lambda \in [0, 1]$ is a hyperparameter (the paper trains with $\lambda = 0.5$)

Note: pixels with missing or infinite ground-truth depth should be excluded from both the loss and the evaluation metric.

Intuitively, the loss penalizes errors in relative depth: it encourages differences between pairs of pixels in the predicted map to match the corresponding differences in the ground truth, regardless of global scale.


(figure: green and red pixel pairs marked on the predicted and ground-truth depth maps)

In other words, it tries to make the difference between the green pixels the same as the difference between the red pixels.
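A minimal PyTorch sketch of this loss follows. It assumes the network regresses log depth (a common reimplementation choice, not stated in the README) and masks out invalid ground-truth pixels as the note above suggests; setting `lam=1` recovers the scale-invariant evaluation metric.

```python
import torch

def scale_invariant_loss(pred_log, target, lam=0.5):
    """Scale-invariant loss on log depths.

    pred_log: network output, interpreted as log depth (an assumption).
    target:   ground-truth depth; non-finite or non-positive values
              (e.g. missing sensor returns) are masked out.
    lam=1 turns the training loss into the evaluation metric.
    """
    mask = torch.isfinite(target) & (target > 0)
    d = pred_log[mask] - torch.log(target[mask])   # d_i = log y_i - log y_i*
    n = d.numel()
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2
```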

Data Augmentation

To improve generalization, the following paired augmentations are applied to each image and its depth map during training (a sketch follows the list):

  • Scaling: random factor $s \in [1, 1.5]$; depths are divided by $s$
  • Rotation: random rotation $r \in [-5^{\circ}, 5^{\circ}]$
  • Translation: random cropping to the fixed target size
  • Color Adjustment: global per-channel RGB scaling $c \in [0.8, 1.2]^3$
  • Flipping: horizontal flip with probability 0.5
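Below is a hedged sketch of how these paired transforms might be implemented with `torchvision`. The tensor layout (float CHW image in [0, 1] plus a 1×H×W depth map) and the `out_size` default are assumptions, not the repository's code.

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(rgb, depth, out_size=(228, 304)):
    """Apply the paired augmentations above; the depth map is transformed
    together with the image so the geometry stays consistent."""
    # scaling: zoom both maps by s >= 1, divide depths by s
    s = random.uniform(1.0, 1.5)
    h, w = rgb.shape[-2:]
    rgb = TF.resize(rgb, [int(h * s), int(w * s)], antialias=True)
    depth = TF.resize(depth, [int(h * s), int(w * s)], antialias=True) / s
    # rotation: same random angle for image and depth
    r = random.uniform(-5.0, 5.0)
    rgb, depth = TF.rotate(rgb, r), TF.rotate(depth, r)
    # translation: random crop to the fixed target size
    top = random.randint(0, rgb.shape[-2] - out_size[0])
    left = random.randint(0, rgb.shape[-1] - out_size[1])
    rgb = TF.crop(rgb, top, left, *out_size)
    depth = TF.crop(depth, top, left, *out_size)
    # color: global per-channel multiplicative scaling in [0.8, 1.2]
    rgb = (rgb * torch.empty(3, 1, 1).uniform_(0.8, 1.2)).clamp(0, 1)
    # flipping: horizontal flip with probability 0.5
    if random.random() < 0.5:
        rgb, depth = TF.hflip(rgb), TF.hflip(depth)
    return rgb, depth
```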

Implementation Details

  • Framework: PyTorch
  • Pretraining: Coarse CNN initialized with ImageNet weights
  • Training Order: Coarse network → Fine network
  • Loss: Scale-invariant depth loss
  • Dataset: NYU Depth v2 (Indoor Scenes)
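Putting the pieces together, here is a minimal sketch of the two-stage training order, using the `CoarseNet`, `FineNet`, and `scale_invariant_loss` sketches above; `loader` is a hypothetical dataloader yielding augmented `(rgb, depth)` batches, and the optimizer settings are placeholders (the paper uses per-layer learning rates).

```python
import torch
import torch.nn.functional as F

coarse, fine = CoarseNet(), FineNet()

# Stage 1: train the coarse (global) network alone.
opt = torch.optim.SGD(coarse.parameters(), lr=1e-3, momentum=0.9)
for rgb, depth in loader:  # hypothetical (rgb, depth) batches
    pred = coarse(rgb)
    target = F.interpolate(depth, size=pred.shape[-2:])  # match coarse resolution
    loss = scale_invariant_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the coarse weights and train the fine (local) network
# on top of the fixed coarse predictions.
opt = torch.optim.SGD(fine.parameters(), lr=1e-3, momentum=0.9)
for rgb, depth in loader:
    with torch.no_grad():
        coarse_pred = coarse(rgb)
    pred = fine(rgb, coarse_pred)
    target = F.interpolate(depth, size=pred.shape[-2:])
    loss = scale_invariant_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
```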

Results

The reproduced model demonstrates the effectiveness of multi-scale feature learning in monocular depth estimation. The coarse network provides a globally consistent depth layout, while the fine network refines edges and surface details.

(Visual results and quantitative metrics can be added here once available.)


References

  • D. Eigen, C. Puhrsch, and R. Fergus, “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network,” NeurIPS 2014 (arXiv:1406.2283).
  • N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, NYU Depth Dataset V2, ECCV 2012.
