
Depth map prediction from a single image

This project is a reimplementation of the paper “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network” by Eigen et al., 2014 (arXiv:1406.2283). The work explores how deep convolutional neural networks can infer dense depth maps from single RGB images, an inherently ill-posed problem due to the lack of explicit depth information.






Overview

The goal of this project is to reproduce the key results and methodology from the paper, focusing on the multi-scale CNN architecture and the scale-invariant loss proposed by the authors. The model estimates the depth for each pixel, capturing both global scene layout and local geometric details through a two-stage learning process.


Model Architecture

1. Coarse (Global-Level) Network

The coarse network captures the global scene structure from the input image. Its convolutional layers are initialized with ImageNet-pretrained weights for better feature extraction.

Layers:

  • Conv (11×11, stride 4, 96 filters) → ReLU
  • MaxPool (2×2)
  • Conv (5×5, 256 filters) → ReLU
  • MaxPool (2×2)
  • Conv (3×3, 384 filters) → ReLU
  • Conv (3×3, 384 filters) → ReLU
  • Conv (3×3, 256 filters) → ReLU
  • Fully Connected (4096 units) → ReLU
  • Dropout
  • Fully Connected (linear, one unit per pixel of the coarse depth map; 74×55 for NYU Depth v2)

2. Fine (Local-Level) Network

The fine network refines the output of the coarse model by incorporating local image details, such as object and wall edges, to produce sharper, edge-aligned predictions.

Layers:

  • Conv (9×9, stride 2, 63 filters) → ReLU
  • MaxPool (2×2)
  • Concatenate with coarse network output
  • Conv (5×5, 64 filters) → ReLU
  • Conv (5×5, 1 filter) → Linear

Training Strategy: The coarse network is trained first; the fine network is trained afterward, with the fixed coarse outputs fed in as an extra input channel. A sketch of both networks follows.
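To make the two-stage design concrete, here is a minimal PyTorch sketch of both networks. This is illustrative rather than the repository's actual code: the paddings, the 74×55 coarse output size (NYU-style), the use of `nn.LazyLinear` to avoid hard-coding the flattened feature size, and the bilinear resize of the coarse map before concatenation are all assumptions, and the ImageNet initialization is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseNet(nn.Module):
    """Global-level network: AlexNet-style features plus fully connected
    layers that regress a coarse depth map (74x55 here, an assumption)."""
    def __init__(self, out_h=55, out_w=74):
        super().__init__()
        self.out_h, self.out_w = out_h, out_w
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, 11, stride=4), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(96, 256, 5, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(2),
            nn.Conv2d(256, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(inplace=True),  # LazyLinear infers the flattened size
            nn.Dropout(0.5),
            nn.Linear(4096, out_h * out_w),  # linear output: one unit per coarse pixel
        )

    def forward(self, x):
        return self.fc(self.features(x)).view(-1, 1, self.out_h, self.out_w)

class FineNet(nn.Module):
    """Local-level network: refines the coarse prediction with image details."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 63, 9, stride=2, padding=4)
        self.pool = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(64, 64, 5, padding=2)  # 63 feature maps + 1 coarse channel
        self.conv3 = nn.Conv2d(64, 1, 5, padding=2)   # linear output

    def forward(self, x, coarse_depth):
        f = self.pool(F.relu(self.conv1(x)))
        # resize the coarse map to the fine feature resolution before concatenating
        c = F.interpolate(coarse_depth, size=f.shape[-2:], mode='bilinear',
                          align_corners=False)
        f = F.relu(self.conv2(torch.cat([f, c], dim=1)))
        return self.conv3(f)
```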


Loss Function

A key challenge in monocular depth estimation is the ambiguity of absolute scale. Since the model receives only a single RGB image, it has no direct measurement of depth, so infinitely many depth maps with different global scales can explain the same image. In other words, the network can predict depth maps with correct relative structure but arbitrary global scale, as illustrated in the example below.

(figure: example of the same scene explained by depth maps with different global scales)

Training Loss and Evaluation Metric

Evaluation metric:
$\frac{1}{n}\sum_{i=1}^n d_i^2 - \frac{1}{n^2}(\sum_{i=1}^n d_i)^2$


Training Loss:
$\frac{1}{n}\sum_{i=1}^n d_i^2 - \frac{\lambda}{n^2}(\sum_{i=1}^n d_i)^2$

where:

  • $y$ is the predicted depth map

  • $y^*$ is the ground truth depth map

  • $d_i = \log(y_i) - \log(y_i^*)$

  • $n$ is the number of pixels in each depth map (prediction and ground truth have the same size)

  • $\lambda \in [0, 1]$ is a hyperparameter (the paper trains with $\lambda = 0.5$)

Note: pixels with missing or infinite ground-truth depth should be excluded from both the loss and the evaluation metric.

Intuitively, the loss penalizes errors in relative depth: it encourages differences between pairs of pixels in the predicted map to match the corresponding differences in the ground truth, regardless of global scale.


(figure: green and red pixel pairs marked on the predicted and ground-truth depth maps)

In other words, it tries to make the difference between the green pixels the same as the difference between the red pixels.
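A minimal PyTorch sketch of this loss follows. It assumes the network regresses log depth (a common reimplementation choice, not stated in the README) and masks out invalid ground-truth pixels as the note above suggests; setting `lam=1` recovers the scale-invariant evaluation metric.

```python
import torch

def scale_invariant_loss(pred_log, target, lam=0.5):
    """Scale-invariant loss on log depths.

    pred_log: network output, interpreted as log depth (an assumption).
    target:   ground-truth depth; non-finite or non-positive values
              (e.g. missing sensor returns) are masked out.
    lam=1 turns the training loss into the evaluation metric.
    """
    mask = torch.isfinite(target) & (target > 0)
    d = pred_log[mask] - torch.log(target[mask])   # d_i = log y_i - log y_i*
    n = d.numel()
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2
```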

Data Augmentation

To improve generalization, the following paired augmentations are applied to each image and its depth map during training (a sketch follows the list):

  • Scaling: random factor $s \in [1, 1.5]$; depths are divided by $s$
  • Rotation: random rotation $r \in [-5^{\circ}, 5^{\circ}]$
  • Translation: random cropping to the fixed target size
  • Color Adjustment: global per-channel RGB scaling $c \in [0.8, 1.2]^3$
  • Flipping: horizontal flip with probability 0.5
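Below is a hedged sketch of how these paired transforms might be implemented with `torchvision`. The tensor layout (float CHW image in [0, 1] plus a 1×H×W depth map) and the `out_size` default are assumptions, not the repository's code.

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(rgb, depth, out_size=(228, 304)):
    """Apply the paired augmentations above; the depth map is transformed
    together with the image so the geometry stays consistent."""
    # scaling: zoom both maps by s >= 1, divide depths by s
    s = random.uniform(1.0, 1.5)
    h, w = rgb.shape[-2:]
    rgb = TF.resize(rgb, [int(h * s), int(w * s)], antialias=True)
    depth = TF.resize(depth, [int(h * s), int(w * s)], antialias=True) / s
    # rotation: same random angle for image and depth
    r = random.uniform(-5.0, 5.0)
    rgb, depth = TF.rotate(rgb, r), TF.rotate(depth, r)
    # translation: random crop to the fixed target size
    top = random.randint(0, rgb.shape[-2] - out_size[0])
    left = random.randint(0, rgb.shape[-1] - out_size[1])
    rgb = TF.crop(rgb, top, left, *out_size)
    depth = TF.crop(depth, top, left, *out_size)
    # color: global per-channel multiplicative scaling in [0.8, 1.2]
    rgb = (rgb * torch.empty(3, 1, 1).uniform_(0.8, 1.2)).clamp(0, 1)
    # flipping: horizontal flip with probability 0.5
    if random.random() < 0.5:
        rgb, depth = TF.hflip(rgb), TF.hflip(depth)
    return rgb, depth
```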

Implementation Details

  • Framework: PyTorch
  • Pretraining: Coarse CNN initialized with ImageNet weights
  • Training Order: Coarse network → Fine network
  • Loss: Scale-invariant depth loss
  • Dataset: NYU Depth v2 (Indoor Scenes)
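Putting the pieces together, here is a minimal sketch of the two-stage training order, using the `CoarseNet`, `FineNet`, and `scale_invariant_loss` sketches above; `loader` is a hypothetical dataloader yielding augmented `(rgb, depth)` batches, and the optimizer settings are placeholders (the paper uses per-layer learning rates).

```python
import torch
import torch.nn.functional as F

coarse, fine = CoarseNet(), FineNet()

# Stage 1: train the coarse (global) network alone.
opt = torch.optim.SGD(coarse.parameters(), lr=1e-3, momentum=0.9)
for rgb, depth in loader:  # hypothetical (rgb, depth) batches
    pred = coarse(rgb)
    target = F.interpolate(depth, size=pred.shape[-2:])  # match coarse resolution
    loss = scale_invariant_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: freeze the coarse weights and train the fine (local) network
# on top of the fixed coarse predictions.
opt = torch.optim.SGD(fine.parameters(), lr=1e-3, momentum=0.9)
for rgb, depth in loader:
    with torch.no_grad():
        coarse_pred = coarse(rgb)
    pred = fine(rgb, coarse_pred)
    target = F.interpolate(depth, size=pred.shape[-2:])
    loss = scale_invariant_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
```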

Results

The reproduced model demonstrates the effectiveness of multi-scale feature learning in monocular depth estimation. The coarse network provides a globally consistent depth layout, while the fine network refines edges and surface details.

(Visual results and quantitative metrics can be added here once available.)


References

  • D. Eigen, C. Puhrsch, and R. Fergus, “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network,” NeurIPS 2014 (arXiv:1406.2283).
  • N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, NYU Depth Dataset V2, ECCV 2012.
