MELON
NeRF with Unposed Images Using Equivalence Class Estimation

1: Stanford University - 2: Google
*Indicates Equal Contribution

Convergence of MELON on the "lego" scene from unposed images.
Left: novel views throughout optimization.
Right: predicted view directions (projected azimuth/elevation).

Abstract

Neural radiance fields enable novel-view synthesis and scene reconstruction with photorealistic quality from a few images, but require known and accurate camera poses. Conventional pose estimation algorithms fail on smooth or self-similar scenes, while methods performing inverse rendering from unposed views require a rough initialization of the camera orientations. The main difficulty of pose estimation lies in real-life objects being almost invariant under certain transformations, which makes the photometric distance between rendered views non-convex with respect to the camera parameters. Using an equivalence relation that matches the distribution of local minima in camera space, we reduce this space to its quotient set, in which pose estimation becomes a more convex problem. Using a neural network to regularize pose estimation, we demonstrate that our method, MELON, can reconstruct a neural radiance field from unposed images with state-of-the-art accuracy while requiring ten times fewer views than adversarial approaches.
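To make the idea concrete with a hypothetical example (ours, not the paper's exact formulation): for a scene that is nearly invariant under M equally spaced rotations about its vertical axis, two azimuths produce near-identical local minima whenever they differ by a multiple of 2π/M,

\phi \sim \phi' \iff \phi - \phi' = \frac{2\pi k}{M} \quad \text{for some } k \in \mathbb{Z},

and pose estimation can then be performed over the quotient of azimuth space by this relation.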

Methods

teaser
MELON simultaneously trains a CNN encoder that maps images to camera poses in SO(3) and a neural radiance field of the scene. It requires no CNN pre-training and can infer camera poses in object-centered configurations entirely ab initio, without any pose initialization whatsoever. To cope with the presence of local minima in the low-dimensional latent space SO(3), we introduce a novel modulo loss that replicates the encoder's pose output and backpropagates only through the view with the lowest photometric error.
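Below is a minimal sketch of this selection mechanism, assuming poses parameterized by azimuth/elevation angles and a differentiable renderer; modulo_loss, render_fn, and num_replicas are illustrative names, not the paper's code:

import torch

def modulo_loss(render_fn, target, azimuth, elevation, num_replicas=8):
    """Sketch of a modulo loss: replicate the predicted pose at equally
    spaced azimuthal offsets, render each candidate, and backpropagate
    only through the candidate with the lowest photometric error."""
    losses = []
    for k in range(num_replicas):
        # Candidate pose: azimuth shifted by 2*pi*k / num_replicas.
        azimuth_k = azimuth + 2.0 * torch.pi * k / num_replicas
        rendering = render_fn(azimuth_k, elevation)
        losses.append(((rendering - target) ** 2).mean())
    losses = torch.stack(losses)
    # argmin is taken on detached values, so gradients flow only
    # through the best-matching view.
    return losses[torch.argmin(losses.detach())]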

Results

NeRF from Unposed Images

results

MELON performs state-of-the-art novel-view synthesis on synthetic datasets of unposed images.

Reconstruction from Small Datasets

results

Unlike adversarial approaches, our method works on datasets containing few images. "GT+NeRF" trains a NeRF with ground-truth camera poses.

Real Datasets

results

Ab initio reconstruction on real datasets. All methods use ground-truth values for the object-to-camera distances and in-plane camera translations. SAMURAI* uses a fixed initialization of the poses at the north pole.

Reconstruction from Noisy Datasets

results

MELON is robust to high levels of noise.

One-dimensional Toy Problem

one dimensional toy problem

We are given a set of crops from an unknown 1D function. The goal is to recover the function and the crop angles from the crops alone.
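This setup can be reproduced in a few lines of NumPy. The sketch below (function name, sizes, and the choice of a random smooth periodic signal are our own, for illustration only) samples crops of an unknown periodic 1D signal at unknown angular offsets:

import numpy as np

def make_1d_toy_dataset(num_obs=64, crop_size=32, grid_size=256, seed=0):
    """Illustrative 1D toy setup: crops of a periodic signal at
    unknown angles (not the paper's exact code)."""
    rng = np.random.default_rng(seed)
    # Ground-truth signal: a random smooth periodic function on [0, 2*pi).
    coeffs = rng.normal(size=(8, 2))
    t = np.linspace(0.0, 2.0 * np.pi, grid_size, endpoint=False)
    signal = sum(a * np.cos((k + 1) * t) + b * np.sin((k + 1) * t)
                 for k, (a, b) in enumerate(coeffs))
    # Each observation is a contiguous crop starting at a random angle;
    # indices wrap around because the domain is circular.
    starts = rng.integers(0, grid_size, size=num_obs)
    crops = np.stack([np.take(signal, (s + np.arange(crop_size)) % grid_size)
                      for s in starts])
    angles = starts * 2.0 * np.pi / grid_size
    return signal, angles, crops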

one dimensional toy results qualitative

Example of ground truth function and 1D reconstructions.

one dimensional toy results quantitative

Angular and reconstruction errors for the 1D datasets. We report the mean, minimum, and maximum errors over 10 experiments. The explicit representation gets stuck at an early stage, the modulo loss helps the model avoid local minima, and the encoder regularizes the angular predictions.

RGB-MELON Dataset

We build a pathological dataset containing centered views of 3D spheres onto which we map almost-symmetric textures. We use spherical harmonics to generate red-green textures that are invariant under translation along the azimuthal direction. To break the perfect symmetry of these scenes, we add three red/green/blue squares along the azimuthal direction. This dataset serves as a minimalist but challenging benchmark for pose estimation and inverse rendering.
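A texture of this kind can be sketched as follows (all names and parameters here are illustrative, not the actual dataset code): zonal spherical harmonics (order m = 0) depend only on the polar angle, so the red-green base is invariant under azimuthal rotation, and three colored squares then break the symmetry.

import numpy as np
from scipy.special import sph_harm

def rgb_melon_texture(height=256, width=512, max_degree=4, seed=0):
    """Sketch of an RGB-MELON-style equirectangular texture."""
    rng = np.random.default_rng(seed)
    theta = np.linspace(0.0, np.pi, height)[:, None]      # polar angle
    phi = np.linspace(0.0, 2.0 * np.pi, width)[None, :]   # azimuth
    # Random combination of zonal harmonics Y_l^0; with m = 0 the
    # value is independent of the azimuth phi.
    zonal = sum(rng.normal() * sph_harm(0, l, phi, theta).real
                for l in range(max_degree + 1))
    zonal = (zonal - zonal.min()) / (np.ptp(zonal) + 1e-8)
    texture = np.zeros((height, width, 3))
    texture[..., 0] = zonal          # red channel
    texture[..., 1] = 1.0 - zonal    # green channel
    # Break the symmetry: three small red/green/blue squares placed
    # at distinct azimuths on the equator.
    s = height // 16
    for i, color in enumerate(np.eye(3)):
        c = (2 * i + 1) * width // 6
        texture[height // 2 - s:height // 2 + s, c - s:c + s] = color
    return texture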

Presentation Video


BibTeX

@article{levy2023melon,
  author    = {Levy, Axel and Matthews, Mark and Sela, Matan and Wetzstein, Gordon and Lagun, Dmitry},
  title     = {{MELON}: NeRF with Unposed Images Using Equivalence Class Estimation},
  journal   = {arXiv preprint},
  year      = {2023},
}