Is Attention All That NeRF Needs?

Mukund Varma T^1*, Peihao Wang^2*, Xuxi Chen², Tianlong Chen², Subhashini Venugopalan³, Zhangyang Wang²

¹Indian Institute of Technology Madras, ²University of Texas at Austin, ³Google Research

^* denotes equal contribution.

This repository is built based on IBRNet's offical repository

News! GNT is accepted at ICLR 2023 🎉. Our updated cross-scene trained checkpoint should generalize to complex scenes, and even achieve comparable results to SOTA per-scene optimized methods without further tuning!
News! Our work was presented by Prof. Atlas in his talk at the MIT Vision and Graphics Seminar on 10/17/22.

Introduction

We present Generalizable NeRF Transformer (GNT), a pure, unified transformer-based architecture that efficiently reconstructs Neural Radiance Fields (NeRFs) on the fly from source views. Unlike prior works on NeRF that optimize a per-scene implicit representation by inverting a handcrafted rendering equation, GNT achieves generalizable neural scene representation and rendering, by encapsulating two transformers-based stages. The first stage of GNT, called view transformer, leverages multi-view geometry as an inductive bias for attention-based scene representation, and predicts coordinate-aligned features by aggregating information from epipolar lines on the neighboring views. The second stage of GNT, named ray transformer, renders novel views by ray marching and directly decodes the sequence of sampled point features using the attention mechanism. Our experiments demonstrate that when optimized on a single scene, GNT can successfully reconstruct NeRF without explicit rendering formula, and even improve the PSNR by ~1.3 dB↑ on complex scenes due to the learnable ray renderer. When trained across various scenes, GNT consistently achieves the state-of-the-art performance when transferring to forward-facing LLFF dataset (LPIPS ~20%↓, SSIM ~25%↑) and synthetic blender dataset (LPIPS ~20%↓, SSIM ~4%↑). In addition, we show that depth and occlusion can be inferred from the learned attention maps, which implies that the pure attention mechanism is capable of learning a physically-grounded rendering process. All these results bring us one step closer to the tantalizing hope of utilizing transformers as the ``universal modeling tool'' even for graphics.

Installation

Clone this repository:

git clone https://github.com/MukundVarmaT/GNT.git
cd GNT/

The code is tested with python 3.8, cuda == 11.1, pytorch == 1.10.1. Additionally dependencies include:

torchvision
ConfigArgParse
imageio
matplotlib
numpy
opencv_contrib_python
Pillow
scipy
imageio-ffmpeg
lpips
scikit-image

Datasets

We reuse the training, evaluation datasets from IBRNet. All datasets must be downloaded to a directory data/ within the project folder and must follow the below organization.

├──data/
    ├──ibrnet_collected_1/
    ├──ibrnet_collected_2/
    ├──real_iconic_noface/
    ├──spaces_dataset/
    ├──RealEstate10K-subset/
    ├──google_scanned_objects/
    ├──nerf_synthetic/
    ├──nerf_llff_data/

We refer to IBRNet's repository to download and prepare data. For ease, we consolidate the instructions below:

mkdir data
cd data/

# IBRNet captures
gdown https://drive.google.com/uc?id=1rkzl3ecL3H0Xxf5WTyc2Swv30RIyr1R_
unzip ibrnet_collected.zip

# LLFF
gdown https://drive.google.com/uc?id=1ThgjloNt58ZdnEuiCeRf9tATJ-HI0b01
unzip real_iconic_noface.zip

## [IMPORTANT] remove scenes that appear in the test set
cd real_iconic_noface/
rm -rf data2_fernvlsb data2_hugetrike data2_trexsanta data3_orchid data5_leafscene data5_lotr data5_redflower
cd ../

# Spaces dataset
git clone https://github.com/augmentedperception/spaces_dataset

# RealEstate 10k
## make sure to install ffmpeg - sudo apt-get install ffmpeg
git clone https://github.com/qianqianwang68/RealEstate10K_Downloader
cd RealEstate10K_Downloader
python3 generate_dataset.py train
cd ../

# Google Scanned Objects
gdown https://drive.google.com/uc?id=1w1Cs0yztH6kE3JIz7mdggvPGCwIKkVi2
unzip google_scanned_objects_renderings.zip

# Blender dataset
gdown https://drive.google.com/uc?id=18JxhpWD-4ZmuFKLzKlAw-w5PpzZxXOcG
unzip nerf_synthetic.zip

# LLFF dataset (eval)
gdown https://drive.google.com/uc?id=16VnMcF1KJYxN9QId6TClMsZRahHNMW5g
unzip nerf_llff_data.zip

Usage

Training

# single scene
# python3 train.py --config <config> --train_scenes <scene> --eval_scenes <scene> --optional[other kwargs]. Example:
python3 train.py --config configs/gnt_blender.txt --train_scenes drums --eval_scenes drums
python3 train.py --config configs/gnt_llff.txt --train_scenes orchids --eval_scenes orchids

# cross scene
# python3 train.py --config <config> --optional[other kwargs]. Example:
python3 train.py --config configs/gnt_full.txt

To decode coarse-fine outputs set --N_importance > 0, and with a separate fine network use --single_net = False

Pre-trained Models

Dataset	Scene	Download
LLFF	fern	ckpt	renders
	flower	ckpt	renders
	fortress	ckpt	renders
	horns	ckpt	renders
	leaves	ckpt	renders
	orchids	ckpt	renders
	room	ckpt	renders
	trex	ckpt	renders
Synthetic	chair	ckpt	renders
	drums	ckpt	renders
	ficus	ckpt	renders
	hotdog	ckpt	renders
	lego	ckpt	renders
	materials	ckpt	renders
	mic	ckpt	renders
	ship	ckpt	renders
generalization	N.A.	ckpt	renders

To reuse pretrained models, download the required checkpoints and place in appropriate directory with name - gnt_<scene-name> (single scene) or gnt_<full> (generalization). Then proceed to evaluation / rendering. To facilitate future research, we also provide half resolution renderings of our method on several benchmark scenes. Incase there are issues with any of the above checkpoints, please feel free to open an issue.

Evaluation

# single scene
# python3 eval.py --config <config> --eval_scenes <scene> --expname <out-dir> --run_val --optional[other kwargs]. Example:
python3 eval.py --config configs/gnt_llff.txt --eval_scenes orchids --expname gnt_orchids --chunk_size 500 --run_val --N_samples 192
python3 eval.py --config configs/gnt_blender.txt --eval_scenes drums --expname gnt_drums --chunk_size 500 --run_val --N_samples 192

# cross scene
# python3 eval.py --config <config> --expname <out-dir> --run_val --optional[other kwargs]. Example:
python3 eval.py --config configs/gnt_full.txt --expname gnt_full --chunk_size 500 --run_val --N_samples 192

Rendering

To render videos of smooth camera paths for the real forward-facing scenes.

# python3 render.py --config <config> --eval_dataset llff_render --eval_scenes <scene> --expname <out-dir> --optional[other kwargs]. Example:
python3 render.py --config configs/gnt_llff.txt --eval_dataset llff_render --eval_scenes orchids --expname gnt_orchids --chunk_size 500 --N_samples 192

The code has been recently tidied up for release and could perhaps contain tiny bugs. Please feel free to open an issue.

Cite this work

If you find our work / code implementation useful for your own research, please cite our paper.

@inproceedings{
    t2023is,
    title={Is Attention All That Ne{RF} Needs?},
    author={Mukund Varma T and Peihao Wang and Xuxi Chen and Tianlong Chen and Subhashini Venugopalan and Zhangyang Wang},
    booktitle={The Eleventh International Conference on Learning Representations },
    year={2023},
    url={https://openreview.net/forum?id=xE-LtsE-xx}
}

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
configs		configs
docs		docs
gnt		gnt
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.py		config.py
eval.py		eval.py
render.py		render.py
train.py		train.py
utils.py		utils.py

License

VITA-Group/GNT

Folders and files

Latest commit

History

Repository files navigation

Is Attention All That NeRF Needs?

Introduction

Installation

Datasets

Usage

Training

Pre-trained Models

Evaluation

Rendering

Cite this work

About

Topics

Resources

License

Stars

Watchers

Forks

Languages