NLVR

Natural Language for Visual Reasoning

This page is no longer being updated.

The leaderboards have not been updated for a while and will not be updated. Please consult recent publications for the state of the art. The hidden test sets have been publicly released.

Cornell Natural Language for Visual Reasoning

The Natural Language for Visual Reasoning corpora are two language grounding datasets containing natural language sentences grounded in images. The task is to determine whether a sentence is true about a visual input. The data was collected through crowdsourcing, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. There are two corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.
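As a rough illustration of this framing, each example can be treated as a sentence, one or more images, and a true/false label. The sketch below uses a hypothetical `Example` schema (the field names are assumptions, not the released file format) and derives the label that a majority-class baseline would always predict.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Example:
    """One sentence paired with a visual input (hypothetical schema)."""
    sentence: str
    image_ids: Tuple[str, ...]  # one id for a single image, two for an image pair
    label: bool                 # True if the sentence holds for the visual input


def majority_label(train: List[Example]) -> bool:
    """Return the most frequent label among the training examples."""
    if not train:
        raise ValueError("need at least one training example")
    counts = Counter(ex.label for ex in train)
    return counts.most_common(1)[0][0]


# Placeholder training data; sentences and ids are made up for illustration.
train = [
    Example("sentence A", ("img-0-left", "img-0-right"), True),
    Example("sentence B", ("img-1-left", "img-1-right"), False),
    Example("sentence C", ("img-2-left", "img-2-right"), False),
]
baseline_prediction = majority_label(train)
print(baseline_prediction)
```

A baseline of this kind ignores both the sentence and the images, so it gives a floor against which other systems can be compared.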


Natural Language for Visual Reasoning for Real

NLVR2 contains 107,292 examples of human-written English sentences grounded in pairs of photographs. NLVR2 retains the linguistic diversity of NLVR, while including much more visually complex images.


Data Paper (Suhr et al. 2019)


The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing.
true

One image shows exactly two brown acorns in back-to-back caps on green foliage.
false

Image Credit (in order left-right, top-bottom): MemoryCatcher (CC0), Calabash13 (CC BY-SA 3.0), Charles Rondeau (CC0), Andale (CC0).


We only publicly release the sentence annotations and original image URLs, and scripts that download the images from the URLs. If you would like direct access to the images, please fill out this Google Form. This form asks for your basic information and asks you to agree to our Terms of Service.
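For readers who want a sense of what downloading from a list of URLs involves, here is a minimal sketch. It is not the project's download script: the function names, the hash-based file naming, and the failure report are all assumptions made for illustration. The fetch function is injectable, so the example runs offline with a stub.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable
from urllib.request import urlopen


def fetch_bytes(url: str, timeout: float = 10.0) -> bytes:
    """Download the raw bytes at a URL (default fetcher)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_images(urls: Iterable[str], out_dir: str,
                    fetch: Callable[[str], bytes] = fetch_bytes) -> Dict[str, str]:
    """Save each URL under out_dir; return a map of failed URLs to error text."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failures: Dict[str, str] = {}
    for url in urls:
        # Hypothetical naming scheme: hash of the URL, so names never collide.
        target = out / (hashlib.sha1(url.encode("utf-8")).hexdigest() + ".img")
        if target.exists():
            continue  # already downloaded on an earlier run
        try:
            target.write_bytes(fetch(url))
        except (OSError, ValueError) as error:
            failures[url] = str(error)  # dead links are common for old URLs
    return failures


def fake_fetch(url: str) -> bytes:
    """Offline stand-in for fetch_bytes, used only in this demonstration."""
    if "missing" in url:
        raise OSError("not found")
    return b"image-bytes"


with tempfile.TemporaryDirectory() as tmp:
    failures = download_images(
        ["http://example.com/a.jpg", "http://example.com/missing.jpg"],
        tmp, fetch=fake_fetch)
    saved = sorted(p.name for p in Path(tmp).iterdir())
print(failures, saved)
```

Recording failures rather than aborting matters here because some of the original image URLs may no longer resolve, which is one reason to request direct access to the images.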


Natural Language for Visual Reasoning

NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset can be used for semantic parsing.
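To make the semantic parsing connection concrete, the sketch below hand-writes an executable meaning for the sentence "There is exactly one black triangle not touching any edge" and runs it on a small structured description of an image. The `Item` fields, the box size of 100, and the sample image are all assumptions for illustration; the released structured representations may use different names and coordinates.

```python
from dataclasses import dataclass
from typing import List

BOX_SIZE = 100  # hypothetical side length of a box, in pixels


@dataclass(frozen=True)
class Item:
    """One object inside a box (hypothetical schema)."""
    shape: str
    color: str
    x: int     # left edge of the item's bounding square
    y: int     # top edge of the item's bounding square
    size: int  # side length of the bounding square


def touches_edge(item: Item, box_size: int = BOX_SIZE) -> bool:
    """True if the item's bounding square reaches any side of its box."""
    return (item.x <= 0 or item.y <= 0
            or item.x + item.size >= box_size
            or item.y + item.size >= box_size)


def exactly_one_black_triangle_off_edge(image: List[List[Item]]) -> bool:
    """Executable form of the example sentence, evaluated over all boxes."""
    matches = [item for box in image for item in box
               if item.color == "black"
               and item.shape == "triangle"
               and not touches_edge(item)]
    return len(matches) == 1


# A made-up image with three boxes.
image = [
    [Item("triangle", "black", 30, 40, 20), Item("square", "yellow", 0, 10, 10)],
    [Item("triangle", "black", 80, 50, 20)],  # reaches the right edge
    [Item("circle", "blue", 45, 45, 10)],
]
result = exactly_one_black_triangle_off_edge(image)
print(result)  # True
```

A semantic parser would aim to produce a program like `exactly_one_black_triangle_off_edge` from the sentence automatically; here it is written by hand only to show what execution against a structured representation looks like.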


Data Paper (Suhr et al. 2017)


There is exactly one black triangle not touching any edge
true

there is at least one tower with four blocks with a yellow block at the base and a blue block below the top block
true

There is a box with multiple items and only one item has a different color.
false

There is exactly one tower with a blue block at the base and yellow block at the top
false

More examples (from the development set) are available here.



Leaderboards

Update: as of August 18, 2022, both the NLVR and NLVR2 hidden test sets are released to the public, and we will no longer be taking requests to run on the hidden test set. For NLVR, all of the data is available on Github. For NLVR2, only sentences, labels, and image URLs are available on Github. If you would like direct access to the images, please fill out the Google Form.



Questions?

Please visit our Github issues page or email us at

nlvr < at > googlegroups.com




Acknowledgments

This research was supported by the NSF (CRII-1656998), a Facebook ParlAI Research Award, an AI2 Key Scientific Challenges Award, an Amazon Cloud Credits Grant, and support from Women in Technology New York. This material is based on work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1650441. We thank Mark Yatskar and Noah Snavely for their comments and suggestions, and the workers who participated in our data collection for their contributions.


Also thanks to SQuAD for allowing us to use their code to create this website!

NLVR2 Leaderboard

NLVR2 presents the task of determining whether a natural language sentence is true about a pair of photographs. As of August 18, 2022, the NLVR2 hidden test set is released to the public, and we will no longer be taking requests to run on the hidden test set. You can find the sentences, labels, and image URLs for the hidden test set in the Github repository. If you would like direct access to the images, please fill out the Google Form.
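The leaderboard columns report accuracy (Acc) and consistency (Cons). This page does not define consistency, so the sketch below assumes the definition used in the NLVR2 paper: the proportion of unique sentences for which the prediction is correct on every image pairing of that sentence. The record layout and the sample predictions are hypothetical, and the functions return fractions where the leaderboard shows percentages.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each record: (sentence, gold_label, predicted_label). Hypothetical layout.
Record = Tuple[str, bool, bool]


def accuracy(records: List[Record]) -> float:
    """Fraction of records whose prediction matches the gold label."""
    if not records:
        raise ValueError("need at least one record")
    return sum(gold == pred for _, gold, pred in records) / len(records)


def consistency(records: List[Record]) -> float:
    """Fraction of unique sentences predicted correctly on all their pairings."""
    if not records:
        raise ValueError("need at least one record")
    by_sentence: Dict[str, List[bool]] = defaultdict(list)
    for sentence, gold, pred in records:
        by_sentence[sentence].append(gold == pred)
    return sum(all(hits) for hits in by_sentence.values()) / len(by_sentence)


records = [
    ("s1", True, True), ("s1", False, False),  # s1: correct on both pairings
    ("s2", True, True), ("s2", False, True),   # s2: wrong on one pairing
    ("s3", True, True),                        # s3: single pairing, correct
]
print(accuracy(records))     # 4 of 5 records correct
print(consistency(records))  # 2 of 3 sentences fully correct
```

Consistency is the stricter measure under this definition, because one wrong prediction removes the whole sentence from the count. That is consistent with the Cons column being well below the Acc columns for every system listed.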

| Rank | Date | Model | Institution | Reference | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons) |
|---|---|---|---|---|---|---|---|---|
| | | Human Performance | Cornell University | (Suhr et al. 2019) | 96.2 | 96.3 | 96.1 | - |
| 1 | Oct 14, 2019 | UNITER | Microsoft Dynamics 365 AI Research | (Chen et al. 2019) | 78.4 | 79.5 | 80.4 | 50.8 |
| 2 | Aug 20, 2019 | LXMERT | UNC Chapel Hill | (Tan and Bansal 2019) | 74.9 | 74.5 | 76.2 | 42.1 |
| 3 | Aug 11, 2019 | VisualBERT | UCLA & AI2 & PKU | (Li et al. 2019) | 67.4 | 67.0 | 67.3 | 26.9 |
| 4 | Nov 1, 2018 | MaxEnt | Cornell University | (Suhr et al. 2019) | 54.1 | 54.8 | 53.5 | 12.0 |
| 5 | Nov 1, 2018 | CNN+RNN | Cornell University | (Suhr et al. 2019) | 53.4 | 52.4 | 53.2 | 11.2 |
| 6 | Nov 1, 2018 | FiLM | MILA, run by Cornell University | (Perez et al. 2018) | 51.0 | 52.1 | 53.0 | 10.6 |
| 7 | Nov 1, 2018 | Image Only (CNN) | Cornell University | (Suhr et al. 2019) | 51.6 | 51.9 | 51.9 | 7.1 |
| 8 | Nov 1, 2018 | N2NMN, policy search from scratch | UC Berkeley, run by Cornell University | (Hu et al. 2017) | 51.0 | 51.1 | 51.5 | 5.0 |
| 9 | Nov 1, 2018 | Majority Class | Cornell University | (Suhr et al. 2019) | 50.9 | 51.1 | 51.4 | 4.6 |
| 10 | Nov 1, 2018 | Text Only (RNN) | Cornell University | (Suhr et al. 2019) | 50.9 | 51.1 | 51.4 | 4.6 |
| 11 | Nov 1, 2018 | MAC-Network | Stanford University, run by Cornell University | (Hudson and Manning 2018) | 50.8 | 51.4 | 51.2 | 11.2 |

NLVR Leaderboard

NLVR presents the task of determining whether a natural language sentence is true about a synthetically generated image. We divide results into whether they process the image pixels directly (Images) or whether they use the structured representations of the images (Structured Representations). As of July 29, 2022, the NLVR hidden test set (Test-U) is released to the public, and we will no longer be taking requests to run on the hidden test set. You can find the data, including sentences, labels, and images, in the Github repository.

Images

| Rank | Date | Model | Institution | Reference | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons) |
|---|---|---|---|---|---|---|---|---|
| | | Human Performance | Cornell University | | 94.6 | 95.4 | 94.9 | - |
| 1 | Apr 20, 2018 | CNN-BiATT | UNC Chapel Hill | (Tan and Bansal 2018) | 66.9 | 69.7 | 66.1 | 28.9 |
| 2 | Nov 1, 2018 | N2NMN, policy search from scratch | UC Berkeley, run by Cornell University | (Hu et al. 2017) | 65.3 | 69.1 | 66.0 | 17.7 |
| 3 | Apr 22, 2017 | Neural Module Networks | UC Berkeley, run by Cornell University | (Andreas et al. 2016) | 63.1 | 66.1 | 62.0 | - |
| 4 | Nov 1, 2018 | FiLM | MILA, run by Cornell University | (Perez et al. 2018) | 60.1 | 62.2 | 61.2 | 18.1 |
| 5 | Apr 22, 2017 | Majority Class | Cornell University | (Suhr et al. 2017) | 55.3 | 56.2 | 55.4 | - |
| 6 | Nov 1, 2018 | MAC-Network | Stanford University, run by Cornell University | (Hudson and Manning 2018) | 55.4 | 57.6 | 54.3 | 8.6 |
| Unranked | Sept 7, 2018 | CMM | Chinese Academy of Sciences | (Yao et al. 2018) | 68.0 | 69.9 | - | - |
| Unranked | May 23, 2018 | W-MemNN | Federico Santa María Technical University & Pontífica Universidad Católica de Valparaíso | (Pavez et al. 2018) | 65.6 | 65.8 | - | - |

Structured Representations

| Rank | Date | Model | Institution | Reference | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons) |
|---|---|---|---|---|---|---|---|---|
| | | Human Performance | Cornell University | | 94.6 | 95.4 | 94.9 | - |
| 1 | July 14, 2021 | Consistency-based Parser | UPenn; UC Irvine; AI2 | (Gupta et al. 2021) | 89.6 | 86.3 | 89.5 | 74.0 |
| 2 | June 2, 2019 | Iterative Search | AI2; Mila; UW; CMU | (Dasigi et al. 2019) | 85.4 | 82.4 | 82.9 | 64.3 |
| 3 | Nov 14, 2017 | AbsTAU | Tel-Aviv University | (Goldman et al. 2018) | 85.7 | 84.0 | 82.5 | 63.9 |
| 4 | Apr 4, 2018 | BiATT-Pointer | UNC Chapel Hill | (Tan and Bansal 2018) | 74.6 | 73.9 | 71.8 | 37.2 |
| 5 | Apr 22, 2017 | MaxEnt | Cornell University | (Suhr et al. 2017) | 68.0 | 67.7 | 67.8 | - |
| 6 | Apr 22, 2017 | Majority Class | Cornell University | (Suhr et al. 2017) | 55.3 | 56.2 | 55.4 | - |