NLVR
The Natural Language for Visual Reasoning (NLVR) corpora are two language grounding datasets containing natural language sentences grounded in images. The task is to determine whether a sentence is true about a visual input. The data was collected through crowdsourcing, and solving the task requires reasoning about sets of objects, comparisons, and spatial relations. There are two corpora: NLVR, with synthetically generated images, and NLVR2, which includes natural photographs.
NLVR2 contains 107,292 examples of human-written English sentences grounded in pairs of photographs. NLVR2 retains the linguistic diversity of NLVR, while including much more visually complex images.
For NLVR2, we publicly release only the sentence annotations, the original image URLs, and scripts that download the images from those URLs. If you would like direct access to the images, please fill out this Google Form. The form asks for your basic information and asks you to agree to our Terms of Service.
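As an illustration of what the download step involves, here is a minimal sketch that assumes JSON-lines annotation files with identifier and image-URL fields. The field names, file names, and output layout are assumptions; the released scripts and readme are the authoritative reference.

```python
# Minimal sketch of downloading NLVR2 images from the released URLs.
# Field names ("identifier", "left_url", "right_url") are assumptions here;
# the provided download scripts and readme are the authoritative reference.
import json
import os
import urllib.request

def download_images(annotation_path, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    with open(annotation_path) as f:
        for line in f:
            example = json.loads(line)
            for side in ("left", "right"):
                url = example.get(f"{side}_url")
                if not url:
                    continue
                ext = os.path.splitext(url)[1] or ".jpg"
                target = os.path.join(out_dir, f"{example['identifier']}-{side}{ext}")
                if os.path.exists(target):
                    continue  # already downloaded
                try:
                    urllib.request.urlretrieve(url, target)
                except OSError as err:
                    print(f"Could not fetch {url}: {err}")

download_images("dev.json", "images/dev")
```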
NLVR contains 92,244 pairs of human-written English sentences grounded in synthetic images. Because the images are synthetically generated, this dataset can also be used for semantic parsing.
More examples (from the development set) are available here.
Both NLVR and NLVR2 are split into training, development, and two test sets. One test set is public (Test-P) and available with the data, and the other is not released (Test-U). We maintain a leaderboard displaying accuracy and consistency on the unreleased test set (as well as accuracy on the development and public test sets). Results are ordered by accuracy on the unreleased test set; ties are broken with consistency.
We require two months or more between runs on each leaderboard test set. We will do our best to run within two weeks (usually we will run much faster). We will only post results on the leaderboard when an online description of the system is available. Testing on the leaderboard test set is meant to be the final step before publication. Under extreme circumstances, we reserve the right to limit running on the leaderboard test set to systems that are mature for publication. Your model should generate a prediction file in the format specified in the NLVR readme and run with the provided evaluation scripts. You can request to add your model to the leaderboard even if you don't evaluate on the unreleased test set.
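For illustration only, a submission pipeline might end with a step like the sketch below, which writes one predicted label per example identifier. The CSV layout, field names, and label strings here are assumptions; the NLVR readme and the provided evaluation scripts define the actual format.

```python
# Illustrative only: writes one "identifier,prediction" line per example.
# The authoritative prediction format is specified in the NLVR readme.
import csv
import json

def write_predictions(annotation_path, predict_fn, out_path):
    with open(annotation_path) as f_in, open(out_path, "w", newline="") as f_out:
        writer = csv.writer(f_out)
        for line in f_in:
            example = json.loads(line)
            label = predict_fn(example)  # your model; returns e.g. "true" or "false"
            writer.writerow([example["identifier"], label])

# Example usage with a trivial baseline that always predicts "true".
write_predictions("dev.json", lambda ex: "true", "predictions.csv")
```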
For both datasets, we use two evaluation metrics: accuracy and consistency. Accuracy (Acc) is the proportion of examples (sentence-image pairs) for which a model correctly predicted the truth value. Consistency (Cons) measures how robustly a model handles the images paired with the same sentence: it is the proportion of unique sentences for which the model correctly predicted the truth value for all paired images (Goldman et al., 2018).
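In code, the two metrics can be computed from per-example predictions roughly as follows; the tuple layout and the toy data are illustrative assumptions, and only the definitions above are authoritative.

```python
# Sketch of the two metrics: accuracy over sentence-image examples, and
# consistency over unique sentences (correct on *all* images paired with it).
from collections import defaultdict

def accuracy_and_consistency(examples):
    # examples: iterable of (sentence, gold_label, predicted_label),
    # one entry per sentence-image example.
    total = 0
    correct = 0
    per_sentence = defaultdict(list)
    for sentence, gold, pred in examples:
        hit = (gold == pred)
        total += 1
        correct += hit
        per_sentence[sentence].append(hit)
    accuracy = correct / total
    consistency = sum(all(hits) for hits in per_sentence.values()) / len(per_sentence)
    return accuracy, consistency

acc, cons = accuracy_and_consistency([
    ("two dogs are shown", "true", "true"),
    ("two dogs are shown", "false", "true"),  # same sentence, different images
    ("there is one cup", "true", "true"),
])
# acc == 2/3, cons == 1/2: the first sentence is wrong on one of its images.
```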
For questions, please visit our Github issues page or email us. Please email us if you wish to run on an unreleased test set. To keep up to date with major changes, please subscribe.
This research was supported by the NSF (CRII-1656998), a Facebook ParlAI Research Award, an AI2 Key Scientific Challenges Award, an Amazon Cloud Credits Grant, and support from Women in Technology New York. This material is based on work supported by the National Science Foundation Graduate Research Fellowship under Grant No. DGE-1650441. We thank Mark Yatskar and Noah Snavely for their comments and suggestions, and the workers who participated in our data collection for their contributions.
Also thanks to SQuAD for allowing us to use their code to create this website!
NLVR2 presents the task of determining whether a natural language sentence is true about a pair of photographs.
Rank | Date | Model | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons)
---|---|---|---|---|---|---
- | - | Human Performance, Cornell University (Suhr et al. 2019) | 96.2 | 96.3 | 96.1 | -
1 | Oct 14, 2019 | UNITER, Microsoft Dynamics 365 AI Research (Chen et al. 2019) | 78.4 | 79.5 | 80.4 | 50.8
2 | Aug 20, 2019 | LXMERT, UNC Chapel Hill (Tan and Bansal 2019) | 74.9 | 74.5 | 76.2 | 42.1
3 | Aug 11, 2019 | VisualBERT, UCLA & AI2 & PKU (Li et al. 2019) | 67.4 | 67.0 | 67.3 | 26.9
4 | Nov 1, 2018 | MaxEnt, Cornell University (Suhr et al. 2019) | 54.1 | 54.8 | 53.5 | 12.0
5 | Nov 1, 2018 | CNN+RNN, Cornell University (Suhr et al. 2019) | 53.4 | 52.4 | 53.2 | 11.2
6 | Nov 1, 2018 | FiLM, MILA, run by Cornell University (Perez et al. 2018) | 51.0 | 52.1 | 53.0 | 10.6
7 | Nov 1, 2018 | Image Only (CNN), Cornell University (Suhr et al. 2019) | 51.6 | 51.9 | 51.9 | 7.1
8 | Nov 1, 2018 | N2NMN (policy search from scratch), UC Berkeley, run by Cornell University (Hu et al. 2017) | 51.0 | 51.1 | 51.5 | 5.0
9 | Nov 1, 2018 | Majority Class, Cornell University (Suhr et al. 2019) | 50.9 | 51.1 | 51.4 | 4.6
10 | Nov 1, 2018 | Text Only (RNN), Cornell University (Suhr et al. 2019) | 50.9 | 51.1 | 51.4 | 4.6
11 | Nov 1, 2018 | MAC-Network, Stanford University, run by Cornell University (Hudson and Manning 2018) | 50.8 | 51.4 | 51.2 | 11.2
NLVR presents the task of determining whether a natural language sentence is true about a synthetically generated image. We divide results by whether models process the image pixels directly (Images) or use the structured representations of the images (Structured Representations); the two tables below report each setting in turn.
Images:

Rank | Date | Model | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons)
---|---|---|---|---|---|---
- | - | Human Performance, Cornell University | 94.6 | 95.4 | 94.9 | -
1 | Apr 20, 2018 | CNN-BiATT, UNC Chapel Hill (Tan and Bansal 2018) | 66.9 | 69.7 | 66.1 | 28.9
2 | Nov 1, 2018 | N2NMN (policy search from scratch), UC Berkeley, run by Cornell University (Hu et al. 2017) | 65.3 | 69.1 | 66.0 | 17.7
3 | Apr 22, 2017 | Neural Module Networks, UC Berkeley, run by Cornell University (Andreas et al. 2016) | 63.1 | 66.1 | 62.0 | -
4 | Nov 1, 2018 | FiLM, MILA, run by Cornell University (Perez et al. 2018) | 60.1 | 62.2 | 61.2 | 18.1
5 | Apr 22, 2017 | Majority Class, Cornell University (Suhr et al. 2017) | 55.3 | 56.2 | 55.4 | -
6 | Nov 1, 2018 | MAC-Network, Stanford University, run by Cornell University (Hudson and Manning 2018) | 55.4 | 57.6 | 54.3 | 8.6
Unranked | Sept 7, 2018 | CMM, Chinese Academy of Sciences (Yao et al. 2018) | 68.0 | 69.9 | - | -
Unranked | May 23, 2018 | W-MemNN, Federico Santa María Technical University & Pontífica Universidad Católica de Valparaíso (Pavez et al. 2018) | 65.6 | 65.8 | - | -
Structured Representations:

Rank | Date | Model | Dev. (Acc) | Test-P (Acc) | Test-U (Acc) | Test-U (Cons)
---|---|---|---|---|---|---
- | - | Human Performance, Cornell University | 94.6 | 95.4 | 94.9 | -
1 | Nov 14, 2017 | AbsTAU, Tel-Aviv University (Goldman et al. 2018) | 85.7 | 84.0 | 82.5 | 63.9
2 | Apr 4, 2018 | BiATT-Pointer, UNC Chapel Hill (Tan and Bansal 2018) | 74.6 | 73.9 | 71.8 | 37.2
3 | Apr 22, 2017 | MaxEnt, Cornell University (Suhr et al. 2017) | 68.0 | 67.7 | 67.8 | -
4 | Apr 22, 2017 | Majority Class, Cornell University (Suhr et al. 2017) | 55.3 | 56.2 | 55.4 | -