Summarization Dataset

The Data

CORNELL NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017 and use a variety of summarization strategies combining extraction and abstraction.


Getting Started

Use this site to explore the dataset and better understand the task of summarization as used by newsrooms around the Web.

Explore example summaries in the dataset across publications, time, and summarization strategies; analyze overall statistics in the dataset across these categories; and evaluate the performance of existing summarization systems trained and tested on the NEWSROOM data.

The full dataset is available to download online, along with tools for extracting text and summaries, analyzing summary extractiveness, and evaluating system performance.
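As a rough illustration of what analyzing summary extractiveness can involve, the sketch below computes two fragment-based measures in the spirit of the coverage and density statistics described in Grusky et al. 2018: the share of summary words that sit inside spans copied from the article, and the average squared length of those spans. This is a simplified, quadratic-time approximation and not the released tooling; the function names, the lowercase whitespace splitting, and the greedy matching loop are all assumptions made for this example.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily collect the longest spans of the summary that also appear in the article."""
    fragments = []
    i = 0
    while i < len(summary_tokens):
        best = 0
        # Try every article position and keep the longest match starting at summary index i.
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens)
                   and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best  # skip past the copied span
        else:
            i += 1     # this summary word does not occur in the article
    return fragments


def coverage_and_density(article, summary):
    """Return (coverage, density) for one article/summary pair."""
    # Assumption: lowercase whitespace splitting stands in for real tokenization.
    article_tokens = article.lower().split()
    summary_tokens = summary.lower().split()
    if not summary_tokens:
        return 0.0, 0.0
    fragments = extractive_fragments(article_tokens, summary_tokens)
    coverage = sum(len(f) for f in fragments) / len(summary_tokens)
    density = sum(len(f) ** 2 for f in fragments) / len(summary_tokens)
    return coverage, density
```

Under these assumptions, a summary copied verbatim from the article has coverage 1.0 and a density equal to its length in words, while a fully reworded summary scores close to zero on both.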

Unreleased Test Leaderboard

Date System R-1 R-2 R-L HUM

Released Test Leaderboard

Please email us your test results to add your work here. This listing is not exhaustive.

Date System R-1 R-2 R-L

ABS   Abstractive Systems

EXT   Extractive Systems

MIX   Mixed-Strategy Systems

The leaderboard ranks systems using unstemmed, untokenized ROUGE-1 F-score by default, in order to fully reflect the difficulty of the summarization task, account for generated summary length, and measure performance most comparably across systems. Explore other stemming, tokenization, and ROUGE score variants above. HUM is the composite score of the NEWSROOM human evaluation task.
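To make the default ranking metric concrete, here is a minimal sketch of a ROUGE-1 F-score: the harmonic mean of unigram precision and recall between a system summary and a reference. It is not the scorer used for the leaderboard. The function name is hypothetical, and the sketch assumes that "untokenized" means plain whitespace splitting, that "unstemmed" means word forms are compared as written, and that text is lowercased before comparison.

```python
from collections import Counter


def rouge_1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F-score between a reference and a candidate summary."""
    # Assumption: whitespace splitting only, no stemming, case-insensitive.
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a word counts only as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())  # falls as the candidate gets longer
    recall = overlap / sum(ref_counts.values())      # falls as the candidate omits more
    return 2 * precision * recall / (precision + recall)
```

Because the F-score balances precision against recall, a system cannot raise its score simply by generating longer summaries, which is one way a metric can account for generated summary length.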

Visit the evaluate page to explore system performance in depth and read example summaries.

* Evaluation performed in Grusky et al. (2018).

NEWSROOM was developed as part of the Connected Experiences Lab.
This work is supported by Oath and by a Google Research award.