Summarization Dataset

The Data

CORNELL NEWSROOM is a large dataset for training and evaluating summarization systems. It contains 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major publications. The summaries are obtained from search and social metadata between 1998 and 2017 and use a variety of summarization strategies combining extraction and abstraction.


Getting Started

Use this site to explore the dataset and better understand the task of summarization as used by newsrooms around the Web.

Explore example summaries in the dataset across publications, time, and summarization strategies; analyze overall statistics across these categories; and evaluate the performance of existing summarization systems trained and tested on the NEWSROOM data.

The full dataset is available to download online, along with tools for extracting text and summaries, analyzing summary extractiveness, and evaluating system performance.
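As a rough illustration of what analyzing summary extractiveness can involve, the sketch below greedily finds word sequences that a summary shares with its article and derives two scores from them: coverage (the fraction of summary words that fall inside a shared fragment) and density (the average squared fragment length per summary word). This follows our reading of the fragment-based measures described in Grusky et al. 2018. The function names and the whitespace tokenization are assumptions made for this example and are not taken from the released tools.

```python
from typing import List, Tuple


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the longest article-matching fragment at each summary position.

    Hypothetical helper for illustration; not the released tool's API.
    """
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                # Extend the match as far as both sequences agree.
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        # Skip past the matched fragment, or past one unmatched word.
        i += max(best, 1)
    return fragments


def coverage_and_density(article_text: str, summary_text: str) -> Tuple[float, float]:
    """Return (coverage, density) using plain whitespace splitting (an assumption)."""
    article = article_text.split()
    summary = summary_text.split()
    if not summary:
        return 0.0, 0.0
    fragments = extractive_fragments(article, summary)
    coverage = sum(len(f) for f in fragments) / len(summary)
    density = sum(len(f) ** 2 for f in fragments) / len(summary)
    return coverage, density
```

A summary copied verbatim from the article scores a coverage of 1.0 and a density equal to its length in words, while a summary that shares no words with the article scores 0.0 on both.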

Released Test Leaderboard

We are unable to maintain this table so that it exhaustively reflects current state-of-the-art summarization performance on the Newsroom dataset. Because of this, we are no longer updating this table. We recommend consulting Google Scholar or Semantic Scholar for recent papers that evaluate on Newsroom.

Date System R-1 R-2 R-L

ABS   Abstractive Systems

EXT   Extractive Systems

MIX   Mixed-Strategy Systems

The leaderboard ranks systems using unstemmed, untokenized ROUGE-1 F-score by default, in order to fully reflect the difficulty of the summarization task, account for generated summary length, and measure performance most comparably across systems.
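To make the ranking metric concrete, here is a minimal sketch of a ROUGE-1 F-score computed without stemming. It counts overlapping unigrams between a reference summary and a system summary, then combines precision and recall. The function name rouge1_f and the use of plain whitespace splitting are assumptions for this example; the official evaluation scripts may preprocess text differently, so scores from this sketch are not guaranteed to match the leaderboard.

```python
from collections import Counter


def rouge1_f(reference: str, candidate: str) -> float:
    """Unigram-overlap F-score with no stemming (illustrative sketch only)."""
    # Whitespace splitting is an assumption, not the official preprocessing.
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())

    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())  # penalizes overly long output
    recall = overlap / sum(ref_counts.values())      # penalizes missing content
    return 2 * precision * recall / (precision + recall)
```

Because the F-score balances precision against recall, a system cannot raise its score simply by generating longer summaries: extra words that do not appear in the reference lower precision.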

Visit the evaluate page to explore system performance in depth and read example summaries.

* Evaluation performed in Grusky et al. (2018).

NEWSROOM was developed as part of the Connected Experiences Lab.
This work is supported by Oath and by a Google Research award.