Newsroom

Summarization Dataset

Download the Data

The data is available in two forms: the complete text data and a list of Archive.org URLs to scrape. Downloading the complete data requires accepting the data licensing terms. Please submit the form here. Shortly after you fill in the form and agree to the license, you will receive a download link at the email address you provided. The download includes the complete training, development, and released test data splits. You can also scrape the data yourself using the Archive.org URLs ("thin dataset") and the data builder scripts available on GitHub.

Format

CORNELL NEWSROOM contains three large files, one each for the training, development, and released test sets. Each file uses the gzip-compressed JSON Lines format: each line is an object representing a single article-summary pair. An example summary object:

{
            "text": "...",
         "summary": "...",
           "title": "...",
         "archive": "http://...",
            "date": 20160302060024,
         "density": 1.25,
        "coverage": 0.75,
     "compression": 12.5,
 "compression_bin": "medium",
    "coverage_bin": "low",
     "density_bin": "abstractive"
}

The date is an integer in the Internet Archive date format: YYYYMMDDHHMMSS. Density and coverage scores are provided for convenience and are computed with the summary analysis tool that is also provided. Subset labels by density, coverage, and compression are provided as well (the density_bin, coverage_bin, and compression_bin fields). For example, in Python, each data file can be read as follows:

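import gzip
import json

path = "train.jsonl.gz"
data = []

# Open the gzip-compressed file in text mode and parse one JSON object per line.
with gzip.open(path, "rt", encoding="utf-8") as f:
    for ln in f:
        obj = json.loads(ln)
        data.append(obj)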
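
Once the objects are loaded, the date and bin fields described above can be used directly. The following is a minimal sketch, not part of the released tooling: the helper names parse_archive_date and filter_by_bin are hypothetical, the sample records are made up for illustration, and the "extractive" label is assumed rather than taken from the example object.

```python
from datetime import datetime


def parse_archive_date(stamp):
    """Convert an integer YYYYMMDDHHMMSS timestamp into a datetime."""
    return datetime.strptime(str(stamp), "%Y%m%d%H%M%S")


def filter_by_bin(records, field, value):
    """Keep only the records whose bin field (e.g. "density_bin") equals value."""
    return [r for r in records if r.get(field) == value]


# Made-up records standing in for entries of `data`; real objects also
# carry "text", "summary", "title", and the numeric scores.
sample = [
    {"date": 20160302060024, "density_bin": "abstractive"},
    {"date": 20150101120000, "density_bin": "extractive"},  # assumed label
]

abstractive = filter_by_bin(sample, "density_bin", "abstractive")
crawl_time = parse_archive_date(abstractive[0]["date"])
print(crawl_time.isoformat())
```

The same filter_by_bin call should work with coverage_bin or compression_bin, assuming those fields are present on every object as in the example above.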