An English Dataset for Personalized
News Headline Generation Research
PENS is an English dataset for personalized news headline generation research. It consists of two parts: a training set and a test set. The training set was collected from anonymized user impression logs of the Microsoft News website, and the test set was manually created by over a hundred native English speakers to provide a fair testbed for evaluating models offline.
PENS contains about 113k English news articles, whose topics are distributed across 15 categories, and 500k impression logs generated by over 445k users for training. Every news article has rich textual content, including a title, body, category, and corresponding entities. Each impression log contains the clicked news, the non-clicked news, and the user's historical news click behaviors before that impression. To provide an offline testbed, we invited 103 native English speakers to manually create a test set in two stages. In total, over 100k personalized news headlines were generated.
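As an informal illustration of the article and impression-log contents described above, the following Python sketch models one article and one impression as simple records. All class names, field names, and IDs here are hypothetical and chosen only for readability; they are not the column names or file format of the released dataset, which should be checked against the downloaded files.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class NewsArticle:
    # Hypothetical field names mirroring the description above.
    news_id: str
    title: str
    body: str
    category: str
    entities: List[str] = field(default_factory=list)


@dataclass
class ImpressionLog:
    # Hypothetical field names; the released files may differ.
    user_id: str            # anonymized (hashed) user ID
    clicked: List[str]      # news IDs clicked in this impression
    non_clicked: List[str]  # news IDs shown but not clicked
    history: List[str]      # news IDs the user clicked before this impression


def labeled_pairs(log: ImpressionLog) -> List[Tuple[str, int]]:
    """Turn one impression into (news_id, label) pairs: 1 = clicked, 0 = not clicked."""
    return [(nid, 1) for nid in log.clicked] + [(nid, 0) for nid in log.non_clicked]


# Made-up example values, for illustration only.
log = ImpressionLog(
    user_id="U0001",
    clicked=["N10"],
    non_clicked=["N11", "N12"],
    history=["N01", "N02"],
)
pairs = labeled_pairs(log)
```

Flattening an impression into labeled pairs like this is one common way to prepare click data for training a user model; it is shown as an assumption about typical usage, not as the method used by the dataset authors.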
Note that, to protect user privacy, each user was de-linked from the production system and securely hashed into an anonymized ID. For more detailed information about the PENS dataset, please refer to the following paper:
The PENS dataset is free to download for research purposes under the Microsoft Research License Terms. Please read these terms before downloading the dataset.
This dataset supports research on personalized news headline generation. The training and test sets can be downloaded via:
An introduction to the PENS dataset, including its statistics and some example cases.
We propose a basic, generic framework for the problem of personalized headline generation.
The code for our work with the PENS dataset is provided and will be updated later.