PENS Dataset

The PENS dataset contains 113,762 news articles whose topics fall into 15 categories. Each article includes a news ID, a title, a body, and a category manually tagged by editors. The average lengths of the news title and news body are 10.5 and 549.0, respectively. Entities in each news title are extracted and linked to their entries in Wikidata.

We sample 500,000 user-news impressions from June 13, 2019, to July 3, 2019, as the training set. An impression log records the news articles displayed to a user, together with the user's click behavior on those articles, when the user visits the news website at a specific time. Each labeled sample in our training set has the format [uID, tmp, clkNews, uclkNews, clkedHis], where uID is the anonymous ID of a user and tmp is the timestamp of the impression record. clkNews and uclkNews are the clicked and unclicked news in this impression, respectively, and clkedHis is the list of news articles previously clicked by this user. All items in clkNews, uclkNews and clkedHis are sorted by the user's click time.

File Name	Description
news.tsv	The information of news articles
train.tsv	The click histories and impression logs of users for training
valid.tsv	The click histories and impression logs of users for validation

Column	Example Context	Description
News ID	N10000	Unique ID of news
Category	sports	One of the 15 categories
Topic	soccer	Specific topic of news
Headline	Predicting Atlanta United's lineup against Columbus Crew in the U.S. Open Cup
News body	Only FIVE internationals allowed, count em, FIVE! So first off we should say, per our usual Atlanta United lineup predictions, this will be wrong...
Title entity	{"Atlanta United's": 'Atlanta United FC'}	The mapping between a phrase in the title and the corresponding entity in Wikidata
Entity content	{'Atlanta United FC': { 'type': 'item', 'id': 'Q16836317', 'labels': {'en': {'language': 'en', 'value': 'Atlanta United FC'}, ...}, 'descriptions': {'en': {'language': 'en', 'value': 'Football team in the city of Atlanta, Georgia, United States'}, ...}, 'aliases': {'en': [{'language': 'en', 'value': 'Atlanta United'}, {'language': 'en', 'value': 'ATL UTD'}, {'language': 'en', 'value': 'ATL UTD FC'}, ...], ...}, 'claims': {'P31': [{'mainsnak': {'snaktype': 'value', 'property': 'P31', 'datavalue': {'value': {'entity-type': 'item', 'numeric-id': 476028, 'id': 'Q476028'}, 'type': 'wikibase-entityid'}, 'datatype': 'wikibase-item'}, 'type': 'statement', 'id': 'Q16836317$2462E96F-B25E-4BE9-9CAC-876FF99CD5DA', 'rank': 'normal'}, ... ], ...}, 'sitelinks': {'zhwiki': {'site': 'zhwiki', 'title': '阿特蘭大聯足球會', 'badges': []}, ...}, 'lastrevid': 1452771827}, ...}	The mapping between the entity name and the entity content in Wikidata. For the detailed data structure, please refer to the official Wikidata documentation.
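
As a rough illustration of the news record layout described above, the sketch below parses one tab-separated row into a small record. It assumes the seven columns appear in the order listed in the table, that the news body contains no tab characters, and that the Title entity field is stored as a Python-style dict literal, as the example suggests. The names NewsRecord and parse_news_row are hypothetical, and the empty `{}` in the sample row is a placeholder for the much larger Entity content value.

```python
import ast
from dataclasses import dataclass


@dataclass
class NewsRecord:
    news_id: str
    category: str
    topic: str
    headline: str
    body: str
    title_entity: dict   # phrase in title -> entity name
    entity_content: str  # kept as raw text; the real value is a large nested dict


def parse_news_row(line: str) -> NewsRecord:
    """Parse one row, assuming seven tab-separated columns in table order."""
    fields = line.rstrip("\n").split("\t")
    news_id, category, topic, headline, body, title_entity, entity_content = fields
    return NewsRecord(
        news_id=news_id,
        category=category,
        topic=topic,
        headline=headline,
        body=body,
        # literal_eval only accepts literals, so it is safer than eval here.
        title_entity=ast.literal_eval(title_entity) if title_entity else {},
        entity_content=entity_content,
    )


# Sample row assembled from the example values in the table above.
sample = "\t".join([
    "N10000",
    "sports",
    "soccer",
    "Predicting Atlanta United's lineup against Columbus Crew in the U.S. Open Cup",
    "Only FIVE internationals allowed, count em, FIVE! ...",
    "{\"Atlanta United's\": 'Atlanta United FC'}",
    "{}",  # placeholder for the Entity content column
])
record = parse_news_row(sample)
print(record.news_id, record.title_entity)
```

If the actual file stores these fields as JSON rather than Python literals, json.loads would be the appropriate replacement for ast.literal_eval.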

Column	Example Context	Description
UserID	U335175	Unique ID of users
ClicknewsID	N41340 N27570 N83288 ...	The user’s historical clicked news
dwelltime	116 23 59 ...	The duration of browsing historical clicked news
exposure_time	6/19/2019 5:10:01 AM#TAB#...	The exposure times of historical clicked news, separated by '#TAB#'
pos	N55476 N103556 N52756 ...	The clicked news in this impression
neg	N48119 N92507 N92467 ...	The unclicked news in this impression
start	7/3/2019 6:43:49 AM	Start time of this impression
end	7/3/2019 7:06:06 AM	End time of this impression
dwelltime_pos	34 83 79 ...	The duration of browsing clicked news in this impression
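
The following minimal sketch shows how one row with the columns above might be split into typed fields. It assumes the columns are tab-separated and appear in the order listed, that news IDs and dwell times are space-separated, and that timestamps follow the month/day/year 12-hour format seen in the examples. The function name parse_impression and the output keys are hypothetical. The sample row is assembled only from the example values in the table, so its exposure_time field holds just the single timestamp shown there.

```python
from datetime import datetime

TIME_FORMAT = "%m/%d/%Y %I:%M:%S %p"  # e.g. "7/3/2019 6:43:49 AM"
COLUMNS = [
    "UserID", "ClicknewsID", "dwelltime", "exposure_time",
    "pos", "neg", "start", "end", "dwelltime_pos",
]


def parse_impression(line: str) -> dict:
    """Split one impression row into lists and datetimes (assumed layout)."""
    row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    return {
        "user_id": row["UserID"],
        "history": row["ClicknewsID"].split(),
        "history_dwell": [int(x) for x in row["dwelltime"].split()],
        "history_exposure": [
            datetime.strptime(t, TIME_FORMAT)
            for t in row["exposure_time"].split("#TAB#") if t
        ],
        "clicked": row["pos"].split(),
        "unclicked": row["neg"].split(),
        "start": datetime.strptime(row["start"], TIME_FORMAT),
        "end": datetime.strptime(row["end"], TIME_FORMAT),
        "clicked_dwell": [int(x) for x in row["dwelltime_pos"].split()],
    }


sample = "\t".join([
    "U335175",
    "N41340 N27570 N83288",
    "116 23 59",
    "6/19/2019 5:10:01 AM",
    "N55476 N103556 N52756",
    "N48119 N92507 N92467",
    "7/3/2019 6:43:49 AM",
    "7/3/2019 7:06:06 AM",
    "34 83 79",
])
impression = parse_impression(sample)
session_seconds = (impression["end"] - impression["start"]).total_seconds()
print(impression["clicked"], session_seconds)
```

Keeping the clicked IDs and their dwell times as parallel lists preserves the ordering in the file, which matters if dwell time is later used as a per-article signal.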

Construction of the test set: To provide an offline testbed, we invited 103 native English speakers (all college students) to manually create a test set in two stages.

In the first stage, each person browses 1,000 news headlines and marks at least 50 that he or she is interested in. The exhibited news articles were randomly selected from our news corpus and arranged by their first exposure time.

In the second stage, each annotator is asked to write a preferred headline for each of another 200 unseen news articles from our dataset, without being shown the original news titles, while also highlighting important segments in the original articles. These unseen news articles are evenly sampled, and we assign them redundantly so that each article is shown to four people on average. The quality of the manually written headlines was checked by professional editors from the perspective of the factual aspect of the media frame. Low-quality headlines (e.g., those containing wrong factual information, inconsistent with the news body, or too short or too long) are excluded. The remaining headlines are regarded as the personalized reading focuses of the annotators on the articles and are taken as gold-standard headlines in our dataset.

Column	Example Context	Description
userid	NT1	Unique ID of one of the 103 users
clicknewsID	N108480,N38238,N35068, ...	The user’s historical clicked news collected at the first stage
posnewID	N24110,N62769,N36186, ...	The exhibited news for each user at the second stage
rewrite_titles	'Legal battle looms over Trump EPA\'s rule change of Obama\'s Clean Power Plan rule ...	The manually written news headlines for the exhibited news articles, separated by '#TAB#'
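
A test-set row could be read along the lines of the sketch below. It assumes the four columns are tab-separated in the order listed, that news IDs are comma-separated, and that the i-th rewritten headline corresponds to the i-th ID in posnewID. That positional pairing is an assumption, not something the table states. The function name parse_test_row is hypothetical, and the headlines in the sample row are placeholders rather than real annotations.

```python
def parse_test_row(line: str) -> dict:
    """Parse one test-set row (assumed layout: userid, clicknewsID, posnewID, rewrite_titles)."""
    user_id, clicked, exhibited, titles = line.rstrip("\n").split("\t")
    clicked_ids = [x.strip() for x in clicked.split(",") if x.strip()]
    exhibited_ids = [x.strip() for x in exhibited.split(",") if x.strip()]
    rewritten = titles.split("#TAB#")
    if len(exhibited_ids) != len(rewritten):
        # Fail loudly rather than silently misalign headlines and articles.
        raise ValueError(
            f"{user_id}: {len(exhibited_ids)} exhibited IDs but {len(rewritten)} headlines"
        )
    return {
        "user_id": user_id,
        "history": clicked_ids,
        # Assumed positional pairing of news ID and user-written headline.
        "gold_headlines": dict(zip(exhibited_ids, rewritten)),
    }


# IDs come from the table's examples; the headlines are placeholders.
sample = "\t".join([
    "NT1",
    "N108480,N38238,N35068",
    "N24110,N62769,N36186",
    "placeholder headline 1#TAB#placeholder headline 2#TAB#placeholder headline 3",
])
parsed = parse_test_row(sample)
print(parsed["user_id"], parsed["gold_headlines"]["N62769"])
```

The length check is a deliberate choice: if the counts disagree for a user, the pairing assumption is wrong for that row and should be inspected by hand.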
