A multilingual dataset of labeled news web pages for the information extraction task.
The dataset contains websites in 44 languages. The following attributes are labeled on news web pages: title, publication date, text, authors, and tags. Some sites also have subtitle, sources, and categories annotations.
Statistics on the number of sites, pages, and labeled nodes are given in the AE_DATASET_STATS.md file.
We also provide a separate dataset for Russian news sites, where we labeled title, subtitle, publication date, modification date, text, authors, sources, categories, and tags.
For the multilingual dataset, nodes on pages were marked up using sitemaps created with the Web Scraper.
The creation of the Russian dataset is described in our paper. The annotators marked up web pages using Label Studio according to the annotation guideline.
For the multilingual dataset, there is a JSON file for each language with the following structure:
{'site': [
  {
    'uuid':
    'url':
    'html':
    'annotations': [
      {
        'xpath':
        'text':
        'label':
      },
      ...]
  },
  ...],
...}
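As a quick check that the structure above reads correctly, here is a minimal loading sketch. The path multilingual-ae/en.json is an illustrative placeholder (substitute the per-language file you downloaded), and lxml is used to resolve the annotated XPaths against the stored HTML:

```python
import json

from lxml import html as lhtml  # third-party: pip install lxml

# Placeholder path; substitute the per-language JSON file you downloaded.
with open("multilingual-ae/en.json", encoding="utf-8") as f:
    data = json.load(f)  # {site: [page, ...], ...}

for site, pages in data.items():
    page = pages[0]  # look at one page per site
    tree = lhtml.fromstring(page["html"])
    for ann in page["annotations"]:
        nodes = tree.xpath(ann["xpath"])  # node(s) the annotation points at
        if nodes:
            node = nodes[0]
            # XPath results can be elements or text; handle both.
            text = node if isinstance(node, str) else node.text_content()
            print(site, ann["label"], text.strip()[:80])
```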
The JSON structure for the Russian dataset is the Label Studio JSON-MIN format:
[
  {
    'id':
    'url':
    'html':
    'html_en':
    'agency':
    'site':
    'title':
    'annotator':
    'annotation_id':
    'created_at':
    'updated_at':
    'lead_time':
    'labels': [
      {
        'text':
        'hypertextlabels':
        'start':
        'end':
        'endOffset':
        'startOffset':
        'globalOffsets':
      },
      ...]
  },
...]
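A minimal sketch for reading this file, assuming (as in Label Studio JSON-MIN exports) that hypertextlabels is a list of label names assigned to each span; the file name russian.json matches the download list below:

```python
import json
from collections import defaultdict

with open("russian.json", encoding="utf-8") as f:
    records = json.load(f)  # list of annotated pages

page = records[0]
spans_by_label = defaultdict(list)
for span in page["labels"]:
    for label in span["hypertextlabels"]:  # label names for this span
        spans_by_label[label].append(span["text"])

print(page["url"], page["agency"])
for label, texts in spans_by_label.items():
    print(f"{label}: {len(texts)} span(s), e.g. {texts[0][:60]!r}")
```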
We additionally added an html_en field containing the HTML translated into English.
- Multilingual dataset (8.4 GB):
multilingual-ae/
- Multilingual web pages in MHTML (zipped 43.9 GB):
multilingual-ae-mhtml.zip
- Multilingual web pages in HTML (zipped 1.5 GB):
multilingual-ae-html.zip
- Russian dataset (178 MB):
russian.json
- Russian web pages in MHTML (zipped 1 GB):
russian-ae-mhtml.zip
More details about the Russian-language dataset are available in our paper. Please cite us if you use or discuss this dataset in your work:
@INPROCEEDINGS{10076872,
author={Varlamov, Maksim and Galanin, Denis and Bedrin, Pavel and Duda, Sergey and Lazarev, Vladimir and Yatskov, Alexander},
booktitle={2022 Ivannikov Ispras Open Conference (ISPRAS)},
title={A Dataset for Information Extraction from News Web Pages},
year={2022},
volume={},
number={},
pages={100-106},
keywords={Annotations;Neural networks;Web pages;Data aggregation;Information retrieval;Data mining;Electronic commerce;web data extraction;information extraction;news;webpage dataset;neural networks},
doi={10.1109/ISPRAS57371.2022.10076872}}
A dataset for extracting news records with their attributes from HTML pages.
This dataset contains pages with lists of news in Russian. The following attributes were marked: title, date, tag, short_text, time, short_title, and author.
Their distribution:
Attribute | Pages | Records | Domains |
---|---|---|---|
title | 12679 | 247262 | 275 |
date | 12296 | 241634 | 251 |
tag | 6165 | 108400 | 140 |
short_text | 6855 | 115983 | 138 |
time | 1938 | 41892 | 8 |
short_title | 105 | 1289 | 4 |
author | 87 | 957 | 1 |
In total, the dataset contains 13099 pages.
Each file in the data folder is a JSON dictionary with the following fields (a loading sketch follows the list):
- html: formatted HTML code of the page
- exist_labels: labels present in the HTML
- domain: domain of the page
- labeled_xpaths: dictionary mapping XPaths to their labels
- timestamp: timestamp of when the page was loaded
- url: URL of the page
- record_xpaths: XPaths of block nodes (the first text node of each record)
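A minimal sketch under these assumptions: the file name data/0.json is an illustrative placeholder for any file in the data folder, and the field semantics are taken from the list above.

```python
import json

from lxml import html as lhtml  # third-party: pip install lxml

# Placeholder path; use any file from the data folder.
with open("data/0.json", encoding="utf-8") as f:
    page = json.load(f)

print(page["domain"], page["url"], page["exist_labels"])

tree = lhtml.fromstring(page["html"])
for xpath, label in page["labeled_xpaths"].items():
    nodes = tree.xpath(xpath)
    if nodes:
        node = nodes[0]
        # XPath results can be elements or text; handle both.
        text = node if isinstance(node, str) else node.text_content()
        print(label, "->", text.strip()[:60])
```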
- NewsListDataset (915 MB):
russian.json
This file is a JSON dump of a list; each item is a dictionary with the fields described in the Dataset Format section, so the list contains 13099 items.
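A minimal sketch that loads this dump and recomputes per-attribute page counts similar to the table above, assuming exist_labels is a list of the labels present on each page:

```python
import json
from collections import Counter

with open("russian.json", encoding="utf-8") as f:
    pages = json.load(f)

print(len(pages))  # expected: 13099

# Number of pages on which each attribute occurs at least once.
page_counts = Counter(
    label for page in pages for label in set(page["exist_labels"])
)
for label, count in page_counts.most_common():
    print(f"{label}: {count} pages")
```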