A multilingual dataset of labeled news web pages for the information extraction task.
The dataset contains websites in 44 languages. The following attributes are labeled on news web pages: title, publication date, text, authors, and tags. Some sites also have subtitle, sources, and categories annotations.
Statistics on the number of sites, pages, and labeled nodes are given in the AE_DATASET_STATS.md file.
We also provide a separate dataset for Russian news sites, where we labeled title, subtitle, publication date, modification date, text, authors, sources, categories, and tags.
For the multilingual dataset, nodes on pages were marked up using sitemaps created with the Web Scraper.
The creation of the Russian dataset is described in our paper. The annotators marked up web pages using Label Studio according to the annotation guideline.
For the multilingual dataset, there is a JSON file for each language with the following structure:
{'site': [
  {
    'uuid':
    'url':
    'html':
    'annotations': [
      {
        'xpath':
        'text':
        'label':
      },
      ...]
  },
  ...],
...}
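As a quick check that the structure above reads correctly, here is a minimal loading sketch. The path multilingual-ae/en.json is an illustrative placeholder (substitute the per-language file you downloaded), and lxml is used to resolve the annotated XPaths against the stored HTML:

```python
import json

from lxml import html as lhtml  # third-party: pip install lxml

# Placeholder path; substitute the per-language JSON file you downloaded.
with open("multilingual-ae/en.json", encoding="utf-8") as f:
    data = json.load(f)  # {site: [page, ...], ...}

for site, pages in data.items():
    page = pages[0]  # look at one page per site
    tree = lhtml.fromstring(page["html"])
    for ann in page["annotations"]:
        nodes = tree.xpath(ann["xpath"])  # node(s) the annotation points at
        if nodes:
            node = nodes[0]
            # XPath results can be elements or text; handle both.
            text = node if isinstance(node, str) else node.text_content()
            print(site, ann["label"], text.strip()[:80])
```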
The JSON structure for the Russian dataset is the Label Studio JSON-MIN format:
[
  {
    'id':
    'url':
    'html':
    'html_en':
    'agency':
    'site':
    'title':
    'annotator':
    'annotation_id':
    'created_at':
    'updated_at':
    'lead_time':
    'labels': [
      {
        'text':
        'hypertextlabels':
        'start':
        'end':
        'endOffset':
        'startOffset':
        'globalOffsets':
      },
      ...]
  },
...]
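A minimal sketch for reading this file, assuming (as in Label Studio JSON-MIN exports) that hypertextlabels is a list of label names assigned to each span; the file name russian.json matches the download list below:

```python
import json
from collections import defaultdict

with open("russian.json", encoding="utf-8") as f:
    records = json.load(f)  # list of annotated pages

page = records[0]
spans_by_label = defaultdict(list)
for span in page["labels"]:
    for label in span["hypertextlabels"]:  # label names for this span
        spans_by_label[label].append(span["text"])

print(page["url"], page["agency"])
for label, texts in spans_by_label.items():
    print(f"{label}: {len(texts)} span(s), e.g. {texts[0][:60]!r}")
```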
We additionally added an html_en field containing the HTML translated into English.
- Multilingual dataset (8.4 GB):
multilingual-ae/
- Multilingual web pages in MHTML (zipped 43.9 GB):
multilingual-ae-mhtml.zip
- Multilingual web pages in HTML (zipped 1.5 GB):
multilingual-ae-html.zip
- Russian dataset (178 MB):
russian.json
- Russian web pages in MHTML (zipped 1 GB):
russian-ae-mhtml.zip
More details about the Russian-language dataset are available in our paper. Please cite us if you use or discuss this dataset in your work:
@INPROCEEDINGS{10076872,
author={Varlamov, Maksim and Galanin, Denis and Bedrin, Pavel and Duda, Sergey and Lazarev, Vladimir and Yatskov, Alexander},
booktitle={2022 Ivannikov Ispras Open Conference (ISPRAS)},
title={A Dataset for Information Extraction from News Web Pages},
year={2022},
volume={},
number={},
pages={100-106},
keywords={Annotations;Neural networks;Web pages;Data aggregation;Information retrieval;Data mining;Electronic commerce;web data extraction;information extraction;news;webpage dataset;neural networks},
doi={10.1109/ISPRAS57371.2022.10076872}}
A dataset for extracting news records with their attributes from HTML pages.
This dataset contains pages with lists of news in Russian. The following attributes were marked: title, date, tag, short_text, time, short_title, and author.
Their distribution:
Attribute | Pages | Records | Domains |
---|---|---|---|
title | 12679 | 247262 | 275 |
date | 12296 | 241634 | 251 |
tag | 6165 | 108400 | 140 |
short_text | 6855 | 115983 | 138 |
time | 1938 | 41892 | 8 |
short_title | 105 | 1289 | 4 |
author | 87 | 957 | 1 |
In total, the dataset contains 13099 pages.
Each file in the data folder is a JSON dictionary with the following fields (a loading sketch follows the list):
- html: formatted HTML code of the page
- exist_labels: labels present in the HTML
- domain: domain of the page
- labeled_xpaths: dictionary mapping XPaths to their labels
- timestamp: timestamp of when the page was loaded
- url: URL of the page
- record_xpaths: XPaths of block nodes (the first text node of each record)
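A minimal sketch under these assumptions: the file name data/0.json is an illustrative placeholder for any file in the data folder, and the field semantics are taken from the list above.

```python
import json

from lxml import html as lhtml  # third-party: pip install lxml

# Placeholder path; use any file from the data folder.
with open("data/0.json", encoding="utf-8") as f:
    page = json.load(f)

print(page["domain"], page["url"], page["exist_labels"])

tree = lhtml.fromstring(page["html"])
for xpath, label in page["labeled_xpaths"].items():
    nodes = tree.xpath(xpath)
    if nodes:
        node = nodes[0]
        # XPath results can be elements or text; handle both.
        text = node if isinstance(node, str) else node.text_content()
        print(label, "->", text.strip()[:60])
```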
- NewsListDataset (915 MB):
russian.json
This file is a JSON dump of a list; each item is a dictionary with the fields described in the Dataset Format section, so the list contains 13099 items.
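A minimal sketch that loads this dump and recomputes per-attribute page counts similar to the table above, assuming exist_labels is a list of the labels present on each page:

```python
import json
from collections import Counter

with open("russian.json", encoding="utf-8") as f:
    pages = json.load(f)

print(len(pages))  # expected: 13099

# Number of pages on which each attribute occurs at least once.
page_counts = Counter(
    label for page in pages for label in set(page["exist_labels"])
)
for label, count in page_counts.most_common():
    print(f"{label}: {count} pages")
```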