ispras/news-page-dataset

ISPRAS News Datasets Collection

Dataset For Information Extraction From News Web Pages

A multilingual dataset of labeled news web pages for the information extraction task.

Dataset Description

The dataset contains websites in 44 languages. We labeled the following attributes on news web pages: title, publication date, text, authors, tags. Some sites may also have subtitle, sources and categories annotations.

Statistics on the number of sites, pages and labeled nodes are given in the AE_DATASET_STATS.md file.

We also have a separate dataset for Russian news sites. There we labeled title, subtitle, publication date, modification date, text, authors, sources, categories and tags.

Data Collection

For the multilingual dataset, we marked up nodes on pages using sitemaps created with the Web Scraper.

The creation of the Russian dataset is described in our paper. The annotators marked up web pages using Label Studio according to the guideline.

Dataset Format

For the multilingual dataset we have JSON for each language with the following structure:

{'site': [
  {
    'uuid':
    'url':
    'html':
    'annotations': [
      {
        'xpath':
        'text':
        'label':
      },
      ...]
  },
  ...],
...}
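
As a rough illustration of how this structure can be traversed, the sketch below flattens a small in-memory sample into one row per labeled node. It assumes that the top-level keys are site names and that each value is a list of pages. The sample site, URL and values are invented for the example, and `iter_annotations` is a hypothetical helper, not part of the dataset tooling.

```python
import json
from collections import Counter

# Tiny in-memory sample mirroring the structure above. The site key, URL and
# field values are invented for illustration only.
sample = json.loads("""
{
  "example.com": [
    {
      "uuid": "0001",
      "url": "https://example.com/news/1",
      "html": "<html><body><h1>Headline</h1><p>Body</p></body></html>",
      "annotations": [
        {"xpath": "/html/body/h1", "text": "Headline", "label": "title"},
        {"xpath": "/html/body/p", "text": "Body", "label": "text"}
      ]
    }
  ]
}
""")


def iter_annotations(dataset):
    """Yield (site, url, label, xpath, text) for every labeled node."""
    # Assumption: top-level keys are site names, values are lists of pages.
    for site, pages in dataset.items():
        for page in pages:
            for ann in page.get("annotations", []):
                yield site, page["url"], ann["label"], ann["xpath"], ann["text"]


rows = list(iter_annotations(sample))
label_counts = Counter(label for _, _, label, _, _ in rows)
print(label_counts)
```

For a real language file, `sample` would be replaced by the result of `json.load` on that file.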

The JSON structure for the Russian dataset follows the Label Studio JSON MIN format:

[
  {
    'id':
    'url':
    'html':
    'html_en':
    'agency':
    'site':
    'title':
    'annotator':
    'annotation_id':
    'created_at':
    'updated_at':
    'lead_time':
    'labels': [
      {
        'text':
        'hypertextlabels':
        'start':
        'end':
        'endOffset':
        'startOffset':
        'globalOffsets':
      },
      ...]
  },
...]

We additionally added the html_en field, which contains the HTML translated into English.
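
A minimal sketch of reading this format: it groups the labeled text of one page by label name. It assumes that `hypertextlabels` holds a list of label names (a single string is also tolerated). The sample task, its label names and `group_labels` are hypothetical and only illustrate the shape of the data.

```python
from collections import defaultdict

# Invented sample with only the fields this sketch needs; real items carry
# all the fields listed in the structure above.
tasks = [
    {
        "id": 1,
        "url": "https://example.ru/news/1",
        "labels": [
            {"text": "Headline", "hypertextlabels": ["title"]},
            {"text": "Ivan Ivanov", "hypertextlabels": ["authors"]},
            {"text": "Petr Petrov", "hypertextlabels": ["authors"]},
        ],
    }
]


def group_labels(task):
    """Return {label name: [labeled text, ...]} for one page."""
    grouped = defaultdict(list)
    for region in task.get("labels", []):
        names = region.get("hypertextlabels", [])
        # Assumption: normally a list of names; accept a bare string too.
        if isinstance(names, str):
            names = [names]
        for name in names:
            grouped[name].append(region["text"])
    return dict(grouped)


grouped = group_labels(tasks[0])
print(grouped)
```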

Download

Citation

More details about the Russian-language dataset are available in our paper. Please cite us if you use or discuss this dataset in your work:

@INPROCEEDINGS{10076872,
  author={Varlamov, Maksim and Galanin, Denis and Bedrin, Pavel and Duda, Sergey and Lazarev, Vladimir and Yatskov, Alexander},
  booktitle={2022 Ivannikov Ispras Open Conference (ISPRAS)}, 
  title={A Dataset for Information Extraction from News Web Pages}, 
  year={2022},
  volume={},
  number={},
  pages={100-106},
  keywords={Annotations;Neural networks;Web pages;Data aggregation;Information retrieval;Data mining;Electronic commerce;web data extraction;information extraction;news;webpage dataset;neural networks},
  doi={10.1109/ISPRAS57371.2022.10076872}}

NewsListDataset

A dataset for extracting news records with their attributes from HTML pages.

Dataset Description

This dataset contains pages with lists of news in Russian. The following attributes were marked: title, date, tag, short_text, time, short_title, author.

Their distribution:

Attribute    Pages  Records  Domains
title        12679  247262   275
date         12296  241634   251
tag           6165  108400   140
short_text    6855  115983   138
time          1938   41892     8
short_title    105    1289     4
author          87     957     1

In total, the dataset contains 13099 pages.

Dataset Format

Each file in the data folder is a JSON dictionary with the following fields:

  • html: formatted HTML code of the page
  • exist_labels: labels that are present in the HTML
  • domain: domain of the page
  • labeled_xpaths: dictionary of XPaths and their labels
  • timestamp: timestamp of when the page was loaded
  • url: URL of the page
  • record_xpaths: XPaths of block nodes (first text node of each record)
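
The field list above can be turned into a small sanity check. The sketch below verifies that a record has all seven fields and counts its labels. It assumes that `labeled_xpaths` maps each XPath to a single label string. The value types in the sample record are guesses, and `missing_fields` and `label_counts` are hypothetical helpers.

```python
from collections import Counter

# Field names taken from the list above.
REQUIRED_FIELDS = {
    "html", "exist_labels", "domain", "labeled_xpaths",
    "timestamp", "url", "record_xpaths",
}

# Invented record; the value types (list, dict, integer timestamp) are
# assumptions made for illustration.
record = {
    "html": "<html><body><div><a>Headline</a><span>01.01.2022</span></div></body></html>",
    "exist_labels": ["title", "date"],
    "domain": "example.ru",
    "labeled_xpaths": {
        "/html/body/div/a": "title",
        "/html/body/div/span": "date",
    },
    "timestamp": 1640995200,
    "url": "https://example.ru/news",
    "record_xpaths": ["/html/body/div/a"],
}


def missing_fields(rec):
    """Return the sorted names of required fields absent from rec."""
    return sorted(REQUIRED_FIELDS - set(rec))


def label_counts(rec):
    """Count labeled nodes per label, assuming an XPath -> label mapping."""
    return Counter(rec["labeled_xpaths"].values())


print(missing_fields(record))
print(label_counts(record))
```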

Download

This file is a dump of a Python-like list object. Each item is a dictionary with the fields described in Dataset Format, so the list contains 13099 items.
