Extracting Data

chastk edited this page Jul 25, 2019 · 5 revisions

SETLr supports extracting data from tabular sources, XML, JSON, and parsed HTML. These are described by making an artifact that prov:wasGeneratedBy some setl:Extract. In all cases, the URI that the setl:Extract prov:used is the URL of the resource to be downloaded. The types of the artifact determine which methods are used to extract the file into an in-memory representation. This takes the following form:

<?artifact> a <?filetype>;
  prov:wasGeneratedBy [
    a setl:Extract;
    prov:used <?downloadurl>;
  ].

Here, ?artifact is the URI of the artifact that will be processed by transform and load actions. The ?filetype is the method used to parse the file, as discussed in the sections below. The ?downloadurl is the URL that the file will be downloaded from using a plain HTTP GET.

When iterating through the elements of the resource, each element is stored in the row variable. The index of the element is stored in the i variable.

Extracting Tabular Data

The following example describes a process where a setl:Table entity, called :table, is generated by a setl:Extract activity that uses the file social.csv:

:table a csvw:Table, setl:Table;
  csvw:delimiter ",";
  prov:wasGeneratedBy [
    a setl:Extract;
    prov:used <social.csv>;
  ].

The type csvw:Table tells SETLr that the table is to be interpreted as a CSV table, using the CSV on the Web vocabulary. SETLr supports indicating the delimiter used (using csvw:delimiter) and the number of initial rows to skip (using csvw:skipRows). setl:Table artifacts are parsed internally into a Pandas data frame, and directly extracting RDF files is also supported. SETLr supports extracting the following data types:

| Type | Format | Options | Parsed Type |
| --- | --- | --- | --- |
| csvw:Table, setl:Table | Comma (or other) Separated Value (CSV, TSV, etc.) | csvw:delimiter, csvw:skipRows | Data Frame |
| setl:XPORT, setl:Table | SAS Transport (XPORT) file format | | Data Frame |
| setl:SAS7BDAT, setl:Table | SAS Dataset file format | | Data Frame |
| setl:Excel, setl:Table | XLS or XLSX file format | setl:sheetname | Data Frame |
| setl:HTML | HTML file | | Beautiful Soup parse tree |
| owl:Ontology | OWL Ontology file in RDF | | RDF Graph |
| void:Dataset | RDF file | | RDF Graph |
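For instance, the tabular options above can be combined as follows. These snippets are illustrative: the filenames, sheet name, and delimiter are placeholders, not values required by SETLr.

```turtle
# A tab-separated file whose first two rows should be skipped.
:tsv_table a csvw:Table, setl:Table;
  csvw:delimiter "\t";
  csvw:skipRows 2;
  prov:wasGeneratedBy [
    a setl:Extract;
    prov:used <data.tsv>;
  ].

# A specific sheet from an Excel workbook.
:sheet a setl:Excel, setl:Table;
  setl:sheetname "Sheet1";
  prov:wasGeneratedBy [
    a setl:Extract;
    prov:used <workbook.xlsx>;
  ].
```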

Rows extracted from tabular data are provided as Pandas Series objects, and the file is streamed through in chunks.

Compressed Files

SETLr supports compressed files through the addition of a type to the artifact: setl:ZipFile takes the first file from a zip archive, and setl:GZipFile decompresses a gzip file. Add the type to the generated artifact:

<?x> a setl:ZipFile.
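For example, a gzipped CSV file might be described by combining the compression type with the tabular types (the filename here is illustrative):

```turtle
:table a csvw:Table, setl:Table, setl:GZipFile;
  csvw:delimiter ",";
  prov:wasGeneratedBy [
    a setl:Extract;
    prov:used <social.csv.gz>;
  ].
```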

Extracting XML Data

SETLr supports iterating through XML files using the LXML implementation of XPath by setting the artifact to type setl:XML. This is done using the setl:xpath property on the artifact that is generated. To use namespaced elements include the namespace in {} before the local element name. If you would like to process the entire XML tree at once, leave out the setl:xpath assertion to get the entire tree as a single row.

<?x> a setl:XML;
  setl:xpath "/path/to/{http://example.com/namespace}element";
  prov:wasGeneratedBy [
    a setl:Extract;
    prov:used <?downloadurl>;
  ].

The template will be called for every XPath match, with the matched subtree assigned to the row variable. The object is an lxml etree Element.
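As a hypothetical illustration, given setl:xpath "/feed/{http://example.com/ns}entry", a document like the following would produce one row per matched element:

```xml
<feed xmlns:ex="http://example.com/ns">
  <ex:entry id="1">first</ex:entry>
  <ex:entry id="2">second</ex:entry>
</feed>
```

Here the template would be invoked twice, with row bound first to the element with id="1" and then to the element with id="2".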

Extracting JSON Data

JSON can be loaded by setting the artifact type to setl:JSON, and can be streamed using the ijson library's selector language via the api_vocab:selector property.

<?x> a setl:JSON;
  api_vocab:selector "item";
  prov:wasGeneratedBy [
    a setl:Extract;
    prov:used <?downloadurl>;
  ].

Templates iterate over the stream of JSON subtrees, with each match assigned as row. The object is given as if it had been parsed by the Python json module, as the appropriate combination of lists, dicts, strings, ints, and floats.
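As a hypothetical illustration, with the selector "item" shown above, a document whose top level is a JSON array yields one row per element (in ijson's selector language, "item" matches each element of the top-level array):

```json
[
  {"name": "alice", "age": 30},
  {"name": "bob", "age": 25}
]
```

The template would be called twice here, with row bound first to the "alice" object and then to the "bob" object.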

Extracting Custom Data

This is a little more complicated, but allows for the integration of custom data parsers and preprocessors. SETLr enables the embedding of Python scripts within its workflows, which can provide artifacts in the same way as the built-in extractors. Start by downloading the artifact as plain text, which provides the stream without attempting to parse it, then write a Python script that uses the artifact. Set the result variable to an enumeration over whatever parsed entries come out of the parser; those will be passed as the row object to your template.

<?artifact> a <https://www.iana.org/assignments/media-types/text/plain>;
  prov:wasGeneratedBy [
    a setl:Extract;
    prov:used  <?downloadurl>
  ].

<?x> a owl:Class, prov:SoftwareAgent, setl:PythonScript;
  rdfs:subClassOf prov:Activity;
  prov:qualifiedDerivation [
    a prov:Derivation;
    prov:entity <?artifact>;
    prov:hadRole [ dcterms:identifier "input_file"]
  ];
  prov:value '''
import myparser # Whatever parser you want to include here, do it however you need to.
entries = myparser.load(input_file)
result = enumerate(entries)
'''.
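As a concrete (hypothetical) instance of such a script body, the custom parser could be Python's built-in csv module. In SETLr, input_file is the artifact stream bound via the prov:hadRole identifier above; here it is simulated with an in-memory file so the sketch is self-contained.

```python
import csv
import io

# Simulated stand-in for the `input_file` stream that SETLr binds
# via prov:hadRole "input_file".
input_file = io.StringIO("name,age\nalice,30\nbob,25\n")

# Parse each line into a dict; any custom parser works the same way,
# as long as it yields one entry per logical record.
entries = csv.DictReader(input_file)

# `result` must be an enumeration over the entries; each (i, row) pair
# is passed to the template as the `i` and `row` variables.
result = enumerate(entries)
```
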

Handling Streamed Data

If you are streaming through large files, it may be important to persist your RDF graph to disk instead of holding it in memory. To do so, add the setl:Persisted type to any artifacts generated by templates that use streamed data; otherwise, you will be streaming into memory and may run out of space. This is not enabled automatically because, for smaller outputs, in-memory graphs often outperform on-disk storage.
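The type is added to the generated artifact in the same way as the compressed-file types:

```turtle
<?output> a setl:Persisted.
```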