Extracting Data
SETLr supports extracting data from tabular sources, XML, JSON, and parsed HTML. These are described by making an artifact that prov:wasGeneratedBy some setl:Extract. In all cases, the URI that the setl:Extract prov:used is the URL of the resource to be downloaded. The types of the artifact determine which methods are used to extract the file into an in-memory representation. This takes the following form:
<?artifact> a <?filetype>;
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <?downloadurl>;
    ].
Here, `?artifact` is the URI of the artifact that will be processed by transform and load actions. The `?filetype` is the method used to parse the file, and is discussed in the sections below. The `?downloadurl` is the URL that the file will be downloaded from using a plain HTTP GET.
When iterating through the elements of the resource, each element is stored in the `row` variable, and the index of the element is stored in the `i` variable.
The following example describes a process where a setl:Table entity, called `:table`, is generated by a setl:Extract activity that uses the file `social.csv`:
:table a csvw:Table, setl:Table;
    csvw:delimiter ",";
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <social.csv>;
    ].
The type csvw:Table tells SETLr that the table is to be interpreted as a CSV table, using the CSV on the Web vocabulary. SETLr supports indicating the delimiter used (with csvw:delimiter) and the number of initial rows to skip (with csvw:skipRows); see the example after the table. setl:Table artifacts are parsed internally into a data frame object using Pandas, and directly extracting RDF files is also supported. SETLr supports extracting the following data types:
| Type | Format | Options | Parsed Type |
|---|---|---|---|
| csvw:Table, setl:Table | Comma (or other) Separated Value (CSV, TSV, etc.) | csvw:delimiter, csvw:skipRows | Data Frame |
| setl:XPORT, setl:Table | SAS Transport (XPORT) file format | | Data Frame |
| setl:SAS7BDAT, setl:Table | SAS Dataset file format | | Data Frame |
| setl:Excel, setl:Table | XLS or XLSX file format | | Data Frame |
| setl:HTML | HTML file | | Beautiful Soup parse tree |
| owl:Ontology | OWL ontology file in RDF | | RDF graph |
| void:Dataset | RDF file | | RDF graph |
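For example, a tab-separated file whose first two rows should be skipped might be described like this (a minimal sketch; the file name `measurements.tsv` and the row count are hypothetical):

:table a csvw:Table, setl:Table;
    csvw:delimiter "\t";
    csvw:skipRows 2;
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <measurements.tsv>;
    ].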
Each row extracted from tabular data is provided as a Pandas Series object, and the file is streamed through in chunks.
SETLr supports compressed files through the addition of a type to the artifact: setl:ZipFile will take the first file from a ZIP archive, and setl:GZipFile will take the file from a gzip file. Add the compression type to the types of the generated artifact:
<?x> a setl:ZipFile.
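The compression type combines with the other artifact types. For instance, a gzip-compressed CSV file might be described like this (a sketch; the file name is hypothetical):

:table a csvw:Table, setl:Table, setl:GZipFile;
    csvw:delimiter ",";
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <social.csv.gz>;
    ].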
SETLr supports iterating through XML files using the lxml implementation of XPath by setting the artifact type to setl:XML. The expression is given with the setl:xpath property on the generated artifact. To match namespaced elements, include the namespace in {} before the local element name. If you would like to process the entire XML tree at once, leave out the setl:xpath assertion; the entire tree is then provided as a single row.
<?x> a setl:XML;
    setl:xpath "/path/to/{http://example.com/namespace}element";
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <?downloadurl>;
    ].
The template will be called for every XPath match, with the matching subtree assigned as the `row` variable. The object is an lxml etree Element.
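As a concrete sketch (the URL and path are hypothetical), the following would invoke the template once per item element of an RSS feed:

:feed a setl:XML;
    setl:xpath "/rss/channel/item";
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <https://example.com/feed.xml>;
    ].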
JSON can be loaded by setting the artifact type to setl:JSON, and can be streamed over using the ijson library's selector language via the api_vocab:selector property.
<?x> a setl:JSON;
    api_vocab:selector "item";
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <?downloadurl>;
    ].
Templates iterate over the stream of JSON subtrees, with each match assigned as the `row` variable. The object is given as if it had been parsed by the Python json module: the appropriate combination of lists, dicts, strings, ints, and floats.
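In ijson's prefix language, `item` matches each element of a top-level array, and dotted prefixes descend into objects. So, assuming the selector is passed straight through to ijson, a document shaped like `{"results": [...]}` could be streamed one array element at a time (a sketch):

<?x> a setl:JSON;
    api_vocab:selector "results.item";
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <?downloadurl>;
    ].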
Custom extraction scripts are a little more complicated, but allow for the integration of custom data parsers and preprocessors. SETLr enables the embedding of Python scripts within its workflows, which can provide artifacts in the same way as the built-in extractors. Start by downloading an artifact as plain text, which provides the stream without attempting to parse it, and then write a Python script that uses the artifact. Set the `result` variable to an enumeration over whatever parsed entries come out of the parser, and those will be passed as the `row` object to your template.
<?artifact> a <https://www.iana.org/assignments/media-types/text/plain>;
    prov:wasGeneratedBy [
        a setl:Extract;
        prov:used <?downloadurl>;
    ].
<?x> a owl:Class, prov:SoftwareAgent, setl:PythonScript;
    rdfs:subClassOf prov:Activity;
    prov:qualifiedDerivation [
        a prov:Derivation;
        prov:entity <?artifact>;
        prov:hadRole [ dcterms:identifier "input_file" ]
    ];
    prov:value '''
import myparser  # Whatever parser you want to include here, do it however you need to.
entries = myparser.load(input_file)
result = enumerate(entries)
'''.
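As a concrete sketch, assuming `input_file` behaves like an open text stream, a script that parses CSV with Python's standard csv module could look like this:

<?x> a owl:Class, prov:SoftwareAgent, setl:PythonScript;
    rdfs:subClassOf prov:Activity;
    prov:qualifiedDerivation [
        a prov:Derivation;
        prov:entity <?artifact>;
        prov:hadRole [ dcterms:identifier "input_file" ]
    ];
    prov:value '''
import csv
# input_file is the plain-text stream bound by dcterms:identifier above
# (an assumption of this sketch). DictReader yields one dict per data row,
# keyed by the header line.
entries = csv.DictReader(input_file)
result = enumerate(entries)
'''.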
If you are streaming through large files, it may be important to persist your RDF graph to disk instead of holding it in memory. To do this, add the setl:Persisted type to any artifacts generated by templates that consume streamed data; otherwise the output accumulates in memory and can run out of space. This is not enabled automatically because, for smaller outputs, in-memory graphs often outperform storage on disk.
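For example (a sketch; the artifact name is hypothetical):

:output a void:Dataset, setl:Persisted.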