Hika van den Hoven edited this page May 21, 2016 · 15 revisions

The DataTreeGrab module consists of two main classes: the DataTreeGrab.HTMLtree class and the DataTreeGrab.JSONtree class. Both derive from the DataTreeGrab.DATAtree class and read a given HTML or JSON page into a tree of nodes with properties. Similarly, there are the DataTreeGrab.HTMLnode, DataTreeGrab.JSONnode and DataTreeGrab.DATAnode classes, but these are normally only used internally. The DataTreeGrab.NULLnode class is used to indicate a Null search result.
For HTML every tag represents a node with the following properties:

  • tag: the tag name, always in lower case
  • text: any text the tag contains
  • tail: any trailing text
  • attributes[<name>]: the attributes with their content. The attribute-name is converted to lower-case.
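The HTML node properties can be illustrated with a minimal sketch built on the standard library's `html.parser`. The `SketchNode` and `SketchBuilder` classes are hypothetical stand-ins, not the DataTreeGrab classes. The sketch assumes well-formed input without void tags such as `<br>`, and it assumes that `text` holds the text before the first child and `tail` the text following the node's closing tag (the convention used by `xml.etree.ElementTree`).

```python
from html.parser import HTMLParser


class SketchNode:
    """Hypothetical stand-in for an HTML node; not a DataTreeGrab class."""

    def __init__(self, tag, attributes, parent=None):
        self.tag = tag                # HTMLParser already lower-cases tag names
        self.attributes = attributes  # attribute names are lower-cased as well
        self.text = ""                # assumed: text before the first child
        self.tail = ""                # assumed: text after the closing tag
        self.parent = parent
        self.children = []


class SketchBuilder(HTMLParser):
    """Builds a small tree of SketchNode objects from well-formed HTML."""

    def __init__(self):
        super().__init__()
        self.root = SketchNode("root", {})
        self.current = self.root   # innermost tag that is still open
        self.last_closed = None    # last closed child of the current tag

    def handle_starttag(self, tag, attrs):
        node = SketchNode(tag, dict(attrs), self.current)
        self.current.children.append(node)
        self.current = node
        self.last_closed = None

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.last_closed = self.current
            self.current = self.current.parent

    def handle_data(self, data):
        # Text after a closed child is that child's tail,
        # anything else is the text of the open tag.
        if self.last_closed is not None:
            self.last_closed.tail += data
        else:
            self.current.text += data


builder = SketchBuilder()
builder.feed('<P Class="Intro">Hello <B>bold</B> world</P>')
builder.close()

p = builder.root.children[0]
b = p.children[0]
print(p.tag, p.attributes)  # tag and attribute name arrive in lower case
print(repr(p.text), repr(b.text), repr(b.tail))
```

Note that only the attribute name is lower-cased; the attribute value keeps its original case.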

For JSON every list, dict and value represents a node with the following properties:

  • type: [list|dict|value]
  • key: either the numeric list-index or the dict key
  • keys[]: a list of the child keys
  • key_index[]: the reverse of keys[], from child key back to its index
  • value
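As a rough illustration of the JSON node properties listed above, the sketch below builds such nodes from already parsed JSON data. `SketchJSONNode` is a hypothetical class, not DataTreeGrab.JSONnode, and modelling `key_index` as a dict from child key to position is an assumption about what "the reverse" of `keys[]` means.

```python
import json


class SketchJSONNode:
    """Hypothetical stand-in for a JSON node; not DataTreeGrab.JSONnode."""

    def __init__(self, data, key=None):
        self.key = key        # list index or dict key under the parent
        self.keys = []        # child keys, in child order
        self.key_index = {}   # assumed reverse lookup: child key -> position
        self.children = []
        self.value = None
        if isinstance(data, dict):
            self.type = "dict"
            items = list(data.items())
        elif isinstance(data, list):
            self.type = "list"
            items = list(enumerate(data))
        else:
            self.type = "value"
            self.value = data
            items = []
        for position, (child_key, child_data) in enumerate(items):
            self.keys.append(child_key)
            self.key_index[child_key] = position
            self.children.append(SketchJSONNode(child_data, child_key))


root = SketchJSONNode(json.loads('{"title": "News", "tags": ["a", "b"]}'))

# Look up a child of a dict node by key, then a child of a list node by index.
tags = root.children[root.key_index["tags"]]
second = tags.children[1]
print(root.type, tags.type, second.type)
print(tags.keys, second.key, second.value)
```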

Both also have the following properties:

  • parent: the parent node
  • children[]: a list of child-nodes
  • dtree: the tree the node belongs to and, through it, the root
  • level: its level, with the root being 0
  • child_index: an index among its siblings. This is the index in children[], keys[] and key_index[]
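The shared properties are enough to move through a tree in both directions. The following sketch uses a hypothetical `SketchNode` class that carries only `parent`, `children`, `level` and `child_index` (the `name` attribute is an illustrative label and `dtree` is left out). It is not the DataTreeGrab implementation.

```python
class SketchNode:
    """Hypothetical node with only the shared properties."""

    def __init__(self, name, parent=None):
        self.name = name  # illustrative label, not a documented property
        self.parent = parent
        self.children = []
        if parent is None:
            self.level = 0
            self.child_index = 0
        else:
            self.level = parent.level + 1
            self.child_index = len(parent.children)
            parent.children.append(self)


def path_from_root(node):
    """Follow parent links upward and collect the child_index per level."""
    path = []
    while node.parent is not None:
        path.append(node.child_index)
        node = node.parent
    return list(reversed(path))


def node_at(root, path):
    """Walk down from the root through children[] along a list of indexes."""
    node = root
    for index in path:
        node = node.children[index]
    return node


root = SketchNode("root")
head = SketchNode("head", root)
body = SketchNode("body", root)
div = SketchNode("div", body)
para = SketchNode("p", div)

path = path_from_root(para)
print(para.level, path)
print(node_at(root, path) is para)
```

A path of child indexes recorded this way identifies a node uniquely, so it can be stored and replayed against the same tree.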

Through these properties you can walk the tree and select the desired data. At present, the index for dicts in the JSONtree class has no meaning except as an internal reference. To use it against the original JSON data, we first have to add our own parser that bypasses Python's randomizing of the order within a dict structure.
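One possible way to keep dict members in document order, sketched here without any claim that DataTreeGrab does or will do this, is the `object_pairs_hook` argument of the standard library's `json.loads`. Combined with `collections.OrderedDict` it preserves the order of the original JSON text. On Python 3.7 and later plain dicts already keep insertion order, so this mainly matters on older interpreters.

```python
import json
from collections import OrderedDict

raw = '{"zeta": 1, "alpha": 2, "mid": 3}'

# object_pairs_hook receives the (key, value) pairs in document order,
# so an OrderedDict built from them keeps that order.
data = json.loads(raw, object_pairs_hook=OrderedDict)

keys = list(data.keys())
print(keys)  # ['zeta', 'alpha', 'mid']
```

With the order preserved, a positional index for a dict member would refer to the same member in the tree and in the source document.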