Hika van den Hoven edited this page Jun 26, 2016 · 15 revisions

The DataTreeGrab module consists of two main classes: the DataTreeGrab.HTMLtree class and the DataTreeGrab.JSONtree class. They read a given HTML or JSON page into a tree of nodes with properties and both derive from the DataTreeGrab.DATAtree class.
Similarly, there are the DataTreeGrab.HTMLnode, DataTreeGrab.JSONnode and DataTreeGrab.DATAnode classes, but these are normally only called internally. The DataTreeGrab.NULLnode class is used to indicate an empty (null) search result. At present most errors are handled silently inside the module.
Version 1.1 adds a warnings framework and a new DataTreeGrab.DataTreeShell class.

Every node has the following properties:

  • parent: the parent node
  • children[]: a list of child-nodes
  • dtree: the tree and through it its root
  • level: its level, with the root being 0
  • child_index: an index among its siblings. This is the index in children[], keys[] and key_index[]
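
The relation between these properties can be illustrated with a minimal sketch. The `SketchNode` class and the `walk` helper below are hypothetical stand-ins written for this page, not the DataTreeGrab classes themselves; in particular, giving the root a `child_index` of 0 is an assumption.

```python
class SketchNode(object):
    """Hypothetical stand-in for a tree node; not DataTreeGrab's own class."""

    def __init__(self, dtree, parent=None):
        self.dtree = dtree      # the owning tree object
        self.parent = parent    # None for the root
        self.children = []      # child nodes, in the order they were added
        if parent is None:
            self.level = 0
            self.child_index = 0  # assumption: the root has no siblings
        else:
            self.level = parent.level + 1
            # The index is the position this node gets in parent.children
            self.child_index = len(parent.children)
            parent.children.append(self)


def walk(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        for item in walk(child):
            yield item


tree = object()  # placeholder for the tree object
root = SketchNode(tree)
first = SketchNode(tree, root)
second = SketchNode(tree, root)
grandchild = SketchNode(tree, second)
levels = [n.level for n in walk(root)]
```

Here `second.child_index` is its position in `root.children`, and `grandchild.parent.parent` leads back to the root.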

For HTML every tag represents a node with the following additional properties:

  • tag: the tag-name: always lower-case
  • text: any text the tag contains
  • tail: any trailing text after the closing tag
  • attributes[<name>]: the attributes with their content. The attribute-name is converted to lower-case.
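
The difference between text and tail is easiest to see on a small fragment. The sketch below uses Python's standard `xml.etree.ElementTree`, which happens to use the same text/tail convention; it is not how DataTreeGrab parses HTML, and the `describe` helper is a hypothetical name introduced only for this illustration. The lower-casing mimics the behaviour described above.

```python
import xml.etree.ElementTree as ET


def describe(element):
    """Return the four HTML node properties for an ElementTree element."""
    return {
        "tag": element.tag.lower(),
        "text": element.text or "",
        "tail": element.tail or "",
        # Attribute names are lower-cased, the values are left untouched
        "attributes": dict(
            (name.lower(), value) for name, value in element.attrib.items()
        ),
    }


# ElementTree needs well-formed markup, so the fragment is kept simple
root = ET.fromstring('<P Class="intro">Hello <B>bold</B> world</P>')
outer = describe(root)
inner = describe(root[0])
```

In this fragment "Hello " is the text of the `p` node, "bold" is the text of the `b` node, and " world" is the tail of the `b` node, not part of the text of `p`.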

For JSON every list, dict and value represents a node with the following additional properties:

  • type: [list|dict|value]
  • key: either the numeric list-index or the dict key
  • keys[]: a list of the child keys
  • key_index[]: the reverse of keys[], giving the index for a child key
  • value
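
As a rough illustration of how these properties fit together, the sketch below builds a comparable structure from plain dicts. The `build` function is hypothetical and not part of DataTreeGrab; storing `key_index` as a dict and setting `value` only on value nodes are assumptions made for this example.

```python
import json


def build(data, key=None):
    """Turn parsed JSON into nested dicts carrying the node properties."""
    node = {"key": key, "children": [], "keys": [], "key_index": {}}
    if isinstance(data, dict):
        node["type"] = "dict"
        items = list(data.items())
    elif isinstance(data, list):
        node["type"] = "list"
        items = list(enumerate(data))  # the key is the numeric list index
    else:
        node["type"] = "value"
        node["value"] = data
        items = []
    for index, (child_key, child_data) in enumerate(items):
        node["keys"].append(child_key)          # index -> key
        node["key_index"][child_key] = index    # key -> index
        node["children"].append(build(child_data, child_key))
    return node


tree = build(json.loads('{"title": "News", "tags": ["a", "b"]}'))
# Look a child up by key through key_index rather than by position
tags = tree["children"][tree["key_index"]["tags"]]
```

Looking children up through `key_index` keeps the example independent of the order in which the dict keys happen to be stored.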

With these properties you can walk the tree and select the desired data. At present, in the JSONtree class the index of a dict child has no meaning except as an internal reference. Using it against the original JSON data would first require a custom parser that bypasses Python's randomized ordering within a dict.
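
For reference, Python's standard `json` module can already keep the key order of the source text through its `object_pairs_hook` parameter. The sketch below only shows that standard-library option; it is not something DataTreeGrab is stated to do, and whether it could serve as the parser mentioned above is an assumption.

```python
import json
from collections import OrderedDict

raw = '{"zebra": 1, "apple": 2, "mango": 3}'

# object_pairs_hook receives the key/value pairs in document order,
# so an OrderedDict keeps the order of the original JSON text
ordered = json.loads(raw, object_pairs_hook=OrderedDict)
keys = list(ordered.keys())
position = keys.index("mango")
```

With the order preserved, the position of a key in `keys` matches its position in the original JSON text.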