Data Model for tree structured Document Representation #1022

bohou-aryn · 2024-11-13T23:43:46Z

No description provided.

bsowell · 2024-11-15T18:48:41Z

lib/sycamore/sycamore/document.py

@@ -0,0 +1,388 @@
+from abc import abstractmethod, ABC


Is your intention to merge this PR into Sycamore, or just to use it as a vehicle to discuss the data model?

(To be clear, I want to get this into Sycamore, just want to get a sense of whether we should discuss the nuts and bolts now defer that).

I don't have opinions on this. It's quite isolated from other transforms today, so should be able to put anywhere.

bsowell · 2024-11-15T18:50:07Z

lib/sycamore/sycamore/document.py

+    Group = 102
+
+
+class Node:


minor: Again we can defer this, but we should probably call this something like DocNode since we already use Node for the query plan.

sure, or differentiate by package name.

bsowell · 2024-11-15T18:51:56Z

lib/sycamore/sycamore/document.py

+    def category(self):
+        pass
+
+    def metadata(self):


Is this intended to replace properties? What is the signature, dict[str, Any]? Also, how would you update it if you want to add properties from like entity extraction or something?

yes, it's dict[str, Any]. Maybe not get your point, a visitor could help here.

bsowell · 2024-11-15T18:54:28Z

lib/sycamore/sycamore/document.py

+        return self._children
+
+
+class Caption(Leaf):


Certainly not critical, but would it make sense to create a parent class for all of the text-like leaf nodes (e.g. list items, text, etc.) I could imagine a lot of operations where you want to do something special for tables and images, but something common for all the text.

bsowell · 2024-11-15T18:55:36Z

lib/sycamore/sycamore/document.py

+    def category(self):
+        return Category.Caption
+
+    def continued(self):


What are the semantics here? Why does Text have a continued field, but not, say, tables?

This is for the split objects. Table does have, it's just nothing is support there like merge across pages.

bsowell · 2024-11-15T18:58:07Z

lib/sycamore/sycamore/document.py

+        return visitor.visit_section(self)
+
+
+class Visitor(ABC):


Just curious, what was your rationale for using a visitor like this vs a traverser like we do for plan nodes (

sycamore/lib/sycamore/sycamore/plan_nodes.py

Line 10 in 9ddcaef

class NodeTraverse:

). To be clear, I don't think I have a strong opinion either way, just curious if you had a reason to prefer one over the other.

So the consideration is to have an easy support of polymorphism and scalability. One thing I think a long time is how to handle split objects nicely where I end up with a linked list. If you have a better idea, let me know.

I'm not aware of this nodetraverse before. I looked a little bit on the printtraverse, seems it still either depend on modifying each node for polymorphism or squeeze in a big function handling all various node types inside those before/after/visit function.

I commented on this above. The important thing about NodeTraverse is it lets you do pre and post traversal mutations of the tree. It also supported running a function once at the start of traversal of a tree.

If there is no intent to mutate the tree, then the order doesn't matter, but then I would expect the visitor to not be returning things.

Probably better to work this out in a design doc than code though.

bsowell · 2024-11-15T18:58:56Z

lib/sycamore/sycamore/document.py

+
+class Visitor(ABC):
+    @abstractmethod
+    def visit_caption(self, caption: Caption):


Can the visit methods mutate the tree or are they purely read only? Do they return anything?

I'm just assuming it could return anything considering the dynamic typing, no generic or subtying needed here at all?

HenryL27 · 2024-11-19T20:59:05Z

lib/sycamore/sycamore/document.py

+    @abstractmethod
+    def visit_section(self, section: Section):
+        pass
+


I feel like a visit_generic or something would be useful. Something like

@abstractmethod def visit_generic(self, node: Node): pass

and then the visit_some_type methods can default to calling visit_generic. I'm just looking at these and thinking it will be an absolute pain to have to explicitly handle each type when I wanna implement this interface ya know? Could save a bunch of code duplication in the NaiveSemanticVisitor eg

eric-anderson · 2024-11-19T21:03:32Z

lib/sycamore/sycamore/document.py

+        return Category.Caption
+
+    def accept(self, visitor):
+        return visitor.visit_caption(self)


I would recommend moving the switch logic into the visitor class.
In particular, I would have:

class Caption(Leaf): def __init__(self, node_id, box, content, properties): super().__init__(node_id, box, content, properties, category=Category.Caption) class Leaf(...): def category(self): return self.category def accept(self, visitor): return visitor.visit(self, self.category) # optional to just have self since the category is bound class Visitor(): def visit(self, node, category): if category == Category.Caption: self.visit_caption(node) elif ...

That way someone can override just the visit function if they want to handle the grouping differently rather than having to implement a whole bunch of functions.

You may also want to implement more flexible visiting. For plan nodes we needed pre-traversal, post-traversal, in-place mutation, and single call for the entire tree.
See plan_nodes.py:NodeTraverse. My guess is that if we're going to do mutations of the structure via traversal, we'll end up needing the same thing.

eric-anderson · 2024-11-19T21:06:11Z

lib/sycamore/sycamore/document.py

+        return visitor.visit_section(self)
+
+
+class Visitor(ABC):


I commented on this above. The important thing about NodeTraverse is it lets you do pre and post traversal mutations of the tree. It also supported running a function once at the start of traversal of a tree.

If there is no intent to mutate the tree, then the order doesn't matter, but then I would expect the visitor to not be returning things.

Probably better to work this out in a design doc than code though.

eric-anderson · 2024-11-19T21:12:52Z

lib/sycamore/sycamore/document.py

+
+
+class NaiveSemanticVisitor(Visitor):
+    def visit_caption(self, caption):


This code is exactly why I would have a single visit function, then it could just be:

def visit (self, node, type): assert type != Content.Picture, "unimplemented" if type == Content.Table: return node.content().to_csv() if type == Content.Text: contents = text.content() # why isn't there a text.all_content()? cur = text.continued() while cur: contents = contents.rstrip() + " " + cur.content() cur = cur.continued() return contents if type == Content.Group or type == Content.Section: contents = [child.accept(self) for child in group.children()] return " ".join(contents) return node.content()

This more than halves the lines of code, makes it clear which things are handled similarly, and likely does the right thing if a new node type is introduced.

Hey, thanks for your feedback. I actually have different opinions on this.

this does not align with a pre-existing design pattern

Considering code scalability, I feel there is more requirement to extend new functions/transformations instead of a new node type.

the visitor considers more about the inheritance, for the above naivesemanticvisitor example, it's possible to have requirement for a newer semanticvisitor which only modifies the function for handling table and leave all the other unchanged.

eric-anderson · 2024-11-19T21:26:26Z

lib/sycamore/sycamore/document.py

+        token_dict = {}
+        chunks = []
+
+        def post_traverse(node):


traversal should be an operation on a node so you don't have to reimplement it.
I think you should be able to write:

def post_fn(node): semantic = node.accept(visitor) count = tokenizer.count(semantic) token_dict[node.node_id()] = (count, semantic) # I would have stored this in the node def pre_traverse(node): count, semantic = token_dict[node.node_id()] # I would fetch this from the node if count <= tokenizer.limit(): chunks.append(semantic) return None # abort recursion if isinstance(node, Internal): return node chunks.append(semantic) self._root.traverse(post_fn = post_fn, mutating=False) self._root.traverse(pre_fn = pre_fn, mutating=False)

bohou-aryn requested a review from bsowell November 13, 2024 23:43

bohou-aryn force-pushed the tree branch from 920aff1 to dda6e70 Compare November 14, 2024 17:33

bsowell reviewed Nov 15, 2024

View reviewed changes

Data Model for tree structured Document Representation

260f40b

bohou-aryn force-pushed the tree branch from dda6e70 to 260f40b Compare November 18, 2024 21:24

HenryL27 reviewed Nov 19, 2024

View reviewed changes

eric-anderson reviewed Nov 19, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Data Model for tree structured Document Representation #1022

Data Model for tree structured Document Representation #1022

bohou-aryn commented Nov 13, 2024

bsowell Nov 15, 2024

bohou-aryn Nov 15, 2024 •

edited

Loading

bsowell Nov 15, 2024

bohou-aryn Nov 15, 2024

bsowell Nov 15, 2024

bohou-aryn Nov 15, 2024

bsowell Nov 15, 2024

bohou-aryn Nov 15, 2024

bsowell Nov 15, 2024

bohou-aryn Nov 15, 2024

bsowell Nov 15, 2024

bohou-aryn Nov 15, 2024 •

edited

Loading

eric-anderson Nov 19, 2024

bsowell Nov 15, 2024

bohou-aryn Nov 15, 2024 •

edited

Loading

HenryL27 Nov 19, 2024

eric-anderson Nov 19, 2024

eric-anderson Nov 19, 2024

eric-anderson Nov 19, 2024

bohou-aryn Nov 20, 2024

eric-anderson Nov 19, 2024



		class NaiveSemanticVisitor(Visitor):
		def visit_caption(self, caption):

Data Model for tree structured Document Representation #1022

Are you sure you want to change the base?

Data Model for tree structured Document Representation #1022

Conversation

bohou-aryn commented Nov 13, 2024

Choose a reason for hiding this comment

bohou-aryn Nov 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bohou-aryn Nov 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bohou-aryn Nov 15, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bohou-aryn Nov 15, 2024 •

edited

Loading

bohou-aryn Nov 15, 2024 •

edited

Loading

bohou-aryn Nov 15, 2024 •

edited

Loading