-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Data Model for tree structured Document Representation #1022
base: main
Are you sure you want to change the base?
Conversation
@@ -0,0 +1,388 @@ | |||
from abc import abstractmethod, ABC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your intention to merge this PR into Sycamore, or just to use it as a vehicle to discuss the data model?
(To be clear, I want to get this into Sycamore, just want to get a sense of whether we should discuss the nuts and bolts now defer that).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have opinions on this. It's quite isolated from other transforms today, so should be able to put anywhere.
Group = 102 | ||
|
||
|
||
class Node: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Again we can defer this, but we should probably call this something like DocNode since we already use Node for the query plan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, or differentiate by package name.
lib/sycamore/sycamore/document.py
Outdated
def category(self): | ||
pass | ||
|
||
def metadata(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this intended to replace properties? What is the signature, dict[str, Any]
? Also, how would you update it if you want to add properties from like entity extraction or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it's dict[str, Any]. Maybe not get your point, a visitor could help here.
return self._children | ||
|
||
|
||
class Caption(Leaf): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Certainly not critical, but would it make sense to create a parent class for all of the text-like leaf nodes (e.g. list items, text, etc.) I could imagine a lot of operations where you want to do something special for tables and images, but something common for all the text.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sure.
def category(self): | ||
return Category.Caption | ||
|
||
def continued(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are the semantics here? Why does Text have a continued field, but not, say, tables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for the split objects. Table does have, it's just nothing is support there like merge across pages.
return visitor.visit_section(self) | ||
|
||
|
||
class Visitor(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, what was your rationale for using a visitor like this vs a traverser like we do for plan nodes (
class NodeTraverse: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the consideration is to have an easy support of polymorphism and scalability. One thing I think a long time is how to handle split objects nicely where I end up with a linked list. If you have a better idea, let me know.
I'm not aware of this nodetraverse before. I looked a little bit on the printtraverse, seems it still either depend on modifying each node for polymorphism or squeeze in a big function handling all various node types inside those before/after/visit function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented on this above. The important thing about NodeTraverse is it lets you do pre and post traversal mutations of the tree. It also supported running a function once at the start of traversal of a tree.
If there is no intent to mutate the tree, then the order doesn't matter, but then I would expect the visitor to not be returning things.
Probably better to work this out in a design doc than code though.
|
||
class Visitor(ABC): | ||
@abstractmethod | ||
def visit_caption(self, caption: Caption): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the visit methods mutate the tree or are they purely read only? Do they return anything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm just assuming it could return anything considering the dynamic typing, no generic or subtying needed here at all?
@abstractmethod | ||
def visit_section(self, section: Section): | ||
pass | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like a visit_generic or something would be useful. Something like
@abstractmethod
def visit_generic(self, node: Node):
pass
and then the visit_some_type
methods can default to calling visit_generic. I'm just looking at these and thinking it will be an absolute pain to have to explicitly handle each type when I wanna implement this interface ya know? Could save a bunch of code duplication in the NaiveSemanticVisitor eg
return Category.Caption | ||
|
||
def accept(self, visitor): | ||
return visitor.visit_caption(self) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend moving the switch logic into the visitor class.
In particular, I would have:
class Caption(Leaf):
def __init__(self, node_id, box, content, properties):
super().__init__(node_id, box, content, properties, category=Category.Caption)
class Leaf(...):
def category(self):
return self.category
def accept(self, visitor):
return visitor.visit(self, self.category) # optional to just have self since the category is bound
class Visitor():
def visit(self, node, category):
if category == Category.Caption:
self.visit_caption(node)
elif ...
That way someone can override just the visit function if they want to handle the grouping differently rather than having to implement a whole bunch of functions.
You may also want to implement more flexible visiting. For plan nodes we needed pre-traversal, post-traversal, in-place mutation, and single call for the entire tree.
See plan_nodes.py:NodeTraverse. My guess is that if we're going to do mutations of the structure via traversal, we'll end up needing the same thing.
return visitor.visit_section(self) | ||
|
||
|
||
class Visitor(ABC): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented on this above. The important thing about NodeTraverse is it lets you do pre and post traversal mutations of the tree. It also supported running a function once at the start of traversal of a tree.
If there is no intent to mutate the tree, then the order doesn't matter, but then I would expect the visitor to not be returning things.
Probably better to work this out in a design doc than code though.
|
||
|
||
class NaiveSemanticVisitor(Visitor): | ||
def visit_caption(self, caption): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is exactly why I would have a single visit function, then it could just be:
def visit (self, node, type):
assert type != Content.Picture, "unimplemented"
if type == Content.Table:
return node.content().to_csv()
if type == Content.Text:
contents = text.content() # why isn't there a text.all_content()?
cur = text.continued()
while cur:
contents = contents.rstrip() + " " + cur.content()
cur = cur.continued()
return contents
if type == Content.Group or type == Content.Section:
contents = [child.accept(self) for child in group.children()]
return " ".join(contents)
return node.content()
This more than halves the lines of code, makes it clear which things are handled similarly, and likely does the right thing if a new node type is introduced.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, thanks for your feedback. I actually have different opinions on this.
- this does not align with a pre-existing design pattern
- Considering code scalability, I feel there is more requirement to extend new functions/transformations instead of a new node type.
- the visitor considers more about the inheritance, for the above naivesemanticvisitor example, it's possible to have requirement for a newer semanticvisitor which only modifies the function for handling table and leave all the other unchanged.
token_dict = {} | ||
chunks = [] | ||
|
||
def post_traverse(node): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
traversal should be an operation on a node so you don't have to reimplement it.
I think you should be able to write:
def post_fn(node):
semantic = node.accept(visitor)
count = tokenizer.count(semantic)
token_dict[node.node_id()] = (count, semantic) # I would have stored this in the node
def pre_traverse(node):
count, semantic = token_dict[node.node_id()] # I would fetch this from the node
if count <= tokenizer.limit():
chunks.append(semantic)
return None # abort recursion
if isinstance(node, Internal):
return node
chunks.append(semantic)
self._root.traverse(post_fn = post_fn, mutating=False)
self._root.traverse(pre_fn = pre_fn, mutating=False)
No description provided.