Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Model for tree structured Document Representation #1022

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

bohou-aryn
Copy link
Collaborator

No description provided.

@@ -0,0 +1,388 @@
from abc import abstractmethod, ABC
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your intention to merge this PR into Sycamore, or just to use it as a vehicle to discuss the data model?

(To be clear, I want to get this into Sycamore, just want to get a sense of whether we should discuss the nuts and bolts now defer that).

Copy link
Collaborator Author

@bohou-aryn bohou-aryn Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have opinions on this. It's quite isolated from other transforms today, so should be able to put anywhere.

Group = 102


class Node:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: Again we can defer this, but we should probably call this something like DocNode since we already use Node for the query plan.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, or differentiate by package name.

def category(self):
pass

def metadata(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this intended to replace properties? What is the signature, dict[str, Any]? Also, how would you update it if you want to add properties from like entity extraction or something?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it's dict[str, Any]. Maybe not get your point, a visitor could help here.

return self._children


class Caption(Leaf):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certainly not critical, but would it make sense to create a parent class for all of the text-like leaf nodes (e.g. list items, text, etc.) I could imagine a lot of operations where you want to do something special for tables and images, but something common for all the text.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sure.

def category(self):
return Category.Caption

def continued(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the semantics here? Why does Text have a continued field, but not, say, tables?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for the split objects. Table does have, it's just nothing is support there like merge across pages.

return visitor.visit_section(self)


class Visitor(ABC):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, what was your rationale for using a visitor like this vs a traverser like we do for plan nodes (

class NodeTraverse:
). To be clear, I don't think I have a strong opinion either way, just curious if you had a reason to prefer one over the other.

Copy link
Collaborator Author

@bohou-aryn bohou-aryn Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the consideration is to have an easy support of polymorphism and scalability. One thing I think a long time is how to handle split objects nicely where I end up with a linked list. If you have a better idea, let me know.

I'm not aware of this nodetraverse before. I looked a little bit on the printtraverse, seems it still either depend on modifying each node for polymorphism or squeeze in a big function handling all various node types inside those before/after/visit function.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented on this above. The important thing about NodeTraverse is it lets you do pre and post traversal mutations of the tree. It also supported running a function once at the start of traversal of a tree.

If there is no intent to mutate the tree, then the order doesn't matter, but then I would expect the visitor to not be returning things.

Probably better to work this out in a design doc than code though.


class Visitor(ABC):
@abstractmethod
def visit_caption(self, caption: Caption):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the visit methods mutate the tree or are they purely read only? Do they return anything?

Copy link
Collaborator Author

@bohou-aryn bohou-aryn Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just assuming it could return anything considering the dynamic typing, no generic or subtying needed here at all?

@abstractmethod
def visit_section(self, section: Section):
pass

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like a visit_generic or something would be useful. Something like

@abstractmethod
def visit_generic(self, node: Node):
    pass

and then the visit_some_type methods can default to calling visit_generic. I'm just looking at these and thinking it will be an absolute pain to have to explicitly handle each type when I wanna implement this interface ya know? Could save a bunch of code duplication in the NaiveSemanticVisitor eg

return Category.Caption

def accept(self, visitor):
return visitor.visit_caption(self)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend moving the switch logic into the visitor class.
In particular, I would have:

class Caption(Leaf):
   def __init__(self, node_id, box, content, properties):
        super().__init__(node_id, box, content, properties, category=Category.Caption)
        
class Leaf(...):
   def category(self):
        return self.category

   def accept(self, visitor):
       return visitor.visit(self, self.category) # optional to just have self since the category is bound

class Visitor():
   def visit(self, node, category):
        if category == Category.Caption:
            self.visit_caption(node)
       elif ...

That way someone can override just the visit function if they want to handle the grouping differently rather than having to implement a whole bunch of functions.

You may also want to implement more flexible visiting. For plan nodes we needed pre-traversal, post-traversal, in-place mutation, and single call for the entire tree.
See plan_nodes.py:NodeTraverse. My guess is that if we're going to do mutations of the structure via traversal, we'll end up needing the same thing.

return visitor.visit_section(self)


class Visitor(ABC):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I commented on this above. The important thing about NodeTraverse is it lets you do pre and post traversal mutations of the tree. It also supported running a function once at the start of traversal of a tree.

If there is no intent to mutate the tree, then the order doesn't matter, but then I would expect the visitor to not be returning things.

Probably better to work this out in a design doc than code though.



class NaiveSemanticVisitor(Visitor):
def visit_caption(self, caption):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is exactly why I would have a single visit function, then it could just be:

def visit (self, node, type):
    assert type != Content.Picture, "unimplemented"
     if type == Content.Table:
          return node.content().to_csv()
    if type == Content.Text:
         contents = text.content() # why isn't there a text.all_content()?
         cur = text.continued()
         while cur:
             contents = contents.rstrip() + " " + cur.content()
             cur = cur.continued()

        return contents
    if type == Content.Group or type == Content.Section:
        contents = [child.accept(self) for child in group.children()]
        return " ".join(contents)
    return node.content()

This more than halves the lines of code, makes it clear which things are handled similarly, and likely does the right thing if a new node type is introduced.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for your feedback. I actually have different opinions on this.

  1. this does not align with a pre-existing design pattern
  2. Considering code scalability, I feel there is more requirement to extend new functions/transformations instead of a new node type.
  3. the visitor considers more about the inheritance, for the above naivesemanticvisitor example, it's possible to have requirement for a newer semanticvisitor which only modifies the function for handling table and leave all the other unchanged.

token_dict = {}
chunks = []

def post_traverse(node):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

traversal should be an operation on a node so you don't have to reimplement it.
I think you should be able to write:

def post_fn(node):
   semantic = node.accept(visitor)
   count = tokenizer.count(semantic)
   token_dict[node.node_id()] = (count, semantic) # I would have stored this in the node

def pre_traverse(node):
   count, semantic = token_dict[node.node_id()] # I would fetch this from the node
   if count <= tokenizer.limit():
       chunks.append(semantic)
       return None # abort recursion
   if isinstance(node, Internal):
       return node
   chunks.append(semantic)
   
self._root.traverse(post_fn = post_fn, mutating=False)
self._root.traverse(pre_fn = pre_fn, mutating=False)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants