Allow documents to be divided into chapters and sections #26

Open
atrigent opened this issue May 16, 2014 · 29 comments

Comments

@atrigent
Member

The old thefinalclub website had a concept of "works" that were divided into "chapters" that were subdivided into "sections". This allowed large works (books, plays, etc) to be used on the website, because only a certain section would need to be shown to the user at any given time. Showing an entire work on a single page would require a lot of data transfer and would be a strain on the user's browser.

Annotation studio only has a concept of "documents". When creating a document, annotation studio allows the user to specify the code for chapter navigation, but does not store chapters in a structured way and requires the user to write HTML code for the navigation. An entire document will be displayed on a single page.

We need a structured way to break works down into chapters and sections. This will probably require messing with the existing models and creating some new ones.

@AndrewMagliozzi
Member

If by "will probably require messing with the existing models" you meant *definitely*, then I am totally on board with this ticket.


@atrigent
Member Author

Here's some analysis of how the old site is structured. There are three tables involved:

CREATE TABLE `works` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `title` text NOT NULL,
  `author` text NOT NULL,
  `summary` text NOT NULL,
  `year` int(11) NOT NULL,
  `page_views` bigint(20) unsigned NOT NULL,
  `wordpress_url` text NOT NULL,
  `intro_essay` mediumtext NOT NULL,
  `created_on` datetime NOT NULL,
  PRIMARY KEY (`id`),
  FULLTEXT KEY `author` (`author`),
  FULLTEXT KEY `title` (`title`)
) ENGINE=MyISAM AUTO_INCREMENT=391 DEFAULT CHARSET=latin1;
CREATE TABLE `sections` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `work_id` int(10) unsigned NOT NULL,
  `order` int(10) unsigned NOT NULL,
  `name` text NOT NULL,
  PRIMARY KEY (`id`),
  KEY `work_id` (`work_id`)
) ENGINE=MyISAM AUTO_INCREMENT=19140 DEFAULT CHARSET=latin1;
CREATE TABLE `content` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `section_id` int(10) unsigned NOT NULL,
  `content` mediumtext NOT NULL,
  PRIMARY KEY (`id`),
  KEY `section_id` (`section_id`)
) ENGINE=MyISAM AUTO_INCREMENT=18400 DEFAULT CHARSET=latin1

Each row in works encapsulates a whole work, and includes metadata such as author, publication year, etc.

Each row in sections denotes a subdivision of a work. The work_id column denotes the section's parent work. Most, but not all, works have sections with meaningful order values, denoting the order of sections within a work. For some works, all section order values are 0. One interesting thing is that chapters are also stored in this table.

Each row in content contains the text of a section. The section_id column denotes which section this is the content for. The way to distinguish between chapters and chapter subdivisions in the sections table seems to be whether that section has any content.

@atrigent
Member Author

The tree structure of chapters and subchapters seems to be derived from the names of the sections. For example, in the following sections from Hamlet (work id 5):

id work_id order name
13 5 0 Act I
14 5 0 Act I, Scene I: Elsinore. A platform before the castle.
15 5 0 Act I, Scene II: A room of state in the castle.
16 5 0 Act I, Scene III: A room in Polonius' house.
17 5 0 Act I, Scene IV: The Platform.
18 5 0 Act I, Scene V: Another part of the platform.
19 5 0 Act II
20 5 0 Act II, Scene I: A room in Polonius' house
199 5 0 Act II, Scene II: A room in the castle.
202 5 0 Act III
204 5 0 Act III, Scene I: A room in the castle.
213 5 0 Act III, Scene II: A hall in the castle.
217 5 0 Act III, Scene III: A room in the castle.
224 5 0 Act III, Scene IV: The Queen's closet.
226 5 0 Act IV
229 5 0 Act IV , Scene I: A room in the castle.
233 5 0 Act IV , Scene II: Another room in the castle,
236 5 0 Act IV , Scene III: Another room in the castle.
240 5 0 Act IV , Scene IV: A plain in Denmark.
245 5 0 Act IV , Scene V: Elsinore. A room in the castle.
250 5 0 Act IV , Scene VI: Another room in the castle.
254 5 0 Act IV , Scene VII: Another room in the castle.
257 5 0 Act V
260 5 0 Act V , Scene I: A churchyard.
263 5 0 Act V , Scene II: A hall in the castle.

Act I, Scene I: Elsinore. A platform before the castle. is under Act I because it is prefixed with Act I and then a comma. The website seems to only visually indent the first sub-level of the tree.
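
The prefix rule described above can be sketched in plain Ruby. This is only an illustration of the naming heuristic, not code from the old site: the helper name `parent_name_for` is hypothetical, and tolerating whitespace before the comma is an assumption made because some rows read "Act IV , Scene I".

```ruby
# Hypothetical helper: guess a section's parent from its name, using the
# "parent name, then a comma" convention seen in the Hamlet rows above.
def parent_name_for(name, all_names)
  candidates = all_names.select do |other|
    next false if other == name
    # The child must start with the parent's name, followed by optional
    # whitespace and a comma ("Act IV , Scene I" has a space before the comma).
    name.start_with?(other) && name[other.length..].match?(/\A\s*,/)
  end
  # Prefer the longest matching prefix, i.e. the closest ancestor.
  candidates.max_by(&:length)
end

names = [
  "Act I",
  "Act I, Scene I: Elsinore. A platform before the castle.",
  "Act IV",
  "Act IV , Scene I: A room in the castle."
]

# Map every section name to its guessed parent name (nil for top-level rows).
tree = names.to_h { |n| [n, parent_name_for(n, names)] }
tree.each { |child, parent| puts "#{child.inspect} -> #{parent.inspect}" }
```

Requiring the comma right after the prefix is what keeps "Act IV" from being filed under "Act I", even though one name is a textual prefix of the other.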

@stephskardal

https://github.com/stefankroes/ancestry - a gem I've used a bit and would consider as a possibility here.

@stephskardal

I don't feel I fully communicated what I was thinking yesterday in terms of a potential proposed change to the documents structure, so I want to communicate that better to get things ironed out.

What I propose is that section represents all the sections for a document. The structure would look something like this:

section (id: 1, ancestry: nil)
-- section (id: 2, ancestry: 1, sort: 1)
---- section (id: 3, ancestry: 1/2, sort: 1)
---- section (id: 4, ancestry: 1/2, sort: 2)
-- section (id: 5, ancestry: 1, sort: 2)
---- section (id: 6, ancestry: 1/5, sort: 1)
---- section (id: 7, ancestry: 1/5, sort: 2)

Then, instead of a documents model, introduce a Metadata model (subject to different name), which has many of the fields document has, plus a section_id field, as it would belong to a section. In my example above of a single document, there would only be one metadata entry that section_id is equal to 1 because it only belongs to that top item.

The benefit of my approach, IMO, is twofold:
a) Only one model (section) fully represents the hierarchy of content
b) We can leverage an activeadmin nestable plugin to provide the admin interface for setting the hierarchy for documents (it wouldn't work with Document + Section model)

Another benefit of this approach is that, on the single-document view page, we could call children on that top-level section to get its immediate children (i.e. section > section).
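
To make the path-string idea concrete, here is a minimal plain-Ruby sketch of how the `1/2`-style ancestry values in the example above determine children and descendants. It deliberately does not use the ancestry gem; `SectionRow`, `child_path`, `children_of` and `descendants_of` are hypothetical names for illustration, and the rows simply mirror the seven-section example.

```ruby
# Plain-Ruby stand-in for rows of a hypothetical sections table; this does NOT
# use the ancestry gem, it only mimics the "1/2" path strings shown above.
SectionRow = Struct.new(:id, :ancestry, :sort)

SECTION_ROWS = [
  SectionRow.new(1, nil, nil),
  SectionRow.new(2, "1", 1),
  SectionRow.new(3, "1/2", 1),
  SectionRow.new(4, "1/2", 2),
  SectionRow.new(5, "1", 2),
  SectionRow.new(6, "1/5", 1),
  SectionRow.new(7, "1/5", 2)
]

# The path a direct child of `section` would carry in its ancestry column.
def child_path(section)
  [section.ancestry, section.id].compact.join("/")
end

# Immediate children, in sort order.
def children_of(section, all = SECTION_ROWS)
  all.select { |s| s.ancestry == child_path(section) }.sort_by(&:sort)
end

# Every descendant: the ancestry equals the child path or extends it.
def descendants_of(section, all = SECTION_ROWS)
  path = child_path(section)
  all.select { |s| s.ancestry == path || s.ancestry.to_s.start_with?("#{path}/") }
end

root = SECTION_ROWS.first
p children_of(root).map(&:id)
p descendants_of(SECTION_ROWS[1]).map(&:id)
```

In this sketch, the immediate children of the root are the two second-level sections, which is the lookup the single-document view would need.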

Thoughts?

@stephskardal

I suppose we could also keep the existing documents data model as is and simply add a section_id to it so that it would be the acting "Metadata" model, and then we would create a root section instance for each document instance.

@atrigent
Member Author

atrigent commented Jun 5, 2014

I definitely agree that it makes the most sense for the root node of the ancestry tree to represent the entire work. I just did a quick google about having multiple models in an ancestry tree, and came up with this: stefankroes/ancestry#155 question on github. The suggestion given was to use single-table inheritance. Since all nodes would be in the same table, their ids would all come from the same "id-space", and ancestry would work as expected.

Here's a proposed class hierarchy. Let me know if I'm overthinking this. Names are subject to change, of course.

DocumentNode - id, title, ancestry
    Work - metadata about an entire work (author, etc)
    Section - sort number
        ContentSection - text content

So the root of the tree would be a Work, sections which include other sections, such as a chapter, would be a Section, and a section which contains content would be a ContentSection. I'm not sure if there's a way to enforce certain invariants, such as that Work has no parents, ContentSection has no children, etc. Perhaps this could be done with validations?
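
As a rough illustration of the validation idea, the two invariants named above (a Work has no parent, a ContentSection has no children) could be expressed as per-class checks. The sketch below is plain Ruby, not ActiveRecord: the `errors` and `valid?` methods only imitate the shape of Rails validations, and the constructor signature is an assumption made for the example.

```ruby
# Plain-Ruby sketch of the proposed hierarchy; in Rails these would be
# ActiveRecord models and the checks would live in validations.
class DocumentNode
  attr_reader :title, :parent, :children

  def initialize(title, parent: nil)
    @title = title
    @parent = parent
    @children = []
    parent.children << self if parent
  end

  # Subclasses override this to add their own structural rules.
  def errors
    []
  end

  def valid?
    errors.empty?
  end
end

class Work < DocumentNode
  def errors
    parent ? ["Work must be the root of the tree"] : []
  end
end

class Section < DocumentNode
end

class ContentSection < Section
  def errors
    children.empty? ? [] : ["ContentSection must not have children"]
  end
end

hamlet = Work.new("Hamlet")
act1 = Section.new("Act I", parent: hamlet)
scene1 = ContentSection.new("Act I, Scene I", parent: act1)
Section.new("Misplaced", parent: scene1) # deliberately violates the leaf rule

puts [hamlet, act1].all?(&:valid?)
puts scene1.errors.inspect
```

The checks run on demand rather than blocking construction, which is roughly how model validations behave: an invalid tree can be built in memory but would be rejected when saved.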

@stephskardal

I don't see how that github question / answer / reference to STI helps solve what we are trying to do. The reference to STI is implying we are saving a type to specify what type the ancestry is, but I don't see how has_ancestry accommodates that type based on what I know of has_ancestry. Your proposed data model has 4 different models, which seems overly complex to me. I understand the separation between Section & ContentSection in that you are proposing that the single responsibility of the Section table be to store the ancestry, while all the other data for the content is stored elsewhere, but IMO, DocumentNode & Work don't need to be separate models.

I would go back to the following data model, to keep it most consistent with what we have now:
Models (2): Document, Section.

  • Document contains all the meta data for a document.
  • Section contains the ancestry, title, and content for each section.
  • Upon creation of Document, a root Section instance is created. All children sections / hierarchy are tied to that Section.

I think there are really multiple routes to take here and there is no "best" answer, ie pros & cons to various approaches. I prefer the more simple approach as I've described because of how I know this will integrate with an admin interface and be manageable.

@atrigent
Member Author

atrigent commented Jun 5, 2014

My understanding is that the type column of a row would specify which model class that row is instantiated as.

I made DocumentNode and Work separate so that only the root node would contain the metadata for the overall work.

@stephskardal

Ok, sorry, I misspoke about STI. Here's a quick blog post I referenced: http://maulanaruby.wordpress.com/2007/02/17/sti-vs-polymorphic-association/. Even if STI is used for has_ancestry, I'm still unclear as to how to use has_ancestry with it.

In your model, how is STI applied? I'm still not seeing it.

@stephskardal

The approach / question I think I'm trying to get at here, is how important is it to utilize an existing Admin framework (activeadmin). The more complex, STI / polymorphism included, the more custom admin interface will need to be built out b/c I do not expect it to support a complex data model. In my mind, I see a huge value in leveraging an existing admin interface and erring on the side of simplicity.

@atrigent
Member Author

atrigent commented Jun 5, 2014

Ok, perhaps I'm making some assumptions here. Here's my understanding of how this would work. Let me know if I'm off-base.

Essentially, the methods that ancestry adds to a model just do various parsing and querying based on the ancestry paths. From these, it figures out an id or a list of ids that the user is interested in. Then I'm assuming that it would ask activerecord to fetch these ids from the database, and I'm assuming that activerecord would then instantiate the correct model based on the type field. ancestry would not have to have any idea what the type field means or which actual models are used for any of the objects it is fetching.
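
The assumption in the paragraph above can be illustrated without Rails. The sketch below is a plain-Ruby imitation of type-based instantiation, not the actual behavior of ActiveRecord or the ancestry gem: `Node`, `WorkNode`, `ChapterNode`, `find_node` and `parent_of` are hypothetical names, and the in-memory `ROWS` array stands in for a single table with a `type` column.

```ruby
# Hypothetical classes standing in for STI models that share one table.
class Node
  attr_reader :id, :ancestry

  def initialize(row)
    @id = row[:id]
    @ancestry = row[:ancestry]
  end
end

class WorkNode < Node; end
class ChapterNode < Node; end

# Rows as they might sit in a single table with a `type` column.
ROWS = [
  { id: 1, type: "WorkNode", ancestry: nil },
  { id: 2, type: "ChapterNode", ancestry: "1" }
]

# Imitates the assumed lookup: find the row by id in the shared table, then
# build an instance of whichever class the `type` column names.
def find_node(id)
  row = ROWS.find { |r| r[:id] == id }
  return nil unless row
  Object.const_get(row[:type]).new(row)
end

# The tree logic only parses ids out of the path; it never inspects `type`.
def parent_of(node)
  return nil if node.ancestry.nil?
  find_node(node.ancestry.split("/").last.to_i)
end

chapter = find_node(2)
puts chapter.class
puts parent_of(chapter).class
```

The point of the sketch is that the path-walking code and the class dispatch are independent, which is what would let one tree contain rows of several model classes if the real libraries behave as assumed.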

I don't think the admin interface would look much different with single-table inheritance. It doesn't look like the admin interface handles ancestry in any special way at the moment - it just treats the ancestry field as a text string. What would the admin interface not be able to handle that it handles now?

@stephskardal

Here's a nestable plugin for activeadmin: https://github.com/nebirhos/activeadmin-sortable-tree.

Those are all correct assumptions, except that I don't believe has_ancestry provides the ability for you to specify which type of model it's pulling the instance from, meaning has_ancestry always looks in the same model for the id. This tells me that we cannot have a Section instance reference any other type of model as the root node.

@atrigent
Member Author

atrigent commented Jun 5, 2014

I was thinking that you would put the has_ancestry in the DocumentNode and then it would always do DocumentNode.find(num), for example, and activerecord would then instantiate the correct model.

Where I'm coming from with this more complex proposal is that I tend to like for the storage of the data to reflect its structure. This allows me to be more sure of the integrity of the data. The downside is, of course, the complexity, and also that changing the structure can be difficult if that becomes necessary.

@btbonval
Member

btbonval commented Jun 5, 2014

@atrigent Unfortunately we're working with Rails which doesn't believe in database logic. It tries to pull as much logic into the application as possible. In theory it allows them to swap out database backends more easily instead of relying on a universal ORM that also understands database schema.

Also unfortunately, we're working with Rails and we want to follow convention for future folks. If the Gem manages the database structure, then we should let it, even if it does so poorly.

If you've ever worked with Drupal, you'll be quite happy with what Rails does manage.

@btbonval
Member

btbonval commented Jun 5, 2014

So it sounds like we do what @stephskardal was suggesting, and use https://github.com/mbleigh/acts-as-taggable-on to add any additional metadata (in an integrity-checked but flexible way) if necessary to support the concerns of @atrigent

@stephskardal

Some updates / proposed data model:
Models:

  • Node, or whatever, with id, ancestry fields, contains ancestry data only
  • Document (as is now), which contains root level meta data, minimal changes to existing Document model, and a node_id that it belongs_to
  • ContentSection, with title, content, sort fields, and a node_id that it belongs to

ActiveRecord Callback: Upon creation of Document, a root Node is created.

Validations needed:

  • Document.node_id must point to a root node
  • ContentSection.node_id must point to a non-root node

As @btbonval suggests, we can use acts-as-taggable-on for additional unstructured metadata.

@atrigent
Member Author

atrigent commented Jun 6, 2014

I used the following lovely query:

select count(*),
       (select count(*) from content where section_id = sections.id) as num_contents
from sections
group by num_contents

to verify that the relationship between sections and contents is 1:1. There's probably a nicer looking query for doing this, but I think this works, so oh well. The result was:

count(*) num_contents
246 0
12423 1

This shows that a section only ever has zero or one contents attached to it. If it has zero it represents a section with subsections, otherwise it is a section with content. As expected, there are more sections with content than sections with subsections.

@btbonval
Member

btbonval commented Jun 6, 2014

That's a good result, @atrigent , but remember we were thinking that chapters, sections, and subsections might have all been handled in title.

Do we want to perform a more accurate reflection of the document structure or simply mirror what the old database did? I'd prefer we do our best to reflect the true document structure, but then again, that means we'd have to figure out if there is consistency in titling and it means we'd have to hand check a whole lot more.

@atrigent
Member Author

atrigent commented Jun 6, 2014

I'm looking into that now.

@atrigent
Member Author

Oh wow, ok. Apparently this table exists:

CREATE TABLE `section_parents` (
  `parent_id` int(10) unsigned NOT NULL,
  `child_id` int(10) unsigned NOT NULL,
  KEY `parent_id` (`parent_id`,`child_id`),
  KEY `child_id` (`child_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

I'm not sure how I managed to not notice this before. Looking into its properties now.

@btbonval
Member

Easy to miss. There are a lot of tables and a lot of PHP files :(

That appears to be the relation I was looking for when we were discussing
the tables you pasted into the ticket. Good news, generally :)


@atrigent
Member Author

As you might expect, every section has 0 or 1 parents:

select count(*),
       (select count(*) from section_parents where child_id = sections.id) as num_parents
from sections
group by num_parents;

count(*) num_parents
2747 0
9922 1

@btbonval
Member

That makes sense. What about layers?

SELECT count(*)
FROM section_parents AS grandparent, section_parents AS parent
WHERE grandparent.child_id = parent.parent_id;

Let's see if there is any case of a hierarchical nesting, or if it's all just parent/child.

@atrigent
Member Author

2341 rows, which seem to be spread out over 38 works.

@atrigent
Member Author

There are some interesting cases here - for example, it looks like some subsection names are not in fact prefixed by their parent's names. Here's an example: http://www.thefinalclub.org/work-overview.php?work_id=29 . You can't tell from this page, but if you click on some of the sections it will show you their full names.

@btbonval
Member

Yeah, it seems like the name doesn't imply anything. The structure is entirely contained in the section_parents table.
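
Since each section has at most one parent, the (parent_id, child_id) pairs are enough to recover both nesting depth and a slash-separated ancestor path like the one proposed earlier in this thread. The following is a minimal plain-Ruby sketch under stated assumptions: the ids are made up rather than taken from the real table, `depth` and `ancestry_path` are hypothetical helper names, and the data is assumed to contain no cycles.

```ruby
# Made-up (parent_id, child_id) pairs in the shape of section_parents rows.
PAIRS = [[1, 2], [1, 3], [2, 4], [5, 6]]

# Every section has at most one parent, so a child => parent hash is enough.
PARENT_OF = PAIRS.to_h { |parent_id, child_id| [child_id, parent_id] }

# Walk up the parent links; a top-level section has depth 0.
# Assumes the links contain no cycles.
def depth(section_id)
  d = 0
  while (section_id = PARENT_OF[section_id])
    d += 1
  end
  d
end

# Build the slash-separated path of ancestor ids, root first (nil for roots).
def ancestry_path(section_id)
  chain = []
  while (section_id = PARENT_OF[section_id])
    chain.unshift(section_id)
  end
  chain.empty? ? nil : chain.join("/")
end

p depth(4)
p ancestry_path(4)
p ancestry_path(1)
```

A migration script could apply the same walk to every section id to populate an ancestry-style column, though that step is not something the thread has settled on.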

@atrigent
Member Author

In the interest of verifying basic invariants: every pair of sections linked by the section_parents table belongs to the same work:

select count(*)
from section_parents,
     sections as parent,
     sections as child
where section_parents.parent_id = parent.id and
      section_parents.child_id = child.id and
      parent.work_id != child.work_id;

count(*)
0

@btbonval
Member

I would not have thought to check that, but that's good science. It looks
like works are atomically divisible in the current PHP database.

