IIP Roadmap Winter Spring 2020
The IIP project uses Trello, together with issue tracking in the code and data GitHub repositories, to plan and track work.
- Data Repository: https://github.com/Brown-University-Library/iip-texts
- Code Repository: https://github.com/Brown-University-Library/iip-production
- Trello Board (check with MS before sharing)
The search and browse interface and framework for IIP are stable at this point. The project is focusing its work on three major areas this semester.
- Continuing small additions and formatting adjustments to the current site, and regularization of the encoded inscriptions.
- Continuing work towards developing an interface for exploring word lists, lemmatized concordances and other types of linguistic information.
- In preparation for Gaia Lembi's departure, reviewing and supplementing the IIP encoding documentation to make sure that it is up to date.
The linguistic tools in IIP were initially developed by Luna McNulty, using NLTK, under the supervision of Ashley Champagne. This resulted in a first version of a word list interface and a doubletree ngram viewer. During spring and summer of 2019, Christian Casey worked on refining the process of tokenizing, tagging and lemmatizing the inscription texts. The group developed a data structure to represent each word with location, POS and lemmatization information from several passes with different parsing methods. We now have a workflow for all 4 languages, and a set of corrected, usable results for Latin. The plan for this spring is to use the results from the Latin parsing to create and test new word list interfaces that can be incorporated into the IIP website and serve as prototypes for the other 3 languages. This work is in an exploratory phase. As the components and time allocations become clearer, they will be tracked in Trello (overall tasks) and GitHub.
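As a rough illustration of the kind of per-word record described above, the sketch below stores each word's location and form alongside the POS and lemma proposed by each parsing pass, then groups the records into a lemmatized word list. It is not the project's actual data structure: every class, field and function name is hypothetical, the inscription IDs are placeholders, and treating the first analysis as the preferred one is an assumption made only for this example.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Analysis:
    """One parsing pass's result for a word (hypothetical field names)."""
    source: str  # label for the parsing pass, e.g. "pass-1"
    pos: str     # part-of-speech tag
    lemma: str   # dictionary form


@dataclass
class WordRecord:
    """A single word in an inscription, with results from several passes."""
    inscription_id: str  # placeholder ID, not a real IIP identifier
    position: int        # index of the word within the inscription text
    form: str            # inflected form as it appears in the text
    analyses: List[Analysis] = field(default_factory=list)

    def preferred_lemma(self) -> Optional[str]:
        # Assumption: the first analysis is the preferred (corrected) one.
        return self.analyses[0].lemma if self.analyses else None


def lemma_index(records: List[WordRecord]) -> Dict[str, List[Tuple[str, int, str]]]:
    """Map each lemma to the (inscription_id, position, form) of its occurrences."""
    index = defaultdict(list)
    for rec in records:
        lemma = rec.preferred_lemma()
        if lemma is not None:
            index[lemma].append((rec.inscription_id, rec.position, rec.form))
    return dict(index)


records = [
    WordRecord("demo0001", 0, "filius", [Analysis("pass-1", "NOUN", "filius")]),
    WordRecord("demo0001", 1, "regis", [Analysis("pass-1", "NOUN", "rex")]),
    WordRecord("demo0002", 0, "rex", [Analysis("pass-1", "NOUN", "rex"),
                                      Analysis("pass-2", "NOUN", "rex")]),
]
index = lemma_index(records)
for lemma in sorted(index):
    print(lemma, index[lemma])
```

Keeping every pass's analysis on the record, rather than only a final answer, would let a word list interface show where the parsing methods disagree.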
People
- Mina Rhee (student developer)
- Birkin Diana (supervision)
- Elli Mylonas, Gaia Lembi (supervision, proofreading, interface)
Workplan
- MR becomes familiar with IIP development framework. EM, GL provide examples of concordances. Prioritize which format should be the first to be developed.
- Concordance of lemmatized forms
- Concordance of inflected forms
- List of abbreviations
- Identify what features of the markup should be displayed in the concordance (e.g. unclear or supplied letters, regularized/corrected forms)
- Doubletree visualization
- MR develops concordance(s). The goal is a working Latin concordance component for IIP by end of term, with at least the lemmatized and the inflected lists of words.
As MR isn't familiar with the framework or the tasks yet, it isn't possible to scope the work with any more detail. She will be working about 8 hours a week, some of which may be on other projects.
- EM writes XSL scripts to put <w> elements around word units in the IIP texts. This will be based on Christian's regex-based tokenization code, and is needed to make it easier to tag new inscriptions or re-tag existing ones. This is not in the critical path for the interface work. Time estimate: about 30 hours.
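The idea of regex-based tokenization with `<w>` wrapping can be sketched as follows. This is a Python illustration rather than XSL, and it is neither the project's stylesheet nor Christian's tokenization code. It assumes plain text input with no nested TEI markup, and the function name and the word pattern are both hypothetical.

```python
import re
from xml.sax.saxutils import escape

# Assumption: a "word unit" is any run of Unicode word characters.
# Note that \w also matches digits and underscores.
WORD = re.compile(r"\w+")


def wrap_words(text: str) -> str:
    """Wrap each word in <w>...</w>, leaving spaces and punctuation outside."""
    out = []
    last = 0
    for match in WORD.finditer(text):
        # Copy the non-word text between tokens, XML-escaped.
        out.append(escape(text[last:match.start()]))
        out.append("<w>" + escape(match.group()) + "</w>")
        last = match.end()
    # Copy whatever follows the final word.
    out.append(escape(text[last:]))
    return "".join(out)


print(wrap_words("Marcus filius Gai"))
# prints: <w>Marcus</w> <w>filius</w> <w>Gai</w>
```

Real inscriptions contain editorial markup inside and across words (for example supplied or unclear letters), so a production version would have to work on the XML tree rather than on a flat string.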
-
Future Work for the linguistic interface
- Add Greek, Hebrew and Aramaic parsing and lemmatization. Greek parsing works partially, but adequate training sets do not seem to be available. We haven't found any easily available Hebrew or Aramaic POS taggers and lemmatizers. We are in communication with researchers in Israel who may share tools with us.
- Integrate the lemmatization and POS tagging into the index generation pass, so that when an inscription is added or modified, lexical information is also updated. This is how inscription metadata is already handled when a new or modified inscription is pushed to the master GitHub repository.
-
The IIP interface and search mechanism are well developed and stable. However, as new inscriptions are added, new display handling features become necessary. Additionally, as the markup has been refined and the interface formatting improved to handle new features, we have discovered bugs in the system. Some interface work has to happen at the level of the Django framework, but most of it can be managed by adding or modifying the CETEIcean CSS code. This work is being tracked in Trello and GitHub issues.
People
- Birkin Diana (developer)
- Gaia Lembi (proofreading, corrections)
- Elli Mylonas (proofreading, programmatic corrections to source, CSS, Solr index XSL)
Workplan
- Work through GitHub issues in the iip-texts repository to clean up source files.
- GL and EM. Try to hire a student for some very labor intensive and repetitious corrections in the Location metadata and text content.
- Add CSS to handle markup correctly (mostly documented in IIP-texts issues, some in IIP-production issues)
- Get help from Birkin in cases such as https://github.com/Brown-University-Library/iip-production/issues/55 where a bug is introduced at the framework level which presents as an inscription display problem.
- Create a test inscription which can contain the various types of editorial display we need to make testing easier.
- EM. about 5 hours
The existing encoding documentation is out of date. GL has been working with student encoders and can document a variety of features that were encountered recently, such as monograms and non-Unicode glyphs. She can also work through the issues that we have been correcting in the iip-texts repository, which represent inconsistencies in encoding of particular characters (e.g. punctuation, Greek number signs) or editorial features (e.g. different types of <gap> elements). Although it would be better for the documentation to be part of this wiki, it may make more sense to keep it in a Google Doc, as it will be easier for future managers and encoders to update as needed.
People
- Gaia Lembi (writing)
- Elli Mylonas (advising and consulting)
Workplan
- Go through the old documentation to clean it up, fill in obvious missing information and add new instructions.
- Add references to EpiDoc Guidelines (URLs)
- Add references to relevant GitHub issues if they give rationale for encoding. (URLs)
- Add references to IIP inscriptions that can serve as examples. (IIP ID)
- Schedule the semi-annual BDR ingestion of the IIP source files
- Review and update script that creates archival versions of the files. Update Version number. Test.
- Contact Joseph Rhoads to ingest the files.
Effort: EM about 5 hours. JR/BC need to schedule time
- Submit a JTEI paper based on the TEI2019 conference paper. Due March 5.
People: GL, Andrew Creamer, EM.
© Brown University Library, Program in Judaic Studies