Talks/2020-02-darmstadt.html

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:t="http://www.tei-c.org/ns/1.0" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta><title>Modelling meaning: the role of data in the humanities </title><meta name="generator" content="Generated by TEISLIDY stylesheet"></meta><script src="https://www.w3.org/Talks/Tools/Slidy/slidy.js" type="text/javascript"></script><link rel="stylesheet" type="text/css" media="screen, projection" href="https://www.w3.org/Talks/Tools/Slidy/show.css"></link><link href="../css/egXMLhandling.css" rel="stylesheet" type="text/css"></link><link href="../css/tei.css" rel="stylesheet" type="text/css"></link></head><body class="simple" id="TOP"><div class="slide cover"><img src="media/logo.png" width="40%" style="float:left" alt="[Put logo here]" class="cover"></img><br clear="all"></br><h1>Modelling meaning: the role of data in the humanities </h1><h3 class="sub">He who sees the Infinite in all things sees God. He who sees the Ratio
          only, sees himself only</h3><p>Lou Burnard</p></div><div class="slide"><h2>The brief</h2><ul><li class="item">What is the status of data in the humanities and social science research of our time? </li><li class="item">How has data, its encoding and markup, changed the way discourse studies perceive their subject? </li><li class="item">Does data-driven discourse research change its role in society? </li><li class="item">How does data impact scientific understanding and what is its role in epistemological processes? </li><li class="item">What is the role of data in field work between experience, encounter, and interaction? </li><li class="item">What ethical implications arise in handling research data? </li><li class="item">In what direction will (and should) discourse research develop and what role will data play in it? </li></ul><p>(Sorry, I won't answer all these questions)</p></div><div class="slide"><h2>The status of data in current SHS research</h2><ul><li class="item">Data is omnipresent, but not entirely omniscient</li><li class="item">Some disciplines are almost entirely data-dependent (e.g. corpus linguistics, stylometrics)</li><li class="item">In others (e.g. literary studies) data-dependence remains controversial</li><li class="item">The massive expansion of data availability has led to a recognition of the centrality of data modelling</li><li class="item">... but there are two kinds of modelling: <ul><li class="item">in the traditional humanities, a model is a means of abstraction and a set of categories: a <em>reductive</em> process</li><li class="item">in the social sciences, a model is a tool for prediction and generalisation: an <em>analytic</em> process</li></ul></li></ul><p class="box">(see e.g. 
Flanders and Jannidis 2019)</p></div><div class="slide"><h2>Encoding data for discourse studies</h2><ul><li class="item">(CAVEAT: I am not a discourse studies person)</li><li class="item">discourse components cross multiple levels</li></ul><p>Which level/s of description do we favour? </p><div class="figure"><img src="media/oral-annot.png" alt="" class="graphic"></img></div></div><div class="slide"><h2>Multiple levels can coexist in XML</h2><div class="figure"><img src="media/steven-xml.png" alt="" class="graphic"></img></div></div><div class="slide"><div class="frame"><div class="col"><h2>But in practice, linguists seem to prefer fairly simple -- reductive -- data categorisations </h2><p>A choice must be made...</p><div class="figure"><img src="media/spokenBits.png" alt="" class="graphic"></img></div></div><div class="col"><div class="figure"><img src="media/dartExample.png" alt="" class="graphic" style=" height:89%;"></img></div></div></div></div><div class="slide"><h2>Is there any such thing as a "pure" transcription?</h2><p>A language corpus consists of samples of authentic language productions ... </p><ul><li class="item">selected according to explicit principles, for specific goals</li><li class="item">represented in a digital form</li><li class="item">generally enriched with metadata and annotation beyond "pure" transcription </li></ul><p class="box">Can there be any re-presentation without interpretation? </p></div><div class="slide"><h2>Annotation: necessary evil or fundamental ?</h2><p class="box"><span class="q">‘Annotation ... is anathema to corpus-driven linguists.’</span> (Aarts, 2002)</p><p class="box"><span class="q">‘The interspersing of tags in a language text is a perilous activity, because the text thereby loses integrity.’</span> (Sinclair, 2004)</p><p class="box"><span class="q">‘… the categories used to annotate a corpus are typically determined before any corpus analysis is carried out, which in turn tends to limit ... 
the kind of question that usually is asked.’</span> (Hunston, 2002)</p><ul><li class="item">transcription represents the transcriber's understanding of the source</li><li class="item">encoding/annotation represents the annotator's intuitions about it, in a codified form</li><li class="item">how are these different?</li></ul><p class="box"><span class="q">‘... all encoding interprets, all encoding mediates. There is no 'pure' reading experience to sully. We don't carry messages, we reproduce them –– a very different kind of involvement. We are not neutral; by encoding a written text we become part of the communicative act it represents. ’</span> (Caton 2000)</p></div><div class="slide"><h2>A naive realist's manifesto</h2><div class="figure"><img src="media/model.png" alt="" class="graphic"></img></div><p>How do we keep the virtuous hermeneutic circle turning? </p><p class="box">Modelling matters</p></div><div class="slide"><h2>How did we get here from there?</h2><p>Let's (briefly) go back to the unfamiliar world of the mid-1980s... </p><ul class="pause"><li class="item">the world wide web did not exist</li><li class="item">the tunnel beneath the English Channel was still being built</li><li class="item">a state called the Soviet Union had just launched a space station called Mir</li><li class="item">serious computing was done on mainframes </li><li class="item">the world was managing quite nicely without the DVD, the mobile phone, cable tv, or Microsoft Word</li></ul></div><div class="slide"><h2>...but also a familiar one</h2><ul><li class="item">corpus linguistics and <span class="q">‘artificial intelligence’</span> had created a demand for large scale textual resources in academia and beyond</li><li class="item">advances in text processing were beginning to affect lexicography and document management systems (e.g. 
TeX, Scribe, (S)GML ...)</li><li class="item">the Internet existed for academics and for the military; theories about how to use it <span class="q">‘hypertextually’</span> abounded</li><li class="item">books, articles, and even courses in something called "Computing in the Humanities" were beginning to appear</li></ul></div><div class="slide"><h2>Modelling the data vs modelling the text</h2><p>By the end of the 1970s, methods variously called <span class="q">‘data modelling’</span>, <span class="q">‘conceptual analysis’</span>, <span class="q">‘database design’</span> vel sim. had become common practice.</p><ul><li class="item">remember: a centralised mainframe world dominated by IBM</li><li class="item">spread of office automation and consequent data integration</li><li class="item">ANSI/SPARC three-level model</li></ul><div class="figure"><img src="media/ansi-sparc.png" alt="" class="graphic"></img></div></div><div class="slide"><h2>An inherently reductive process</h2><div class="figure"><img src="media/ansi-sparc-2.png" alt="" class="graphic"></img></div><p class="box">how applicable are such methods to the complexity of humanities data sources?</p></div><div class="slide"><h2>The 1980s were a period of technological enthusiasm</h2><ul><li class="item">Digital methods and digital resources, despite their perceived strangeness, were increasingly evident in the Humanities </li><li class="item">There was some public funding of infrastructural activities, both at national and European levels: in the UK, for example, the <span class="titlem">Computers in Teaching Initiative</span> and the <span class="titlem">Arts and Humanities Data Service</span></li><li class="item">Something radically new, or just an update? 
</li><li class="item">Humanities Computing (aka Digital Humanities) gets a foothold by establishing courses </li><li class="item">Thaller (in 1989) challenges advocates of <span class="q">‘Humanities Computing’</span> to define its underlying theory</li><li class="item">Unsworth and others (by 2002) start using the phrase <span class="q">‘scholarly primitives’</span> to characterise a core set of procedures sustaining something called <span class="q">‘Digital Humanities’</span></li></ul><div class="figure"><img src="media/slide20.png" alt="" class="graphic" style=" height:30%;"></img></div></div><div class="slide"><h2>Where did these digital methods originally thrive?</h2><ul><li class="item">corpus linguistics</li><li class="item">authorship and stylometry</li><li class="item">historical data</li></ul></div><div class="slide"><h2>Corpus Linguistics: searching for meaning </h2><p>How do we identify the components of a discourse which give it meaning?</p><ul><li class="item">meaning is usage: <span class="q">‘Die Bedeutung eines Wortes ist sein Gebrauch in der Sprache’</span> (‘The meaning of a word is its use in the language’) (Wittgenstein, 1953)</li><li class="item">meaning is collocation: <span class="q">‘You shall know a word by the company it keeps’</span> (Firth, 1957)</li></ul><p class="box">the text is the data</p></div><div class="slide"><h2>Stylometrickery</h2><ul><li class="item">text as a bag of words</li><li class="item">statistical analysis of word frequencies to determine authorship and quantify style</li><li class="item">from T C Mendenhall (1887) to J F Burrows (1987)</li><li class="item">... 
now reborn as distant reading, culturomics </li></ul><p class="box">the text is the data</p></div><div class="slide"><h2>The re-invention of <span class="foreign">Quellenkritik</span></h2><p class="box"><span class="q">‘History that is not quantifiable cannot claim to be scientific’</span> (Le Roy Ladurie 1972)</p><ul><li class="item">In the UK, a series of <span class="hi">History and Computing</span> (1986-1990) conferences showed historians already using commercial DBMS, data analysis tools developed for survey analysis, "personal database systems" ...</li><li class="item">In France, J-P Genet and others influenced by the <span class="hi">Annales</span> school proposed a programme of digitization for historical source records</li><li class="item">Further pursued by Manfred Thaller with the program <em>kleio</em> (1982) -- a tool for transcribing and analysing (extracts from) historical sources, which included annotation of their content/significance</li></ul><p class="box">the data is extracted from the text</p></div><div class="slide"><h2>How should we model textual data?</h2><div class="figure"><img src="media/txttrin.png" alt="The Textual Trinity (Burnard 1987)" class="graphic" style=" height:70%;"></img><h2>The Textual Trinity (Burnard 1987)</h2></div><blockquote class="quote"><p>In interpreting text, the trained human brain operates quite successfully on three distinct levels; not surprisingly, three distinct types of computer software have evolved to mimic these capabilities.</p></blockquote></div><div class="slide"><h2>Text is little boxes</h2><div class="figure"><img src="media/texQuote.png" alt="(Preliminary description of TEX: D Knuth, May 13, 1977)" class="graphic" style=" width:80%;"></img><h2>(Preliminary description of TEX: D Knuth, May 13, 1977)</h2></div><ul><li class="item"><span class="hi">TeX</span> was developed by Donald Knuth, a Stanford mathematician, to produce high-quality typeset output from annotated text</li><li class="item">Knuth 
also developed the associated idea of <span class="hi">literate programming</span>: that software and its documentation should be written and maintained as an integrated whole</li><li class="item">TeX is still widely used, particularly in the academic community: it is open source and there are several implementations</li></ul></div><div class="slide"><h2>No, text is data</h2><div class="figure"><img src="media/germanDict.jpg" alt="" class="graphic" style=" height:90%;"></img></div></div><div class="slide"><h2>Database orthodoxy</h2><ul><li class="item">identify important entities which exist in the real world and the relationships amongst them</li><li class="item">formally define a conceptual model of that universe of discourse</li><li class="item">map the conceptual model to a storage model (network, relational, whatever...)</li></ul><p class="box">But what are the "important entities" we might wish to identify in a textual resource? </p></div><div class="slide"><div class="frame"><div class="col"><h2>Assize court records, for example</h2><div class="figure"><img src="media/1671assizes.jpg" alt="" class="graphic" style=" width:80%;"></img></div></div><div class="col"><div class="figure"><img src="media/recogModel.png" alt="(An application of CODASYL techniques to research in the humanities, 1980)" class="graphic" style=" width:90%;"></img><h2>(<span class="titlem">An application of CODASYL techniques to research in the humanities</span>, 1980) </h2></div></div></div></div><div class="slide"><div class="frame"><div class="col"><h2>What is a text (really)?</h2><ul><li class="item">content: the components (words, images, etc.) which make up a document </li><li class="item">structure: the organization and inter-relationship of those components </li><li class="item">presentation: how a document looks and what processes are applied to it </li><li class="item">context: how the document was produced, circulated, processed, and understood</li><li class="item">... 
and possibly many other readings</li></ul></div><div class="col"><p>For example: </p><div class="figure"><img src="media/19790809_002v.jpg" alt="" class="graphic" style=" height:60%;"></img></div></div></div></div><div class="slide"><h2>Separating content, structure, presentation, and context means : </h2><ul><li class="item">the content can be re-used </li><li class="item">the structure can be formally validated </li><li class="item">the presentation can be customized for <ul><li class="item">different media </li><li class="item">different audiences</li></ul></li><li class="item">the context can be analysed</li><li class="item">in short, the information can be uncoupled from its processing</li></ul><p>This is not a new idea! But is it a good one? </p></div><div class="slide"><h2>Some ambitious claims ensued </h2><div class="figure"><img src="media/xml-slide.png" alt="(Presentation for Oxford IT Support Staff Conference, 1994)" class="graphic"></img><h2>(Presentation for Oxford IT Support Staff Conference, 1994)</h2></div></div><div class="slide"><h2>A digital text may be ... </h2><p class="box">a <span class="q">‘substitute’</span> (surrogate) simply representing the appearance of an existing document</p><div class="figure"><img src="media/graves-2.png" alt="" class="graphic" style=" height:80%;"></img></div></div><div class="slide"><h2>... 
or it may be</h2><p class="box">a representation of its linguistic content and structure, with additional annotations about its meaning and context.</p><div class="figure"><img src="media/graves-1.png" alt="" class="graphic" style=" height:80%;"></img></div></div><div class="slide"><h2>Functions of encoding</h2><ul><li class="item">It makes explicit to a processor <em>how</em> something should be processed.</li><li class="item">In the past, ‘markup’ was what told a typesetter how to deal with a manuscript</li><li class="item">Nowadays, it is what tells a computer program how to deal with a stream of textual data.</li></ul><p class="box">... and thus expresses the encoder's view of what <span class="hi">matters</span> in this document, determining how it can subsequently be analysed.</p></div><div class="slide"><h2>Which textual data matters?</h2><ul class="pause"><li class="item">the shape of the letters and their layout?</li><li class="item">the presumed creator of the writing?</li><li class="item">the (presumed) intentions of the creator? </li><li class="item">the stories we read into the writing? </li></ul><p class="box">A ‘document’ is something that exists in the world, which we can <span class="term">digitize</span>.</p><p class="box">A ‘text’ is an abstraction, created by or for a community of readers, which we can <span class="term">encode</span>.</p></div><div class="slide"><h2>The document as <span class="q">‘Text-Bearing Object’</span> (TBO)</h2><p class="box"><span style="font-style:italic">Materia appetit formam ut virum foemina</span> (matter desires form as woman desires man)</p><ul><li class="item">Traditionally, we distinguish form and content</li><li class="item">In the same way, we might think of an inscription or a manuscript or even a transcribed recording as the bearer or container or form instantiating an abstract notion -- a text</li><li class="item">but quite a lot of the text is actually in our heads</li></ul><p class="box">And don't forget ... 
digital texts are also TBOs!</p></div><div class="slide"><h2>Markup is a scholarly activity</h2><ul><li class="item">The application of markup to a document is an intellectual activity</li><li class="item">Deciding exactly what markup to apply and why is much the same as editing a text </li><li class="item">Markup is rarely neutral, objective, or deterministic: interpretation is needed</li><li class="item">Because it obliges us to confront difficult ontological questions, markup can be considered a research activity in itself</li><li class="item">Good textual encoding is never as easy or quick as people would believe -- do things better, not necessarily quicker</li><li class="item">The markup scheme used for a project should result from a detailed analysis of the properties of the objects the project aims to use or create and of their historical/social context </li></ul><p class="box">... though considerations of scale may have an effect ... </p></div><div class="slide"><h2>Because ... </h2><p class="box">Good markup (like good scholarship) is expensive</p><div class="figure"><h2>Big data vs. curated data</h2><img src="media/dataCartoon.png" alt="Big data vs. curated data" class="graphic"></img></div></div><div class="slide"><h2>Choices (1) </h2><p>Consider this kind of object: </p><div class="figure"><img src="media/beowulf-ms.png" alt="BL Ms Cotton Vitellius A xv, fol. 129r" class="graphic" style=" height:80%;"></img><h2>BL Ms Cotton Vitellius A xv, fol. 129r</h2></div></div><div class="slide"><h2>Some typical varieties of curated markup</h2><div class="p"><div id="index.xml-egXML-d30e564" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;hi&nbsp;<span class="attribute">rend</span>="<span class="attributevalue">dropcap</span>"&gt;</span>H<span class="element">&lt;/hi&gt;</span>
<span class="element">&lt;g&nbsp;<span class="attribute">ref</span>="<span class="attributevalue">#wynn</span>"&gt;</span>W<span class="element">&lt;/g&gt;</span>ÆT WE GARDE
<span class="element">&lt;lb/&gt;</span>na in gear-dagum þeod-cyninga
<span class="element">&lt;lb/&gt;</span>þrym gefrunon, hu ða æþelingas
<span class="element">&lt;lb/&gt;</span>ellen fremedon. oft scyld scefing sceaþe
<span class="element">&lt;add&gt;</span>na<span class="element">&lt;/add&gt;</span>
<span class="element">&lt;lb/&gt;</span>þreatum, moneg<span class="element">&lt;expan&gt;</span>um<span class="element">&lt;/expan&gt;</span> mægþum meodo-setl
<span class="element">&lt;add&gt;</span>a<span class="element">&lt;/add&gt;</span>
<span class="element">&lt;lb/&gt;</span>of<span class="element">&lt;damage&gt;</span>
&nbsp;<span class="element">&lt;desc&gt;</span>blot<span class="element">&lt;/desc&gt;</span>
<span class="element">&lt;/damage&gt;</span>teah ...</div> <div id="index.xml-egXML-d30e593" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;lg&gt;</span>
&nbsp;<span class="element">&lt;l&gt;</span>Hwæt! we Gar-dena in gear-dagum<span class="element">&lt;/l&gt;</span>
&nbsp;<span class="element">&lt;l&gt;</span>þeod-cyninga þrym gefrunon,<span class="element">&lt;/l&gt;</span>
&nbsp;<span class="element">&lt;l&gt;</span>hu ða æþelingas ellen fremedon,<span class="element">&lt;/l&gt;</span>
<span class="element">&lt;/lg&gt;</span>
<span class="element">&lt;lg&gt;</span>
&nbsp;<span class="element">&lt;l&gt;</span>Oft <span class="element">&lt;persName&gt;</span>Scyld Scefing<span class="element">&lt;/persName&gt;</span> sceaþena þreatum,<span class="element">&lt;/l&gt;</span>
&nbsp;<span class="element">&lt;l&gt;</span>monegum mægþum meodo-setla ofteah;<span class="element">&lt;/l&gt;</span>
&nbsp;<span class="element">&lt;l&gt;</span>egsode <span class="element">&lt;orgName&gt;</span>Eorle<span class="element">&lt;/orgName&gt;</span>, syððan ærest wearþ<span class="element">&lt;/l&gt;</span>
&nbsp;<span class="element">&lt;l&gt;</span>feasceaft funden...<span class="element">&lt;/l&gt;</span>
<span class="element">&lt;/lg&gt;</span></div></div></div><div class="slide"><h2>... and </h2><div id="index.xml-egXML-d30e620" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;s&gt;</span>
&nbsp;<span class="element">&lt;w&nbsp;<span class="attribute">pos</span>="<span class="attributevalue">interj</span>"&nbsp;<span class="attribute">lemma</span>="<span class="attributevalue">hwaet</span>"&gt;</span>Hwæt<span class="element">&lt;/w&gt;</span>
&nbsp;<span class="element">&lt;w&nbsp;<span class="attribute">pos</span>="<span class="attributevalue">pron</span>"&nbsp;<span class="attribute">lemma</span>="<span class="attributevalue">we</span>"&gt;</span>we<span class="element">&lt;/w&gt;</span>
&nbsp;<span class="element">&lt;w&nbsp;<span class="attribute">pos</span>="<span class="attributevalue">npl</span>"&nbsp;<span class="attribute">lemma</span>="<span class="attributevalue">gar-denum</span>"&gt;</span>Gar-dena<span class="element">&lt;/w&gt;</span>
&nbsp;<span class="element">&lt;w&nbsp;<span class="attribute">pos</span>="<span class="attributevalue">prep</span>"&nbsp;<span class="attribute">lemma</span>="<span class="attributevalue">in</span>"&gt;</span>in<span class="element">&lt;/w&gt;</span>
&nbsp;<span class="element">&lt;w&nbsp;<span class="attribute">pos</span>="<span class="attributevalue">npl</span>"&nbsp;<span class="attribute">lemma</span>="<span class="attributevalue">gear-dagum</span>"&gt;</span>gear-dagum<span class="element">&lt;/w&gt;</span> ...
<span class="element">&lt;/s&gt;</span></div><p>or even</p><div id="index.xml-egXML-d30e635" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;w&nbsp;<span class="attribute">pos</span>="<span class="attributevalue">npl</span>"&nbsp;<span class="attribute">corresp</span>="<span class="attributevalue">#w2</span>"&gt;</span>Gar-dena<span class="element">&lt;/w&gt;</span>
<span class="element">&lt;w&nbsp;<span class="attribute">pos</span>="<span class="attributevalue">prep</span>"&nbsp;<span class="attribute">corresp</span>="<span class="attributevalue">#w3</span>"&gt;</span>in<span class="element">&lt;/w&gt;</span>
<span class="element">&lt;w&nbsp;<span class="attribute">pos</span>="<span class="attributevalue">npl</span>"&nbsp;<span class="attribute">corresp</span>="<span class="attributevalue">#w4</span>"&gt;</span>gear-dagum<span class="element">&lt;/w&gt;</span>
<span class="comment">&lt;!-- ... --&gt;</span>
<span class="element">&lt;w&nbsp;<span class="attribute">xml:id</span>="<span class="attributevalue">w2</span>"&gt;</span>armed danes<span class="element">&lt;/w&gt;</span>
<span class="element">&lt;w&nbsp;<span class="attribute">xml:id</span>="<span class="attributevalue">w3</span>"&gt;</span>in<span class="element">&lt;/w&gt;</span>
<span class="element">&lt;w&nbsp;<span class="attribute">xml:id</span>="<span class="attributevalue">w4</span>"&gt;</span>days of yore<span class="element">&lt;/w&gt;</span></div><p>.. not to mention ... </p><div id="index.xml-egXML-d30e651" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely-->
<span class="comment">&lt;!-- ... --&gt;</span><span class="element">&lt;l&gt;</span>Oft <span class="element">&lt;persName&nbsp;<span class="attribute">ref</span>="<span class="attributevalue">https://en.wikipedia.org/wiki/Skj%C3%B6ldr</span>"&gt;</span>Scyld Scefing<span class="element">&lt;/persName&gt;</span>
 sceaþena þreatum,<span class="element">&lt;/l&gt;</span></div><p>or even</p><div id="index.xml-egXML-d30e660" class="pre egXML_valid"><!--This otherwise redundant comment ensures egXMLs format nicely--><span class="element">&lt;l&gt;</span>Oft <span class="element">&lt;persName&nbsp;<span class="attribute">ref</span>="<span class="attributevalue">#skioldus</span>"&gt;</span>Scyld Scefing<span class="element">&lt;/persName&gt;</span> sceaþena þreatum,<span class="element">&lt;/l&gt;</span>
<span class="comment">&lt;!-- ... --&gt;</span>
<span class="element">&lt;person&nbsp;<span class="attribute">xml:id</span>="<span class="attributevalue">skioldus</span>"&gt;</span>
&nbsp;<span class="element">&lt;persName&nbsp;<span class="attribute">source</span>="<span class="attributevalue">#beowulf</span>"&gt;</span>Scyld Scefing<span class="element">&lt;/persName&gt;</span>
&nbsp;<span class="element">&lt;persName&nbsp;<span class="attribute">xml:lang</span>="<span class="attributevalue">la</span>"&gt;</span>Skioldus<span class="element">&lt;/persName&gt;</span>
&nbsp;<span class="element">&lt;persName&nbsp;<span class="attribute">xml:lang</span>="<span class="attributevalue">non</span>"&gt;</span>Skjöld<span class="element">&lt;/persName&gt;</span>
&nbsp;<span class="element">&lt;occupation&gt;</span>Legendary Norse King<span class="element">&lt;/occupation&gt;</span>
&nbsp;<span class="element">&lt;ref&nbsp;<span class="attribute">target</span>="<span class="attributevalue">https://en.wikipedia.org/wiki/Skj%C3%B6ldr</span>"&gt;</span>Wikipedia entry<span class="element">&lt;/ref&gt;</span>
<span class="comment">&lt;!-- ... --&gt;</span>
<span class="element">&lt;/person&gt;</span></div></div><div class="slide"><h2>Choices (2) </h2><p>How about this kind of object ... </p><div class="figure"><img src="media/londonLib.jpg" alt="A random shelf from the London Library" class="graphic" style=" height:80%;"></img><h2>A random shelf from the London Library</h2></div></div><div class="slide"><h2>The digital library model</h2><p class="box">What can you do with a million books?</p><ul><li class="item">Text is a bunch of page images backed up with OCR-generated transcription <ul><li class="item">analysable only as a bag of words</li></ul></li><li class="item">A mass of bibliographical data <ul><li class="item">begging the representativeness question</li></ul></li></ul></div><div class="slide"><h2>Distant Reading</h2><div class="figure"><img src="media/canonViz.png" alt="(From Ryan Heuser on Twitter recently)" class="graphic" style=" height:75%;"></img><h2>(From Ryan Heuser on Twitter recently)</h2></div><p class="box">"Designing a text-analysis program is necessarily an interpretative act, not a mechanical one, even if running the program becomes mechanistic." (Johanna Drucker, <span class="titlem">Why Distant Reading Isn’t</span>, 2017)</p></div><div class="slide"><h2>Choices (3) </h2><p>... or this kind of object </p><div class="figure"><img src="media/bigData.jpg" alt="" class="graphic" style=" height:80%;"></img><h2></h2></div></div><div class="slide"><h2>The linked data model</h2><div class="figure"><img src="media/LODmodel.png" alt="Hype?" class="graphic" style=" height:70%;"></img><h2>Hype?</h2></div><p class="box"><span class="q">‘LOD creates a store of machine-actionable data on which improved services can be built... facilitate the breakdown of the tyranny of domain silos ... provide direct access to data in ways that are not currently possible ... provide unanticipated benefits that will emerge later ’</span> (Anon, passim)</p><p>LOD is about linking web pages together... 
</p><ul><li class="item">The "meaning" of a set of TEI documents may be inherently complex, nuanced, internally contradictory, imprecise.</li><li class="item">The "meaning" of a web page supporting a bit of e-commerce is exhausted by its RDF description</li></ul></div><div class="slide"><h2>Wait ... </h2><ul><li class="item">Just how many markup systems/models does the world need?<ul><li class="item">One size fits all?</li><li class="item">Let a thousand flowers bloom?</li><li class="item">Roll your own! </li></ul></li><li class="item">We've been here before...<ul><li class="item">one construct and many views</li><li class="item">modularity and extensibility</li></ul></li></ul><p class="box">... did someone mention the TEI ?</p></div><div class="slide"><h2>Impact and effects of data-driven research</h2><ul><li class="item">The trend towards open data motivated (partly) by scientistic replicability</li><li class="item">The digital demotic : opening up of interdisciplinary possibilities -- and the cult of the amateur</li><li class="item">Some specific methodological considerations: <ul><li class="item">what are the underlying populations being sampled? </li><li class="item">what kind/s of standardisation work best? </li><li class="item">how reliably can disparate sources be analysed together?</li></ul></li></ul></div><div class="slide"><h2>Representativeness ... of what?</h2><p>Are there more novels published by men than by women in the 19th century? </p><div class="figure"><img src="media/authorship.png" alt="Data from http://www.victorianresearch.org/atcl/" class="graphic"></img><h2>Data from http://www.victorianresearch.org/atcl/</h2></div><p>How should we create a representative sample of this population?</p><ul><li class="item">by number (1 each from the first decade, 8 each from the last)?</li><li class="item">by variability (1 each from each decade)</li></ul><p>But what if we want to consider multiple categories? 
</p></div><div class="slide"><div class="frame"><div class="col"><h2>Aiming for a balanced corpus </h2><div class="figure"><h2>English</h2><img src="media/balance-eng.png" alt="English" class="graphic"></img><h2>Data from <a class="link_ref" href="http://distantreading.github.io/ELTeC/">http://distantreading.github.io/ELTeC/</a></h2></div></div><div class="col"><p>cultural difference or sampling error? </p><div class="figure"><h2>Hungarian</h2><img src="media/balance-hun.jpg" alt="Hungarian" class="graphic"></img></div></div></div></div><div class="slide"><h2>Is the TEI really an ontology?</h2><ul><li class="item">The TEI was originally conceived as a way of modelling the concepts researchers shared about the nature of the texts they wished to process</li><li class="item">Modelling the semantics of that set of concepts formally remains a research project </li><li class="item">Crosswalks or mappings to other "real" ontologies such as OWL are possible: one has been implemented for CIDOC-CRM</li><li class="item">(the TEI provides a hook in the shape of the <span class="gi">&lt;equiv&gt;</span> element)</li></ul><p class="box">But before we can extract or model their content, documents must be interpreted... </p></div><div class="slide"><h2>Data science and textual data</h2><p class="box">... and (maybe) the same applies to data</p><blockquote class="quote"><p>“All data is historical data: the product of a time, place, political, economic, technical, &amp; social climate. If you are not considering why your data exists, and other data sets don’t, you are doing data science wrong” </p></blockquote><p>[Melissa Terras, Opportunities, barriers, and rewards in digitally-led analysis of history, culture and society. 
Turing Lecture 2019-03-03, <a class="link_ptr" href="https://youtu.be/4yYytLUViI4"><span>https://youtu.be/4yYytLUViI4</span></a>]</p></div><div class="slide"><h2>A recent example: global reach versus situated context</h2><ul><li class="item">A nice simple well-understood domain: digital collections of 19th-century newspapers</li><li class="item">all professionally catalogued with extensive metadata</li><li class="item">all (mostly) accessible via standard interfaces</li></ul><p class="box">"attempts to create a single map of all possible elements and attributes, and to provide provenance of internal structures while grouping object by type and subtype, raised significant ontological issues" (Beals, M. H. et al., <span class="titlem">The Atlas of Digitised Newspapers and Metadata</span>, 2020)</p><p>Different institutional catalogues</p><ul><li class="item">describe the same items with differing degrees of completeness</li><li class="item">use similar but not identical terminology</li></ul><p>All institutional collections</p><ul><li class="item">reflect historically situated selection principles </li><li class="item">in particular, selection for digitization is motivated largely by economic considerations</li></ul></div><div class="slide"><h2>Conclusion</h2><div class="figure"><img src="media/behindDoorsUmbertoEco.jpg" alt="Still from Behind the Doors (Wehn-Damisch, 2012)" class="graphic" style=" height:50%;"></img><h2>Still from <span class="titlem">Behind the Doors</span> (Wehn-Damisch, 2012)</h2></div><p>Umberto says:</p><ul><li class="item">In spite of the obvious differences in degrees of certainty and uncertainty, every picture of the world (be it a scientific law or a novel) is a book in its own right, open to further interpretation. </li><li class="item">But certain interpretations can be recognised as unsuccessful because they are like a mule, that is, they are unable to produce new interpretations or cannot be confronted with the traditions of the previous interpretation. 
</li><li class="item">The force of the Copernican revolution is not only due to the fact that it explains some astronomical phenomena better than the Ptolemaic tradition, but also to the fact that – instead of representing Ptolemy as a crazy liar – it explains why and on which grounds he was justified in outlining his own interpretation</li></ul><p>We conclude...</p><p class="box"><span class="quote">‘Text is not a special kind of data: data is a special kind of text’</span></p></div></body></html>