Unleashing TEI and Plain Text Data for Textual Analysis, Visualization and Mining, or, Let’s Play with E-Text Data and Tools

Michelle Dalmau — Thu, 11 Apr 2013 19:49:06 +0000

Motivated by a recent mock keynote debate, “A Matter of Scale,” presented by Matt Jockers and Julia Flanders as part of the Boston Area Days of Digital Humanities Conference, and the imperative that librarians involved with many things “digital” learn not only how to build tools, in this case for textual analysis, but leverage existing tools to support teaching and research endeavors rooted in the text. Coming from the tool-building perspective and tradition, I seldom have time to explore existing tools for textual analysis. This is partly because at IU we are so vested in textual markup following the TEI Guidelines for which few external tools exist that act on the markup (thus our focus on building). But as is the case with many academic libraries attempting to balance scale of digital production, we are not always in the position to build boutique interfaces, tools and functions for hand-crafted markup. Further, often early research inquiries can be better defined if not answered by initially playing and experimenting with raw data sets before embarking on markup. Finally, after many years of leading e-text initiatives and championing the TEI, I would love to sit around with folks and compare and contrast, not just the possibilities, but also the outcomes of real research inquiries that formed the basis for many of the TEI collections I am offering up to the community for experimentation. In other words, what can we ascertain without/beyond the markup and can those very queries yield answers regardless of the markup?

The other motivator for this session is two-fold. At IU we’ve always exposed the TEI/XML, but at the most atomic level. I am exploring workflows moving forward in which we batch not only the TEI but other versions of the data, primarily plain text, for easier harvesting and re-purposing. One reason for doing this – there are many good ones – is that we want to demonstrate to our faculty partners the possibilities of sharing data in this way. The content can and should be analyzed, parsed, and remixed outside of the context of it’s collection site for broader impact and exposure. I am hoping, with your help, to figure out how to best push versions of this data into the flow, around a more formal call, initially, to the digital humanities community-at-large so I can track the various morphings and instantiations of this data to share back with the IU community, especially my faculty partners.

I recently blogged about this very concern on Day of DH 2013. So this is the first step of a multi-step process that I would like to see culminate in a greater unleashing of XML and plain text data (later summer / early Fall?). And would love your input and contributions.

This session is by no means limited to the following e-text data sets I will provide (data access details forthcoming):

Indiana Magazine of History (one the nation’s oldest scholarly historical journal, 1905-2011; TEI P4)
Victorian Women Writers Project (1830-1929; TEI P5)
Indiana Authors and Their Books ( 1850-1929; TEI Lite P4)
Brevier Legislative Reports (transcripts from Indiana Legislature, 1858-1887; TEI P5)
Wright American Fiction (undergoing web site migration, 1851-1875; TEI P5)

Serve up or use your own e-text data sets of interest.

Nor is it limited by the following tools I have identified for starters:

TXM (see also wiki.tei-c.org/index.php/TXM) for TEI files; installation required
PhiloLogic/PhiloMine also for TEI files but maybe too much overhead to get started; installation required
Mallet for topic-modeling; see also recent article by Elijah Meeks and Scott B. Weingart:journalofdigitalhumanities.org/2-1/dh-contribution-to-topic-modeling/
VUE for visualization and other analyses; registration and installation required
Voyant for textual analysis; web-based
MONK for web-based textual analysis of pre-defined data sets (MONK collections which includes Wright American Fiction or CIC collections)

In fact, it would be best to partner up with folks who are a familiar with a particular tool. Vote for this session and come to this session, claim a tool!

PS All data will be posted on this public-facing wiki page: wiki.dlib.indiana.edu/x/WYK2Hg.

PPS I would like to thank my intern extraordinaire, Beth Gucinski, our server admin, Brian Wheeler, and the smartest lead developer ever, Randall Floyd. Thanks for putting up with all my last minute requests. You guys rock.

Come Play with Ngram Viewer

Judith Arnold — Fri, 05 Apr 2013 13:27:51 +0000

I am a complete novice in this realm but I want to learn. I am hoping there are some more like-minded attendees who will enjoy a truly exploratory session. And I also hope someone out there will propose a session on the Digital Public Library of America (DPLA).

Come Play with Ngram Viewer
A “talk and play” session proposed by Judith Arnold, Wayne State University (Detroit)

One of my colleagues showed me this tool and I was immediately taken with it. I am proposing a Play and Talk session (45 to 60 minutes) where we play with this tool for 20 minutes or so and then spend the rest of the time discussing how different disciplines (or even ourselves in library science) might use this tool. Google Ngram Viewer (books.google.com/ngrams/) visualizes word frequency in the corpus of Google Books. You can chart the occurrence of words over time and even in advanced features determine how the word is used (through part of speech tags), so I can imagine that linguists, for example, would be interested. I experimented with words for an upcoming instruction session on literary studies to see what Ngram Viewer could show me about the prominence of different authors over time and I generated some interesting graphs, which I will bring with me. So bring your iPad or laptop and join me in some exploration, fun, and idea exchanges.

Text Mining – THATCamp ACRL 2013

Unleashing TEI and Plain Text Data for Textual Analysis, Visualization and Mining, or, Let’s Play with E-Text Data and Tools

Come Play with Ngram Viewer