Unleashing TEI and Plain Text Data for Textual Analysis, Visualization and Mining, or, Let’s Play with E-Text Data and Tools

This session is motivated by a recent mock keynote debate, “A Matter of Scale,” presented by Matt Jockers and Julia Flanders as part of the Boston Area Days of Digital Humanities Conference, and by the imperative that librarians involved with many things “digital” learn not only how to build tools, in this case for textual analysis, but also how to leverage existing tools to support teaching and research endeavors rooted in the text. Coming from the tool-building perspective and tradition, I seldom have time to explore existing tools for textual analysis. This is partly because at IU we are so invested in textual markup following the TEI Guidelines, for which few external tools exist that act on the markup (thus our focus on building). But as is the case with many academic libraries attempting to balance the scale of digital production, we are not always in a position to build boutique interfaces, tools and functions for hand-crafted markup. Further, early research inquiries can often be better defined, if not answered, by initially playing and experimenting with raw data sets before embarking on markup. Finally, after many years of leading e-text initiatives and championing the TEI, I would love to sit around with folks and compare and contrast not just the possibilities but also the outcomes of real research inquiries that formed the basis for many of the TEI collections I am offering up to the community for experimentation. In other words, what can we ascertain without or beyond the markup, and can those very queries yield answers regardless of the markup?

The other motivator for this session is two-fold. At IU we’ve always exposed the TEI/XML, but at the most atomic level. I am exploring workflows moving forward in which we batch not only the TEI but other versions of the data, primarily plain text, for easier harvesting and re-purposing. One reason for doing this (there are many good ones) is that we want to demonstrate to our faculty partners the possibilities of sharing data in this way. The content can and should be analyzed, parsed, and remixed outside of the context of its collection site for broader impact and exposure. I am hoping, with your help, to figure out how best to push versions of this data into the flow, initially around a more formal call to the digital humanities community-at-large, so I can track the various morphings and instantiations of this data and share them back with the IU community, especially my faculty partners.
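For anyone who wants to try the TEI-to-plain-text step on their own files before the batched data is posted, here is a minimal sketch in Python using only the standard library. It is not the IU workflow; the function names and the directory layout are hypothetical, and it assumes well-formed XML with the transcription inside a `<text>` element (namespaced for P5, un-namespaced for P4). P4 files that rely on DTD-defined entities may not parse without extra handling, and adjacent elements with no whitespace between them will run together.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

# TEI P5 documents declare this namespace; P4 documents typically declare none.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def extract_text(root):
    """Return the text content of the <text> element, skipping the teiHeader."""
    text_el = root.find(f".//{{{TEI_NS}}}text")  # P5 (namespaced)
    if text_el is None:
        text_el = root.find(".//text")  # P4 (no namespace)
    if text_el is None:
        text_el = root  # fall back to the whole document
    raw = "".join(text_el.itertext())
    # Collapse runs of whitespace left over from pretty-printed markup.
    return " ".join(raw.split())


def tei_string_to_text(xml_string):
    """Convenience wrapper for a TEI document held in a string."""
    return extract_text(ET.fromstring(xml_string))


def convert_directory(src_dir, dst_dir):
    """Hypothetical batch step: write one .txt file per .xml file in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for xml_path in sorted(Path(src_dir).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        out_path = dst / (xml_path.stem + ".txt")
        out_path.write_text(extract_text(root), encoding="utf-8")
```

Calling `convert_directory("tei", "plain")` (both directory names are placeholders) would leave a folder of plain text files that can be loaded into a web-based tool such as Voyant or fed to Mallet.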

I recently blogged about this very concern on Day of DH 2013. So this is the first step of a multi-step process that I would like to see culminate in a greater unleashing of XML and plain text data (late summer / early fall?). I would love your input and contributions.

This session is by no means limited to the following e-text data sets I will provide (data access details forthcoming):

Indiana Magazine of History (one of the nation’s oldest scholarly historical journals, 1905-2011; TEI P4)
Victorian Women Writers Project (1830-1929; TEI P5)
Indiana Authors and Their Books (1850-1929; TEI Lite P4)
Brevier Legislative Reports (transcripts from Indiana Legislature, 1858-1887; TEI P5)
Wright American Fiction (undergoing web site migration, 1851-1875; TEI P5)

Serve up or use your own e-text data sets of interest.

Nor is it limited by the following tools I have identified for starters:

TXM (see also wiki.tei-c.org/index.php/TXM) for TEI files; installation required
PhiloLogic/PhiloMine also for TEI files but maybe too much overhead to get started; installation required
Mallet for topic-modeling; see also recent article by Elijah Meeks and Scott B. Weingart: journalofdigitalhumanities.org/2-1/dh-contribution-to-topic-modeling/
VUE for visualization and other analyses; registration and installation required
Voyant for textual analysis; web-based
MONK for web-based textual analysis of pre-defined data sets (MONK collections, which include Wright American Fiction, or CIC collections)

In fact, it would be best to partner up with folks who are familiar with a particular tool. Vote for this session, come to this session, and claim a tool!

PS All data will be posted on this public-facing wiki page: wiki.dlib.indiana.edu/x/WYK2Hg.

PPS I would like to thank my intern extraordinaire, Beth Gucinski, our server admin, Brian Wheeler, and the smartest lead developer ever, Randall Floyd. Thanks for putting up with all my last-minute requests. You guys rock.

1 Response to Unleashing TEI and Plain Text Data for Textual Analysis, Visualization and Mining, or, Let’s Play with E-Text Data and Tools

Krista White says:

April 12, 2013 at 1:51 pm

Here is the GoogleDoc for our session:

docs.google.com/document/d/1Z2PlL-U7zTI1ONfZLnvvfUTPEUaDDqPBet3q6iyEF_A/edit?usp=sharing


About Michelle Dalmau
