Unleashing TEI and Plain Text Data for Textual Analysis, Visualization and Mining, or, Let’s Play with E-Text Data and Tools

This session is motivated by a recent mock keynote debate, “A Matter of Scale,” presented by Matt Jockers and Julia Flanders as part of the Boston Area Days of Digital Humanities Conference, and by the imperative that librarians involved with many things “digital” learn not only how to build tools, in this case for textual analysis, but also how to leverage existing tools to support teaching and research endeavors rooted in the text. Coming from the tool-building perspective and tradition, I seldom have time to explore existing tools for textual analysis. This is partly because at IU we are so invested in textual markup following the TEI Guidelines, for which few external tools exist that act on the markup (thus our focus on building). But as is the case with many academic libraries attempting to balance scale of digital production, we are not always in a position to build boutique interfaces, tools and functions for hand-crafted markup. Further, early research inquiries can often be better defined, if not answered, by playing and experimenting with raw data sets before embarking on markup. Finally, after many years of leading e-text initiatives and championing the TEI, I would love to sit around with folks and compare and contrast not just the possibilities but also the outcomes of real research inquiries that formed the basis for many of the TEI collections I am offering up to the community for experimentation. In other words, what can we ascertain without or beyond the markup, and can those very queries yield answers regardless of the markup?

The other motivator for this session is two-fold. At IU we’ve always exposed the TEI/XML, but at the most atomic level. I am exploring workflows moving forward in which we batch not only the TEI but also other versions of the data, primarily plain text, for easier harvesting and re-purposing. One reason for doing this (there are many good ones) is that we want to demonstrate to our faculty partners the possibilities of sharing data in this way. The content can and should be analyzed, parsed, and remixed outside of the context of its collection site for broader impact and exposure. I am hoping, with your help, to figure out how best to push versions of this data into the flow, initially around a more formal call to the digital humanities community-at-large, so I can track the various morphings and instantiations of this data and share them back with the IU community, especially my faculty partners.
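To make the idea of a plain-text version a bit more concrete, here is a minimal Python sketch of how a TEI file might be flattened into plain text. It is only an illustration under stated assumptions, not the workflow we use at IU: the function name `tei_to_plain_text` and the sample document are hypothetical, and the sketch assumes either TEI P5 (with the standard TEI namespace) or an older un-namespaced encoding.

```python
import re
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace, in ElementTree's "{uri}tag" notation.
TEI_NS = "{http://www.tei-c.org/ns/1.0}"


def tei_to_plain_text(tei_xml: str) -> str:
    """Return the text content of a TEI document with all markup removed.

    Hypothetical helper for illustration only. It keeps the content of
    <text> and skips the <teiHeader> metadata.
    """
    root = ET.fromstring(tei_xml)

    # Look for a namespaced <text> child first (TEI P5).
    body = root.find(f"{TEI_NS}text")
    if body is None:
        # Assumption: older files may carry no namespace at all.
        body = root.find("text")
    if body is None:
        # Fall back to the whole document if no <text> element is found.
        body = root

    # Concatenate every text node, then collapse runs of whitespace.
    raw = "".join(body.itertext())
    return re.sub(r"\s+", " ", raw).strip()


# A tiny, made-up TEI document used only to exercise the function.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body><p>Hello, <hi rend="italic">plain</hi> text.</p></body></text>
</TEI>"""

if __name__ == "__main__":
    print(tei_to_plain_text(SAMPLE))
```

A flattening like this throws away everything the markup encodes (structure, names, editorial interventions), which is exactly the trade-off this session is meant to explore.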

I recently blogged about this very concern on Day of DH 2013. So this is the first step of a multi-step process that I would like to see culminate in a greater unleashing of XML and plain text data (late summer / early fall?). I would love your input and contributions.

This session is by no means limited to the following e-text data sets I will provide (data access details forthcoming):

Serve up or use your own e-text data sets of interest.

Nor is it limited by the following tools I have identified for starters:

In fact, it would be best to partner up with folks who are familiar with a particular tool. Vote for this session, come to it, and claim a tool!

PS  All data will be posted on this public-facing wiki page: wiki.dlib.indiana.edu/x/WYK2Hg.

PPS  I would like to thank my intern extraordinaire, Beth Gucinski, our server admin, Brian Wheeler, and the smartest lead developer ever, Randall Floyd.  Thanks for putting up with all my last-minute requests.  You guys rock.

Categories: Data Mining, Open Access, Session Proposals, Text Mining, Visualization

About Michelle Dalmau

I am the Acting Head of Digital Collections Services at the IU Libraries. Previously, I was the Digital Projects Librarian for the Digital Collections Services group and Digital Library Program at the IU Libraries, where I was responsible for coordinating and managing digital library projects with a particular focus on electronic text projects. I have been actively participating in the DH community since 2005, but I contributed to DH-related projects at IU as early as 2002. I am the co-editor of the Victorian Women Writers Project (http://www.dlib.indiana.edu/collections/vwwp/), and participate in many other DH endeavors: I am part of the editorial technical staff for DHQ, Co-chair of the TEI Libraries SIG, and other stuff I can't remember right now.
