These slides are from a presentation I gave at the Duke Libraries on September 20, 2012, as part of their Text > Data speaker series, and again on March 5, 2013, as part of their RCR Forum series. You may also be interested in this archive of Adeline Koh’s live-tweeting of the presentation.
Below is a list of resources suitable for further exploration of the topics I covered. This list is incomplete and heavily biased toward text analysis in the humanities rather than the social sciences more broadly. However, it should provide a good starting point for learning about text analysis research methods.
- The edited volume A Companion to Digital Humanities is freely available online and has several chapters focused on text analysis in the humanities.
- Susan Hockey’s Electronic Texts in the Humanities is a comprehensive look at the history and practice of digital text encoding in the humanities.
- Michael Piotrowski’s Natural Language Processing for Historical Texts covers everything from acquiring digitized historical texts to text encoding and annotation schemes and natural language processing tools for historical languages.
- Literary and Linguistic Computing is the primary journal publishing humanist scholarship that uses computational text analysis. It also publishes the proceedings of the annual Digital Humanities Conference.
- Digital Humanities Quarterly publishes a wide array of digital humanities scholarship, including some articles focused on text analysis.
- Ted Underwood’s Where to Start with Text Mining is a gentle introduction to text mining, originally developed for a conference of Romanticists.
- Justin Grimmer and Brandon Stewart have a very accessible pre-publication article on Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts, which I drew heavily upon for this workshop.
- Brendan O’Connor, David Bamman, and Noah Smith’s article on Computational Text Analysis for Social Science gets rather technical, but the introduction provides a good overview and bibliography of computational text analysis in the humanities and social sciences.
- D. Sculley and Bradley Pasanek’s article Meaning and mining: the impact of implicit assumptions in data mining for the humanities offers some recommended best practices for making results from text mining as meaningful as possible.
- CasualConc is a text concordancing tool for Mac OS. (This is the one I demonstrated in the workshop.) AntConc is a comparable tool for Windows.
- MALLET is widely used for topic modeling. MALLET itself is intended for use from the command line, but a separate graphical user interface is available. (I also demonstrated this in the workshop.) An alternative to MALLET is the Stanford Topic Modeling Toolbox.
- Bamboo Digital Research Tools (DiRT) lists hundreds of useful tools, including many for text mining.
- Stanford University’s Tooling Up for Digital Humanities is an excellent resource and includes accessible introductions to digitization and text analysis.
- The Programming Historian is a community-driven collaborative textbook with lessons covering topics from working with text files to topic modeling.
- The advice offered in Getting Started in the Digital Humanities is not specific to text analysis but is useful nonetheless.
- Sapping Attention is the blog of Ben Schmidt. Schmidt is a graduate student in history at Princeton University, and the Visiting Graduate Fellow at the Cultural Observatory at Harvard, where he helped create Bookworm. He uses text mining of large corpora to study the history of concepts, and blogs regularly about the techniques he uses.
- The Stone and the Shell is the blog of Ted Underwood. Underwood is an English professor at the University of Illinois using text mining to study eighteenth- and nineteenth-century literature. He blogs his experiments, often providing data and code as well.
- Lisa @ Work is the blog of Lisa Rhody, a Ph.D. candidate in English at the University of Maryland who is applying computational text analysis to ekphrastic poetry (poems that take the visual arts as their subject) by contemporary women poets.
- Scott Weingart is a doctoral student in the digital humanities with a knack for explaining complicated topics on his blog. See for example his post on Topic Modeling for Humanists.
- Computer scientist and linguist Chris Manning taught a tutorial on Natural Language Processing Tools for the Digital Humanities.
- Political scientist Justin Grimmer taught a course on Text as Data at Stanford University.
- Political scientist Kenneth Benoit taught a summer school course in Computer-Assisted Text Analysis at the University of Essex.
- Digital humanist Matthew Jockers taught a course focusing on computational analysis of works written by Virginia Woolf.
- Digital humanist Willard McCarty taught a course on corpus-based text analysis at King’s College London.
- See more computational / algorithmic / quantitative humanities syllabi at the scottbot irregular.