Quantitative Text Analysis
Jonathan Slapin, University of Houston
10 - 21 August (two week course / 35 hrs)
Jonathan Slapin will join the University of Essex as Professor in the Department of Government. Previously he was Associate Professor of Political Science at the University of Houston and Lecturer at Trinity College, Dublin. His research interests are in quantitative comparative politics and political institutions in European democracies. His most recent book, co-authored with Sven-Oliver Proksch, is entitled “The Politics of Parliamentary Debate: Parties, Rebels and Representation” and is published by Cambridge University Press.
- The course surveys methods for systematically extracting quantitative information from text for social scientific purposes. It starts with classical content analysis and dictionary-based methods, moves on to classification methods, and ends with state-of-the-art scaling methods and topic models for estimating quantities from text using statistical techniques. The course lays a theoretical foundation for text analysis but mainly takes a practical and applied approach, so that students learn how to apply these methods in actual research. All of the methods can be reduced to a three-step process: first, identifying texts and units of texts for analysis; second, extracting quantitatively measured features from the texts (such as coded content categories, word counts, word types, dictionary counts, or parts of speech) and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The course surveys these methods in a logical progression, and each technique will be applied to real texts in hands-on lab sessions using appropriate software.
- The course also covers many fundamental issues in quantitative text analysis, such as inter-coder agreement, reliability, validation, accuracy, and precision. It focuses on methods of converting texts into quantitative matrices of features, and then analysing those features using statistical methods. The course briefly covers the qualitative technique of human coding and annotation (classical content analysis), but the main focus is on more automated approaches. These include dictionary construction and application, classification and machine learning, scaling models, and topic models. For each topic, we will systematically cover published applications and examples of these methods from a variety of disciplinary and applied fields, including political science, economics, sociology, media and communications, marketing, finance, social policy, and health policy. Lessons will combine theoretical grounding in content analysis approaches and techniques with hands-on analysis of real texts using content analytic and statistical software.
- Students in this course should have prior knowledge in the following areas:
- 1) An understanding of probability and statistics at the level of an intermediate postgraduate social science course. An understanding of regression analysis is presumed, and some basic understanding of maximum likelihood would be useful. The course is not heavily mathematical or statistical, but students without the prerequisite level of quantitative experience will find the second week in particular difficult to follow.
- 2) Basic familiarity with the R statistical language. Stata may also be used, but the lab sessions will be designed to use R coupled with a customized R library designed by Ken Benoit. The library is in development and available from http://github.com/kbenoit/quanteda.
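For readers who want a concrete picture of the three-step process described in the course outline (texts, then a matrix of features, then analysis), the following is a minimal sketch. It is written in plain Python with the standard library only and is not the course's lab material, which uses R. The two toy documents, the `tokenize` helper, and the two-category dictionary are made up for illustration.

```python
from collections import Counter
import re

# Step 1: identify the texts (toy, made-up documents for illustration only).
texts = {
    "doc_a": "We will cut taxes and cut spending.",
    "doc_b": "We will raise spending on health and schools.",
}


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


# Step 2: extract features (word counts) and arrange them as a
# document-feature matrix: one row per document, one column per word.
counts = {name: Counter(tokenize(text)) for name, text in texts.items()}
vocabulary = sorted(set().union(*counts.values()))
matrix = [[counts[name][word] for word in vocabulary] for name in texts]

# Step 3: analyse the matrix. Here a hypothetical two-category dictionary
# is applied, and each document's share of tokens per category is computed.
dictionary = {"cut": {"cut", "taxes"}, "spend": {"raise", "spending"}}


def category_shares(row):
    """Return the share of a document's tokens that fall in each category."""
    total = sum(row)
    return {
        category: sum(n for word, n in zip(vocabulary, row) if word in words) / total
        for category, words in dictionary.items()
    }


shares = {name: category_shares(row) for name, row in zip(texts, matrix)}

for name, result in shares.items():
    print(name, result)
```

A dictionary count is only one possible third step. The same matrix could instead be passed to a classifier, a scaling model, or a topic model, which is the progression the course follows.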
Representative Background Reading
- Laver, Michael, Kenneth Benoit and John Garry (2003) “Extracting Policy Positions from Political Texts Using Words as Data” American Political Science Review 97(2): 311-32.
- Slapin, Jonathan B and Sven-Oliver Proksch (2008) “A Scaling Model for Estimating Time-Series Party Positions from Texts” American Journal of Political Science 52(3): 705-722.
Required Texts (provided to registered participants on arrival)
- Krippendorff, K. (2013). Content Analysis: An Introduction to Its Methodology. Sage, Thousand Oaks, CA, 3rd edition