2F Quantitative Text Analysis
Kenneth Benoit, London School of Economics and Political Science
22 July - 2 August (two-week course / 35 hrs)
Detailed Course Outline [PDF]
Course Content
The course surveys methods for systematically extracting quantitative information from text for social scientific purposes, moving from classical content analysis and dictionary-based methods, through classification methods, to state-of-the-art scaling methods and topic models for estimating quantities from text using statistical techniques. The course lays a theoretical foundation for text analysis but takes a primarily practical and applied approach, so that students learn how to apply these methods in actual research. The common focus across all methods is that they can be reduced to a three-step process: first, identifying texts and units of texts for analysis; second, extracting from the texts quantitatively measured features (such as coded content categories, word counts, word types, dictionary counts, or parts of speech) and converting these into a quantitative matrix; and third, using quantitative or statistical methods to analyse this matrix in order to generate inferences about the texts or their authors. The course surveys these methods in a logical progression, with a hands-on approach in which each technique is applied in lab sessions using appropriate software on real texts.
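The three-step process described above can be sketched in a few lines of code. This is a minimal illustration in plain Python (the course itself uses WordStat and R); the example documents and the chosen feature, simple word counts, are invented for illustration only.

```python
from collections import Counter
import re

# Step 1: identify the texts and units of analysis
# (here, three invented one-sentence "documents").
texts = {
    "doc1": "Taxes should be lower and spending should be lower.",
    "doc2": "Public spending on health must rise.",
    "doc3": "Lower taxes encourage growth.",
}

def tokenize(text):
    """Extract word features: lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Step 2: extract quantitatively measured features (word counts)
# and convert them into a document-by-feature matrix.
vocab = sorted({w for t in texts.values() for w in tokenize(t)})
counts = {doc: Counter(tokenize(t)) for doc, t in texts.items()}
matrix = [[counts[doc][w] for w in vocab] for doc in texts]

# Step 3: analyse the matrix to generate an inference about the texts,
# e.g. the relative frequency of the word "lower" in each document.
for doc, row in zip(texts, matrix):
    rel_freq = counts[doc]["lower"] / sum(row)
    print(doc, round(rel_freq, 3))
```

Statistical methods covered later in the course (dictionaries, classifiers, scaling models, topic models) all operate on a matrix built in essentially this way, differing only in the features extracted in step 2 and the model applied in step 3.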
Course Objectives
The course covers many fundamental issues in quantitative text analysis, such as inter-coder agreement, reliability, validation, accuracy, and precision, and surveys the main techniques: human coding (classical content analysis), dictionary approaches, classification methods, and scaling models. It also includes systematic consideration of published applications and examples of these methods from a variety of disciplinary and applied fields, including political science, economics, sociology, media and communications, marketing, finance, social policy, and health policy. Lessons will consist of a mixture of theoretical grounding in content analysis approaches and techniques with hands-on analysis of real texts using content-analytic and statistical software.
Course Prerequisites
Ideally, students in this course will have prior knowledge in the following areas:
- An understanding of probability and statistics at the level of an intermediate postgraduate social science course; an understanding of regression analysis is presumed. This course is not heavily mathematical or statistical, but students without the prerequisite level of quantitative experience will find the second week in particular difficult to follow. However, it will be possible to apply all of the methods covered using the WordStat software in all but the last two sessions of the course, even without a full grasp of the statistical workings of each method.
- Willingness and ability to use the WordStat/QDAMiner software, a commercial package developed by Provalis Research. This software will be used for all but the last two lessons, although the R library (see next item) may also be used for this purpose.
- Familiarity with the R statistical package. Stata may also be used, but the lab sessions will be designed to use R coupled with a customized R library designed by the instructor.
Representative Background Reading
The staple book readings for this course will be Neuendorf (2002) and Krippendorff (2004). Where possible, all other readings will be downloadable as PDFs from the course web pages. Reading material will also include draft introductory chapters of a text currently being written by the instructor, entitled The Quantitative Analysis of Textual Data for the Social Sciences.
