Analyzing Big Data
Christopher Fariss, Penn State University
27 July - 7 August (two week course / 35 hrs)
Christopher Fariss is currently the Jeffrey L. Hyde and Sharon D. Hyde and Political Science Board of Visitors Early Career Professor in Political Science and Assistant Professor in the Department of Political Science at Penn State University. To date, he has taught courses on research design, measurement, and human rights. For his research, he uses computational methods to understand why governments around the world choose to torture, maim, and kill individuals within their jurisdiction. Other projects cover a broad array of themes, ranging from foreign aid to American voting behaviour, but share a focus on computationally intensive methods and research design. These methodological tools, essential for analysing “big data”, open up new insights into the micro-foundations of state repression.
- This course focuses on the research design and analysis tools used to explore and understand “big data”. The fundamentals of research design are the same throughout the social sciences; however, the topical focus of this class is on computationally intensive data-generating processes and the research designs used to understand and manipulate such data at scale. By massive or large scale, I mean one of three things: there are many subjects/connections/units/rows in the data (e.g., social network data like the kind available from Facebook or Twitter); there are many variables/items/columns in the data (e.g., text data with many thousands of columns that represent the words in the document corpus); or the selected analytical tool is a computationally complex algorithm (e.g., a Bayesian simulation for modelling a latent variable or a random forest model for exploratory data analysis). A given project may also combine these three issues. The course will provide students with the tools to design observational studies of, and experimental interventions in, large and unstructured data sets at increasingly massive scales and at different degrees of computational complexity.
- Students will learn how to design studies that take advantage of the wealth of information contained in new massive-scale online datasets, such as data available from Facebook and Twitter and the many newly digitized document corpora now available online. The focus of the course is on designing studies in such a way as to maximize the validity of inferences obtained from these complex datasets.
- Students should have some familiarity with concepts from research design and statistics. Generally, exposure to these concepts occurs during the first-year course in a typical PhD program in political science. Students should have at least some exposure to the R computing environment. The more familiarity with R, the better.
- Carpenter, Bob, Andrew Gelman, Matt Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A. Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. “Stan: A Probabilistic Programming Language.” Journal of Statistical Software.
- Gelman, Andrew, and Jennifer Hill. 2007. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge: Cambridge University Press.
- Stan Development Team. 2015. “Stan Modeling Language: User's Guide and Reference Manual. Version 2.6.0.” http://mc-stan.org/manual.html
- Please note this book will be provided by the Summer School on participants' arrival.
- Matloff, Norman. 2011. The Art of R Programming: A Tour of Statistical Software Design. No Starch Press.