Big Data and the Road to Scientific Maturity

Public defence of PhD thesis by Johanna Einsiedler.

Assessment committee

  • Associate Professor Karen-Inge Karstoft, University of Copenhagen
  • Associate Professor Eric James Auerbach, Northwestern University
  • Associate Professor Clemens Stachl, University of St. Gallen

Supervisor

  • Associate Professor Andreas Bjerre-Nielsen, University of Copenhagen

 

This dissertation explores different approaches to using big data to enhance the credibility of social science research by improving the falsifiability, generalizability, and precision of theories. While empirical social science aspires to provide predictive and causal insights, its progress is often hindered by high causal density, vague theoretical formulations, and operationalization challenges. Although big data is not a definitive solution to these challenges, it is a valuable tool for mitigating some of them: it enables large-scale analyses, reduces reactivity biases, and allows for more granular examinations of social phenomena. Across four studies, I demonstrate how diverse big data sources, ranging from administrative registers to digital trace data, can contribute to improving social science research. First, I show how integrating register data with personality research helps assess sample representativeness and the risk of bias due to selective participation. Second, I use job ad data to investigate how the introduction of large language models (LLMs) has reshaped online labor markets, demonstrating the capacity of computational techniques to convert large, unstructured datasets into valuable research data. Third, I explore methods for categorizing digital actors in online environments, comparing algorithmic approaches to enhance research efficiency and scalability. Finally, I combine mobile phone traces with new econometric approaches to examine how randomized group assignments shape students' social networks, offering new insights into the roles of social foci, homophily, and triadic closure in network formation.

Beyond empirical contributions, this dissertation also advances the methodology for working with big data in social research. I develop and apply novel techniques, including natural language processing pipelines, machine learning classifiers, and network estimation methods, to address challenges related to data preprocessing, measurement validity, and computational complexity. While big data is not a remedy for all of the challenges facing social science, its thoughtful application provides new opportunities to increase the rigor, reproducibility, and theoretical clarity of the discipline.