Text as Data

Textual data on human communication abound. We leave textual traces of a great variety of our everyday doings. Our intimate concerns are formulated in Google searches, we coordinate community initiatives in Facebook groups, and we articulate political ideologies on social media platforms. Text has become a fundamental medium through which many people interact, express and position themselves. In the digital society, the analysis, categorization and organization of textual content is of the utmost importance to governments, businesses, the press and academics alike. Computational text analysis is one of the central fields within the wider data revolution. Many methods are being imported into the social sciences from other fields, especially computer science. But concerns about bias in text models and about interpretive validity are growing within academia and beyond.

This lecture series starts from the premise that text is not just data but social data. Texts are tied to specific contexts and cultural practices. These properties give text data its great potential, but also its high risk of being misinterpreted and misclassified, which undermines both interpretation and measurement. More generally, textual data present challenges that range from core concerns within machine learning to deep methodological issues in the social sciences regarding quantification, interpretation and how to combine qualitative and quantitative modes of analysis. For this lecture series we have invited scholars who have made valuable contributions to the interdisciplinary field of computational text analysis.


A network approach to topic models

Our first speaker of the fall lecture series is Tiago de Paula Peixoto, Associate Professor at the Department of Network and Data Science, Central European University.

Abstract 

One of the main computational and scientific challenges in the modern age is to extract useful information from unstructured texts. Topic models are one popular machine-learning approach that infers the latent topical structure of a collection of documents. Despite their success—particularly of the most widely used variant called latent Dirichlet allocation (LDA)—and numerous applications in sociology, history, and linguistics, topic models are known to suffer from severe conceptual and practical problems, for example, a lack of justification for the Bayesian priors, discrepancies with statistical properties of real texts, and the inability to properly choose the number of topics. We obtain a fresh view of the problem of identifying topical structures by relating it to the problem of finding communities in complex networks. We achieve this by representing text corpora as bipartite networks of documents and words. By adapting existing community-detection methods (using a stochastic block model (SBM) with nonparametric priors), we obtain a more versatile and principled framework for topic modeling (for example, it automatically detects the number of topics and hierarchically clusters both the words and documents). The analysis of artificial and real corpora demonstrates that our SBM approach leads to better topic models than LDA in terms of statistical model selection. Our work shows how to formally relate methods from community detection and topic modeling, opening the possibility of cross-fertilization between these two fields.

The lecture will take place in building 1, 2nd floor, room 26 (1.2.26) of the CSS Campus, University of Copenhagen, from 11.00 am to 12.30 pm.

If you have questions or want to know more, please write to Sophie Smitt Sindrup Grønning at sophiegroenning@samf.ku.dk.