Better Data for NLP

Modern natural language processing is not only about models, but also about data. The decisions we make in collecting, curating, pre-processing, and annotating data have a significant effect on the final systems, and we still have a long way to go.

The projects in this area are broadly concerned with issues in collecting the data used for training NLP models, and with creating benchmarks that can reliably estimate their capabilities. This includes such aspects of data collection as sampling, representativeness, and annotation practices.



The wave of deep learning in NLP arrived together with the increasing popularity of large crowdsourced datasets, but the limitations of this approach are becoming increasingly clear.

- an ongoing project in collaboration with the University of Massachusetts, dedicated to the development of resources for temporal relation extraction.

Prior relevant work by the current SODAS staff:

Rogers, A., Kovaleva, O., Downey, M., & Rumshisky, A. (2020). Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 8722–8731.

Rogers, A., Smelkov, G., & Rumshisky, A. (2019). NarrativeTime: Dense High-Speed Temporal Annotation on a Timeline. arXiv:1908.11443 [cs].

Rogers, A., Romanov, A., Rumshisky, A., Volkova, S., Gronas, M., & Gribov, A. (2018). RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian. Proceedings of the 27th International Conference on Computational Linguistics, 755–763.

Karpinska, M., Li, B., Rogers, A., & Drozd, A. (2018). Subcharacter Information in Japanese Embeddings: When Is It Worth It? Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, 28–37.

Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn’t. Proceedings of the NAACL-HLT SRW, 47–54.


It is increasingly clear that while current deep learning systems can fit any supervised NLP task that researchers have so far come up with, they do so without any real verbal reasoning. To make progress, the field needs to re-evaluate its resource development methodology and provide clearer definitions for its tasks.

- Rogers, A., Gardner, M., and Augenstein, I. QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension. Under review. Preprint:

- Rogers, A., & Rumshisky, A. (2020). A guide to the dataset explosion in QA, NLI, and commonsense reasoning. Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts, 27–32.

- Three ongoing projects in collaboration with DIKU and the Allen Institute for AI.




Internal researchers

Name, Title, Phone
Anna Rogers, Assistant Professor, +45 35 32 65 48
Terne Sasha Thorn Jakobsen, Research Assistant

Funded by:

Copenhagen Centre for Social Data Science (SODAS)

Full project name:
Better data for NLP: data collection and annotation methodology


Anna Rogers
Social Data Science
Phone: +45 35 32 65 48

External researchers:

Name, Title, Phone
Isabelle Augenstein, Associate Professor, Department of Computer Science, UCPH, +45 93 56 59 19
Matt Gardner, Senior Research Scientist, Allen Institute for Artificial Intelligence
Marzena Karpinska, Postdoctoral Research Associate, UMass Amherst
Anna Rumshisky, Associate Professor, UMass Lowell, +1 978 934 3619
Anders Søgaard, Professor, Department of Computer Science, UCPH, +45 35 32 90 65