Better Data for NLP

Modern natural language processing is not only about the models, but also about the data. The decisions we make in collecting, curating, pre-processing, and annotating data have a significant effect on the final systems, and we have a long way to go.

THE PROJECT IS COMPLETED

The projects in this area broadly concern the collection of data used for training NLP models, and the creation of benchmarks that can reliably estimate their capabilities. This includes such aspects of data collection as sampling, representativeness, and annotation practices.

The wave of deep learning in NLP came together with the increasing popularity of large crowdsourced datasets, but the limitations of this approach are becoming increasingly clear.

- An ongoing project in collaboration with the University of Massachusetts, dedicated to the development of resources for temporal relation extraction.

Prior relevant work by the current SODAS staff:

Rogers, A., Kovaleva, O., Downey, M., & Rumshisky, A. (2020). Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. Proceedings of the  AAAI Conference on Artificial Intelligence, 8722–8731. https://aaai.org/ojs/index.php/AAAI/article/view/6398

Rogers, A., Smelkov, G., & Rumshisky, A. (2019). NarrativeTime: Dense High-Speed Temporal Annotation on a Timeline. ArXiv:1908.11443 [Cs]. http://arxiv.org/abs/1908.11443

Rogers, A., Romanov, A., Rumshisky, A., Volkova, S., Gronas, M., & Gribov, A. (2018). RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian. Proceedings of the 27th International Conference on Computational Linguistics, 755–763. http://aclweb.org/anthology/C18-1064

Karpinska, M., Li, B., Rogers, A., & Drozd, A. (2018). Subcharacter Information in Japanese Embeddings: When Is It Worth It? Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, 28–37. http://aclweb.org/anthology/W18-2905

Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn’t. Proceedings of the NAACL-HLT SRW, 47–54. https://doi.org/10.18653/v1/N16-2002


It is increasingly clear that while current deep learning systems can solve any supervised learning NLP task that researchers have so far come up with, they do so without any real verbal reasoning. To make progress, the field needs to re-evaluate its resource development methodology and provide clearer definitions for its tasks.

- Rogers, A., Gardner, M., & Augenstein, I. (2023). QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension. ACM Computing Surveys, 55(10), 1–45. https://doi.org/10.1145/3560260

- Jakobsen, S. T. T., Barrett, M., Søgaard, A., & Lassen, D. (2022). The Sensitivity of Annotator Bias to Task Definitions in Argument Mining. Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC 2022, 44–61, Marseille, France. European Language Resources Association.

- Rogers, A., & Rumshisky, A. (2020). A guide to the dataset explosion in QA, NLI, and commonsense reasoning. Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts, 27–32. https://doi.org/10.18653/v1/2020.coling-tutorials.5

- Three ongoing projects in collaboration with DIKU and the Allen Institute for AI


Funded by:

Copenhagen Centre for Social Data Science (SODAS)

Full project name:
Better data for NLP: data collection and annotation methodology

Contact

Anna Rogers
Postdoc
SODAS

External researchers:

Isabelle Augenstein, Associate Professor, Department of Computer Science, UCPH, +45 93 56 59 19
Matt Gardner, Senior Research Scientist, Allen Institute for Artificial Intelligence
Marzena Karpinska, Postdoctoral Research Associate, UMass Amherst
Anna Rumshisky, Associate Professor, UMass Lowell, +1 978-934-3619
Anders Søgaard, Professor, Department of Computer Science, UCPH, +45 35 32 90 65