Better Data for NLP
Modern natural language processing is not only about the models, but also about the data. The decisions we make in collecting, curating, pre-processing, and annotating data have a significant effect on the final systems, and we have a long way to go.
THE PROJECT IS COMPLETED
The projects in this area are broadly concerned with collecting data for training NLP models and creating benchmarks that can reliably estimate their capabilities. This includes such aspects of data collection as sampling, representativeness, and annotation practices.
The wave of deep learning in NLP coincided with the growing popularity of large crowdsourced datasets, but the limitations of this approach are becoming increasingly clear.
- An ongoing project in collaboration with the University of Massachusetts, dedicated to the development of resources for temporal relation extraction.
Prior relevant work by the current SODAS staff:
Rogers, A., Kovaleva, O., Downey, M., & Rumshisky, A. (2020). Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks. Proceedings of the AAAI Conference on Artificial Intelligence, 8722–8731. https://aaai.org/ojs/index.php/AAAI/article/view/6398
Rogers, A., Smelkov, G., & Rumshisky, A. (2019). NarrativeTime: Dense High-Speed Temporal Annotation on a Timeline. ArXiv:1908.11443 [Cs]. http://arxiv.org/abs/1908.11443
Rogers, A., Romanov, A., Rumshisky, A., Volkova, S., Gronas, M., & Gribov, A. (2018). RuSentiment: An Enriched Sentiment Analysis Dataset for Social Media in Russian. Proceedings of the 27th International Conference on Computational Linguistics, 755–763. http://aclweb.org/anthology/C18-1064
Karpinska, M., Li, B., Rogers, A., & Drozd, A. (2018). Subcharacter Information in Japanese Embeddings: When Is It Worth It? Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP, 28–37. http://aclweb.org/anthology/W18-2905
Gladkova, A., Drozd, A., & Matsuoka, S. (2016). Analogy-based detection of morphological and semantic relations with word embeddings: What works and what doesn’t. Proceedings of the NAACL-HLT SRW, 47–54. https://doi.org/10.18653/v1/N16-2002
It is increasingly clear that while the current deep learning systems can solve any supervised learning NLP task that researchers have so far come up with, this is done without any real verbal reasoning. To make progress, the field needs to re-evaluate the role of resource development methodology, and provide clearer definitions for its tasks.
- Rogers, A., Gardner, M., & Augenstein, I. (2023). QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension. ACM Computing Surveys 55(10), 1-45. DOI: https://dl.acm.org/doi/10.1145/3560260.
- Jakobsen, S. T. T., Barrett, M., Søgaard, A., & Lassen, D. (2022). The Sensitivity of Annotator Bias to Task Definitions in Argument Mining. Proceedings of the 16th Linguistic Annotation Workshop (LAW-XVI) within LREC2022, 44–61, Marseille, France. European Language Resources Association.
- Rogers, A., & Rumshisky, A. (2020). A guide to the dataset explosion in QA, NLI, and commonsense reasoning. Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts, 27–32. https://doi.org/10.18653/v1/2020.coling-tutorials.5
- Three ongoing projects in collaboration with DIKU and Allen Institute for AI
Funded by:
Copenhagen Centre for Social Data Science (SODAS)
Full project name:
Better data for NLP: data collection and annotation methodology
Contact
Anna Rogers
Postdoc
SODAS
External researchers:
Name | Title | Phone |
---|---|---|
Isabelle Augenstein | Associate professor, Department of Computer Science, UCPH | +45 93 56 59 19 |
Matt Gardner | Senior research scientist, Allen Institute for Artificial Intelligence | |
Marzena Karpinska | Postdoctoral research associate, UMASS Amherst | |
Anna Rumshisky | Associate professor, UMASS Lowell | +978-934-3619 |
Anders Søgaard | Professor, Department of Computer Science, UCPH | +45 35 32 90 65 |