Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees
Publication date
2025
Publisher
Edinburgh University Press
Citation
Bordonaba-Plou, D. and Jreis-Navarro, L. M. (2025). "Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees", International Journal of Humanities and Arts Computing, 19(1), 49-64. https://doi.org/10.3366/ijhac.2025.0344.
Abstract
Corpus linguistics is an essential tool in digital humanities, and multilingual corpora are valuable resources in cross-linguistic studies. In this article, we address the multilingual layout of the TenTen corpus family, questioning the rationale for calling it a family and advancing the idea of different degrees of kinship for its language members. The analysis compares the performance of the Sketch Engine Word Sketch tool in the English Web 2020 corpus (enTenTen20) with its performance in the latest release of arTenTen, the Arabic Web 2018 corpus (arTenTen18), which was processed with CAMeL Tools, an Arabic-specific toolkit, and in its previous version, arTenTen12, tagged with Stanford CoreNLP. The study shows the challenges posed by the platform tools and the tagged corpora regarding the dissimilarities between the available data and the reliability of the results of these tools for both languages, as well as the efforts made to tackle them. The concluding remarks point to the need for a better definition of multilingualism in the TenTen corpora and, by extension, in the digital humanities as a whole, based on the structural design of the resources and tools meant for such theoretical aspirations.
Description
This work was carried out within the framework of the "Clarisel" Research Group, with financial support from the Department of Science, Technology and University of the Government of Aragon and the European Social Fund.