Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees

Publication date

2025

Publisher

Edinburgh University Press

Citation

Bordonaba-Plou, D. and Jreis-Navarro, L. M. (2025). "Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees", International Journal of Humanities and Arts Computing, 19(1), 49-64. https://doi.org/10.3366/ijhac.2025.0344.

Abstract

Corpus linguistics is an essential tool in digital humanities, and multilingual corpora are valuable resources in cross-linguistic studies. In this article we address the multilingual layout of the TenTen corpus family, questioning the rationale for calling it a family and advancing the idea of different degrees of kinship for its language members. The analysis focuses on the performance of the Sketch Engine Word Sketch tool in the English Web 2020 corpus (enTenTen20) in comparison with the latest release of the arTenTen, the Arabic Web 2018 corpus (arTenTen18), which has been processed with CAMeL Tools, an Arabic-specific software package, and with its previous version, the arTenTen12, tagged with Stanford CoreNLP. The study shows the challenges posed by the platform tools and the tagged corpora regarding the dissimilarities between the available data and the reliability of the results of these tools for both languages, as well as the efforts made to tackle those challenges. The concluding remarks point to the need for a better definition of multilingualism in the TenTen corpora and, by extension, in the digital humanities as a whole, based on the structural design of the resources and tools meant for such theoretical aspirations.

Description

This work was carried out within the framework of the "Clarisel" Research Group, with financial support from the Department of Science, Technology and University of the Government of Aragon and the European Social Fund.
