Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees

Bordonaba Plou, David; Jreis-Navarro, Laila M.

doi:10.3366/ijhac.2025.0344

Official URL

https://doi.org/10.3366/ijhac.2025.0344

Publication date

2025

Authors

Bordonaba Plou, David

Jreis-Navarro, Laila M.

Publisher

Edinburgh University Press

URI

https://hdl.handle.net/20.500.14352/118738

Citation

Bordonaba-Plou, D. and Jreis-Navarro, L. M. (2025). "Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees", International Journal of Humanities and Arts Computing, 19(1), 49-64. https://doi.org/10.3366/ijhac.2025.0344.

Abstract

Corpus linguistics is an essential tool in digital humanities, and multilingual corpora are valuable resources in cross-linguistic studies. In this article we address the multilingual layout of the TenTen corpus family, questioning the rationale for calling it a family and advancing the idea of different degrees of kinship for its language members. The analysis focuses on the performance of the Sketch Engine Word Sketch tool in the English Web 2020 corpus (enTenTen20) in comparison with the latest release of the arTenTen, the Arabic Web 2018 corpus (arTenTen18), which has been processed with CAMeL Tools, an Arabic-specific toolkit, and its previous version, the arTenTen12, tagged with Stanford CoreNLP. The study shows the challenges posed by the platform tools and the tagged corpora regarding the dissimilarities between the available data and the reliability of the results of these tools for both languages, as well as the efforts made to tackle the challenges. The concluding remarks point to the need for a better definition of multilingualism in the TenTen corpora and, by extension, in the digital humanities as a whole, based on the structural design of the resources and tools meant for such theoretical aspirations.

Description

This article will be open access from 1 March 2026. This work was carried out within the framework of the "Clarisel" Research Group, with the financial support of the Department of Science, Technology and University of the Government of Aragon and the European Social Fund.

UCM subjects

Humanities

Unesco subjects

57 Linguistics

Collections

Articles
