Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees

dc.contributor.author: Bordonaba Plou, David
dc.contributor.author: Jreis-Navarro, Laila M.
dc.date.accessioned: 2025-03-13T08:25:53Z
dc.date.available: 2025-03-13T08:25:53Z
dc.date.issued: 2025
dc.description: This work was carried out within the "Clarisel" Research Group, with financial support from the Departamento de Ciencia, Tecnología y Universidad of the Gobierno de Aragón and the European Social Fund.
dc.description.abstract: Corpus linguistics is an essential tool in digital humanities, and multilingual corpora are valuable resources in cross-linguistic studies. In this article we address the multilingual layout of the TenTen corpus family, questioning the rationale for calling it a family, and advancing the idea of different degrees of kinship for its language members. The analysis focuses on the performance of the Sketch Engine Word Sketch tool in the English Web 2020 corpus (enTenTen20) in comparison with the latest release of arTenTen, the Arabic Web 2018 corpus (arTenTen18), which has been processed by CAMeL Tools, an Arabic-specific software, and its previous version, arTenTen12, tagged with Stanford CoreNLP. The study shows the challenges posed by the platform tools and the tagged corpora regarding the dissimilarities between the available data and the reliability of the results of these tools for both languages, as well as the efforts made to tackle the challenges. The concluding remarks point to the need for a better definition of multilingualism in the TenTen corpora and, by extension, in the digital humanities as a whole, based on the structural design of the resources and tools meant for such theoretical aspirations.
dc.description.department: Depto. de Lógica y Filosofía Teórica
dc.description.faculty: Fac. de Filosofía
dc.description.refereed: TRUE
dc.description.sponsorship: Ministerio de Ciencia, Innovación y Universidades (España)
dc.description.sponsorship: Gobierno de Aragón
dc.description.sponsorship: European Commission
dc.description.status: pub
dc.identifier.citation: Bordonaba-Plou, D. and Jreis-Navarro, L. M. (2025). "Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees", International Journal of Humanities and Arts Computing, 19(1), 49-64. https://doi.org/10.3366/ijhac.2025.0344.
dc.identifier.doi: 10.3366/ijhac.2025.0344
dc.identifier.essn: 1755-1706
dc.identifier.issn: 1753-8548
dc.identifier.officialurl: https://doi.org/10.3366/ijhac.2025.0344
dc.identifier.relatedurl: https://www.euppublishing.com/doi/abs/10.3366/ijhac.2025.0344
dc.identifier.uri: https://hdl.handle.net/20.500.14352/118738
dc.issue.number: 1
dc.journal.title: International Journal of Humanities and Arts Computing
dc.language.iso: eng
dc.page.final: 64
dc.page.initial: 49
dc.publisher: Edinburgh University Press
dc.relation.projectID: info:eu-repo/grantAgreement/MICIU/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2023-150396OA-I00/ES/INTUICIONES Y FILOSOFIA EXPERIMENTAL DEL LENGUAJE/IFEL
dc.relation.projectID: info:eu-repo/grantAgreement/MICIN/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/ES/TRANSFORMACIONES DEL ESPACIO MAGREBI EN PERSPECTIVA HISTORICA/TRAMAGHIS
dc.rights.accessRights: embargoed access
dc.subject.cdu: 81
dc.subject.cdu: 82:001.8
dc.subject.keyword: Corpus linguistics
dc.subject.keyword: Cross-linguistics
dc.subject.keyword: Arabic language
dc.subject.keyword: Part-of-speech tagging
dc.subject.keyword: Linguistic injustice
dc.subject.ucm: Humanidades
dc.subject.unesco: 57 Lingüística
dc.title: Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees
dc.type: journal article
dc.type.hasVersion: AM
dc.volume.number: 19
dspace.entity.type: Publication

Original bundle

Name: Are the TenTen__Final_Clean.pdf
Size: 555.8 KB
Format: Adobe Portable Document Format
