Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees
dc.contributor.author | Bordonaba Plou, David | |
dc.contributor.author | Jreis-Navarro, Laila M. | |
dc.date.accessioned | 2025-03-13T08:25:53Z | |
dc.date.available | 2025-03-13T08:25:53Z | |
dc.date.issued | 2025 | |
dc.description | This work was carried out within the framework of the "Clarisel" Research Group, with financial support from the Departamento de Ciencia, Tecnología y Universidad of the Gobierno de Aragón and the European Social Fund. | |
dc.description.abstract | Corpus linguistics is an essential tool in digital humanities, and multilingual corpora are valuable resources in cross-linguistic studies. In this article we address the multilingual layout of the TenTen corpus family, questioning the rationale for calling it a family and advancing the idea of different degrees of kinship among its language members. The analysis compares the performance of the Sketch Engine Word Sketch tool on the English Web 2020 corpus (enTenTen20) with its performance on two Arabic corpora: the latest release, the Arabic Web 2018 corpus (arTenTen18), which was processed with CAMeL Tools, an Arabic-specific toolkit, and its previous version, arTenTen12, which was tagged with Stanford CoreNLP. The study shows the challenges posed by the platform tools and the tagged corpora, namely the dissimilarities between the available data and the uneven reliability of the tools' results for the two languages, as well as the efforts made to tackle those challenges. The concluding remarks point to the need for a better definition of multilingualism in the TenTen corpora and, by extension, in the digital humanities as a whole, based on the structural design of the resources and tools meant to serve such theoretical aspirations. | |
dc.description.department | Depto. de Lógica y Filosofía Teórica | |
dc.description.faculty | Fac. de Filosofía | |
dc.description.refereed | TRUE | |
dc.description.sponsorship | Ministerio de Ciencia, Innovación y Universidades (España) | |
dc.description.sponsorship | Gobierno de Aragón | |
dc.description.sponsorship | European Commission | |
dc.description.status | pub | |
dc.identifier.citation | Bordonaba-Plou, D. and Jreis-Navarro, L. M. (2025). "Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees", International Journal of Humanities and Arts Computing, 19(1), 49-64. https://doi.org/10.3366/ijhac.2025.0344 | |
dc.identifier.doi | 10.3366/ijhac.2025.0344 | |
dc.identifier.essn | 1755-1706 | |
dc.identifier.issn | 1753-8548 | |
dc.identifier.officialurl | https://doi.org/10.3366/ijhac.2025.0344 | |
dc.identifier.relatedurl | https://www.euppublishing.com/doi/abs/10.3366/ijhac.2025.0344 | |
dc.identifier.uri | https://hdl.handle.net/20.500.14352/118738 | |
dc.issue.number | 1 | |
dc.journal.title | International Journal of Humanities and Arts Computing | |
dc.language.iso | eng | |
dc.page.final | 64 | |
dc.page.initial | 49 | |
dc.publisher | Edinburgh University Press | |
dc.relation.projectID | info:eu-repo/grantAgreement/MICIU/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/PID2023-150396OA-I00/ES/INTUICIONES Y FILOSOFIA EXPERIMENTAL DEL LENGUAJE/IFEL | |
dc.relation.projectID | info:eu-repo/grantAgreement/MICIN/Plan Estatal de Investigación Científica y Técnica y de Innovación 2021-2023/ES/TRANSFORMACIONES DEL ESPACIO MAGREBI EN PERSPECTIVA HISTORICA/TRAMAGHIS | |
dc.rights.accessRights | embargoed access | |
dc.subject.cdu | 81 | |
dc.subject.cdu | 82:001.8 | |
dc.subject.keyword | Corpus linguistics | |
dc.subject.keyword | Cross-linguistics | |
dc.subject.keyword | Arabic language | |
dc.subject.keyword | Part-of-speech tagging | |
dc.subject.keyword | Linguistic injustice | |
dc.subject.ucm | Humanidades | |
dc.subject.unesco | 57 Lingüística | |
dc.title | Are the TenTen corpora really a corpus family? On linguistic tagging and corpora members’ kinship degrees | |
dc.type | journal article | |
dc.type.hasVersion | AM | |
dc.volume.number | 19 | |
dspace.entity.type | Publication | |
relation.isAuthorOfPublication | 5f5cd501-3e2c-47fc-8383-5a867a43724c | |
relation.isAuthorOfPublication.latestForDiscovery | 5f5cd501-3e2c-47fc-8383-5a867a43724c |