Tool for generating synthetic datasets with automatic anonymization features (Herramienta de generación de datasets sintéticos con funcionalidades de anonimización automática)
Publication date
2025
Abstract
Regulations such as the General Data Protection Regulation have driven efforts to preserve user privacy, and tools have begun to emerge that anonymize data automatically using Natural Language Processing (NLP), specifically Named Entity Recognition (NER), and Automatic Speech Recognition. Transformer-based models have significantly advanced this practice. However, for anonymization in Spanish, there is a lack of datasets with which to train these models.
This Bachelor's Thesis proposes a NER-based tool that generates such datasets synthetically, using Ensemble Deep Learning. To this end, Longformer is used to anonymize entities recognized by a Spanish NER module. Several pretrained models based on RoBERTa and Longformer are also studied for this purpose, evaluating which is best suited to the task. In addition, the performance of anonymization models is compared across different tagging systems.
The results show that certain variants of RoBERTa and Longformer perform well on NER tasks. For anonymization, Longformer has shown strong performance when the task is carried out exclusively on entities, which greatly simplifies the NLP task.
Finally, a tool is proposed that enables the creation of synthetic datasets for anonymization in Spanish, showing promising performance despite the significant scarcity of datasets for this task.
Description
Bachelor's Thesis (Trabajo de Fin de Grado) for the Grado en Ingeniería Informática, Facultad de Informática, UCM, Departamento de Ingeniería del Software e Inteligencia Artificial, academic year 2024/2025












