Imputación de datos mediante Random Forest (Data Imputation Using Random Forest)
Publication date
2021
Abstract
The amount of available information keeps growing, and official statistical institutes must make use of it to create innovative and effective processes. Statistical learning is the set of techniques used to better understand data. Random forests, based on an ensemble of decision trees, are among the most widely used supervised learning techniques. In this work, random forests have been used to impute data in short-term economic surveys, specifically in the Industrial Turnover Indices (Índices de Cifras de Negocios de la Industria). Imputation is the process by which a value is assigned to an item for which no information was previously available. This study develops the imputation methodology after analysing the quality criteria required to produce an official statistic. First, the most relevant variables for computing turnover figures are selected (feature selection). Next, the parameter selection process is carried out to obtain the optimal random forest model for the selected dataset. Finally, the random forest is applied in practice to produce the imputations, which are evaluated with satisfactory results.
The amount of information available to National Statistical Institutes is increasing rapidly, and they must make use of it to develop innovative and effective processes. Statistical learning is the set of techniques used to better understand data. Random forests, based on ensembles of decision trees, are among the most widely used supervised learning techniques. In this thesis, random forests have been used to impute data in short-term business statistics. Imputation is defined as the method of assigning a value to an item that was previously missing. In this study, a new methodology is developed after analysing the quality requirements for official statistics. First, feature selection is carried out to obtain the set of variables that will be included in the model. Next, the forests are tuned to obtain the optimal forest. Finally, this model is used to impute the missing values, and the accuracy of the estimates is assessed, with satisfactory results.
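The three steps named in the abstract (feature selection, forest tuning, imputation and its assessment) can be illustrated with a minimal sketch. This record does not state which software, variables, or selection criteria the thesis used, so everything below is an assumption for illustration only: scikit-learn as the library, synthetic data in place of the survey microdata, impurity-based importances as the feature-selection criterion, a small hyperparameter grid, and a hypothetical target named `turnover`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for survey data (hypothetical): 300 units, 6 candidate
# predictors, and a target "turnover" that depends only on the first two.
n_units, n_features = 300, 6
X = rng.normal(size=(n_units, n_features))
turnover = 5.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=n_units)

# Mark roughly 20% of the target values as missing (simulated non-response).
missing = rng.random(n_units) < 0.2
observed = ~missing

# Step 1: feature selection, here by ranking the impurity-based importances
# of a preliminary forest and keeping the top two predictors (an assumption).
prelim = RandomForestRegressor(n_estimators=100, random_state=0)
prelim.fit(X[observed], turnover[observed])
selected = np.argsort(prelim.feature_importances_)[::-1][:2]

# Step 2: tune a small grid of forest hyperparameters by cross-validation.
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid={"max_features": [1, 2], "min_samples_leaf": [1, 5]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X[observed][:, selected], turnover[observed])

# Step 3: impute the missing targets with the tuned forest and compare with
# the true values, which are known here only because the data are simulated.
imputed = grid.predict(X[missing][:, selected])
rmse_forest = np.sqrt(np.mean((imputed - turnover[missing]) ** 2))
rmse_mean = np.sqrt(np.mean((turnover[observed].mean() - turnover[missing]) ** 2))
print(f"RMSE forest: {rmse_forest:.2f}, RMSE mean imputation: {rmse_mean:.2f}")
```

Comparing against plain mean imputation gives a simple baseline for judging whether the forest adds value. An evaluation for official statistics would presumably rely on the quality criteria discussed in the thesis rather than on a single error measure.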
Description
Grade: 10