RT Generic
T1 Selective Data Editing of Continuous Variables with Random Forests in Official Statistics
A1 Bohnensteffen, Sarah
AB Technological advances and new demands due to economic and socio-cultural changes regularly challenge the National Statistical Institutes to adapt to their evolving environment. The application of machine learning methods as important and promising tools for official statistics are discussed in the context of these changes, in the context of opportunities arising from new digital data sources, and considering the difficult task of having to balance a variety of quality requirements at national and international level. Selective statistical data editing is an approach to detect influential units and select them for manual follow up in order to make the process more efficient. In this thesis, a simple and a two-step approach are developed to apply random forests to selective editing of continuous variables in the context of short-term business survey data. We present a score function based on decision forest models which allows for an efficient selection of units relevant for the estimation of the final estimates. The approach is found to be applicable also at the disaggregated levels of the autonomous communities and economic branches.
AB El avance tecnológico y nuevas demandas debidas a cambios económicos y socioculturales desafían regularmente a los Institutos Nacionales de Estadística a adaptarse a su entorno en constante evolución. La aplicación de métodos de aprendizaje automático como instrumentos importantes y prometedores para las estadísticas oficiales se analizan en el contexto de esos cambios, en el contexto de las oportunidades que surgen de nuevas fuentes de datos digitales, y teniendo en cuenta la difícil tarea de tener que equilibrar una variedad de requisitos de calidad a nivel nacional e internacional. La depuración selectiva es un conjunto de técnicas para detectar unidades influyentes y seleccionarlas para el seguimiento manual a fin de hacer el proceso más eficiente. En este trabajo se desarrolla un enfoque simple y uno en dos etapas para aplicar los bosques aleatorios a la depuración selectiva de variables continuas en el contexto de datos de encuestas económicas coyunturales. Se presenta una función de puntuación basada en modelos de bosques aleatorios que permite una selección eficiente de unidades relevantes para la estimación de los agregados finales. El enfoque desarrollado también es aplicable a los niveles desagregados de las comunidades autónomas y ramas de negocio para los datos usados.
YR 2020
FD 2020
LK https://hdl.handle.net/20.500.14352/9156
UL https://hdl.handle.net/20.500.14352/9156
LA eng
DS Docta Complutense
RD 23 ago 2025