Selective Data Editing of Continuous Variables with Random Forests in Official Statistics

Technological advances and new demands arising from economic and socio-cultural changes regularly challenge National Statistical Institutes to adapt to their evolving environment. The application of machine learning methods as important and promising tools for official statistics is discussed in the context of these changes, of the opportunities arising from new digital data sources, and of the difficult task of balancing a variety of quality requirements at national and international level. Selective statistical data editing is an approach that detects influential units and selects them for manual follow-up in order to make the editing process more efficient. In this thesis, a simple approach and a two-step approach are developed to apply random forests to the selective editing of continuous variables in short-term business survey data. We present a score function based on random forest models that allows an efficient selection of the units relevant for the estimation of the final aggregates. The approach is also found to be applicable at the disaggregated levels of the autonomous communities and economic branches.
Technological progress and new demands arising from economic and socio-cultural changes regularly challenge National Statistical Institutes to adapt to their constantly evolving environment. The application of machine learning methods as important and promising instruments for official statistics is analysed in the context of these changes, of the opportunities arising from new digital data sources, and of the difficult task of balancing a variety of quality requirements at national and international level. Selective editing is a set of techniques for detecting influential units and selecting them for manual follow-up in order to make the process more efficient. In this work, a simple approach and a two-step approach are developed to apply random forests to the selective editing of continuous variables in short-term business survey data. A score function based on random forest models is presented that allows an efficient selection of the units relevant for the estimation of the final aggregates. For the data used, the approach developed is also applicable at the disaggregated levels of the autonomous communities and business branches.
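As an illustration of the general idea of a prediction-based score function, the following minimal Python sketch ranks units by how strongly their reported value deviates from a model prediction, weighted by the sampling weight and scaled by the predicted total. It is a generic textbook-style score, not the score function developed in the thesis. The predictions are taken as given (they could come, for example, from a random forest regressor), and the `Unit` fields, function names, and threshold value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    unit_id: str
    weight: float     # sampling weight (hypothetical field)
    reported: float   # value reported by the respondent
    predicted: float  # model prediction, assumed to be supplied externally


def local_scores(units):
    """Weighted absolute deviation from the prediction, relative to the predicted total."""
    predicted_total = sum(u.weight * u.predicted for u in units)
    if predicted_total == 0:
        raise ValueError("predicted total is zero; scores are undefined")
    return {
        u.unit_id: u.weight * abs(u.reported - u.predicted) / abs(predicted_total)
        for u in units
    }


def select_for_review(units, threshold):
    """Return the ids of units whose score exceeds the threshold, highest score first."""
    scores = local_scores(units)
    flagged = [uid for uid, score in scores.items() if score > threshold]
    return sorted(flagged, key=lambda uid: scores[uid], reverse=True)


# Toy data, invented for illustration only.
units = [
    Unit("A", 10.0, 120.0, 100.0),
    Unit("B", 10.0, 101.0, 100.0),
    Unit("C", 5.0, 900.0, 200.0),
]
selected = select_for_review(units, threshold=0.05)
```

In this toy example, units whose weighted deviation is large relative to the predicted total are flagged for manual follow-up, while units close to their prediction pass through without review. The threshold controls the trade-off between editing workload and the residual error left in the aggregate.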