RT Generic
T1 Selective Data Editing of Continuous Variables with Random Forests in Official Statistics
A1 Bohnensteffen, Sarah
AB Technological advances and new demands arising from economic and socio-cultural changes regularly challenge National Statistical Institutes to adapt to their evolving environment. The application of machine learning methods as important and promising tools for official statistics is discussed in the context of these changes, of the opportunities arising from new digital data sources, and of the difficult task of balancing a variety of quality requirements at the national and international levels. Selective statistical data editing is an approach that detects influential units and selects them for manual follow-up in order to make the editing process more efficient. In this thesis, a simple and a two-step approach are developed to apply random forests to the selective editing of continuous variables in the context of short-term business survey data. We present a score function based on decision forest models that allows for an efficient selection of the units relevant to the estimation of the final aggregates. The approach is also found to be applicable at the disaggregated levels of the autonomous communities and economic branches.
AB Technological progress and new demands due to economic and socio-cultural changes regularly challenge National Statistical Institutes to adapt to their constantly evolving environment. The application of machine learning methods as important and promising instruments for official statistics is analysed in the context of those changes, in the context of the opportunities arising from new digital data sources, and taking into account the difficult task of having to balance a variety of quality requirements at the national and international levels.
Selective editing is a set of techniques for detecting influential units and selecting them for manual follow-up in order to make the process more efficient. In this work, a simple approach and a two-step approach are developed to apply random forests to the selective editing of continuous variables in the context of short-term business survey data. A score function based on random forest models is presented that allows an efficient selection of the units relevant to the estimation of the final aggregates. For the data used, the approach developed is also applicable at the disaggregated levels of the autonomous communities and economic branches.
YR 2020
FD 2020
LK https://hdl.handle.net/20.500.14352/9156
UL https://hdl.handle.net/20.500.14352/9156
LA eng
DS Docta Complutense
RD 3 May 2024