RT Journal Article
T1 Automatic regrouping of strata in the goodness-of-fit chi-square test
A1 Núñez-Antón, Vicente
A1 Pérez Salamero González, Juan Manuel
A1 Regúlez Castillo, Marta
A1 Ventura-Marco, Manuel
A1 Vidal-Meliá, Carlos
AB Pearson’s chi-square test is widely employed in social and health sciences to analyse categorical data and contingency tables. For the test to be valid, the sample size must be large enough to provide a minimum number of expected elements per category. This paper develops functions for regrouping strata automatically, thus enabling the goodness-of-fit test to be performed within an iterative procedure. The usefulness and performance of these functions are illustrated by means of a simulation study and the application to different datasets. Finally, the iterative use of the functions is applied to the Continuous Sample of Working Lives, a dataset that has been used in a considerable number of studies, especially on labour economics and the Spanish public pension system.
PB Institut d'Estadística de Catalunya (Idescat)
SN 2013-8830
YR 2019
FD 2019
LK https://hdl.handle.net/20.500.14352/140.1
UL https://hdl.handle.net/20.500.14352/140.1
LA eng
NO Agresti, A. (2002). Categorical Data Analysis (2nd edition). Wiley, New York. Bartholomew, D.J. and Tzamourani, P. (1999). The goodness-of-fit of latent trait models in attitude measurement. Sociological Methods and Research, 27, 525–546. Bartholomew, D.J., Knott, M. and Moustaki, I. (2011). Latent Variable Models and Factor Analysis (3rd edition). Wiley, New York. Bishop, Y.M.M., Fienberg, S.E. and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge. Bosgiraud, J. (2006). Sur le regroupement des classes dans le test du Khi-2. Revue Roumaine de Mathématiques Pures et Appliquées, 51, 167–172. Cai, L., Maydeu-Olivares, A., Coffman, D.L. and Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse 2^p tables.
British Journal of Mathematical and Statistical Psychology, 59, 173–194. Campbell, I. (2007). Chi-squared and Fisher-Irwin tests of two-by-two tables with small sample recommendations. Statistics in Medicine, 26, 3661–3675. Cochran, W.G. (1952). The χ² test of goodness-of-fit. The Annals of Mathematical Statistics, 23, 315–345. Collins, L.M., Fidler, P.L., Wugalter, S.E. and Long, J. (1993). Goodness-of-fit testing for latent class models. Multivariate Behavioral Research, 28, 375–389. Delucchi, K.L. (1983). The use and misuse of chi-square: Lewis and Burke revisited. Psychological Bulletin, 94, 166–176. DGOSS (2014). Muestra Continua de Vidas Laborales 2013. Secretaría de Estado de la Seguridad Social. Dirección General de Ordenación (DGOSS). Ministerio de Trabajo e Inmigración. Madrid, Spain. Fienberg, S.E. (2006). Log-linear models in contingency tables. In Encyclopedia of Statistical Sciences. Wiley, New York. Fisher, R.A. (1935). The logic of inductive inference. Journal of the Royal Statistical Society, 98, 39–54. García Pérez, M.A. and Núñez-Antón, V. (2009). Accuracy of power-divergence statistics for testing independence and homogeneity in two-way contingency tables. Communications in Statistics - Simulation and Computation, 38, 503–512. Goodman, L.A. (1974). Exploratory latent structures analysis using both identifiable and unidentifiable models. Biometrika, 61, 215–231. Grafström, A. and Schelin, L. (2014). How to select representative samples. Scandinavian Journal of Statistics, 41, 277–290. Haviland, M.G. (1990). Yates's correction for continuity and the analysis of 2 × 2 contingency-tables. Statistics in Medicine, 9, 363–367. Hirji, K.F. (2006). Exact Analysis of Discrete Data. Chapman and Hall, Boca Raton. Hosmer, D.W., Hosmer, T., Le Cessie, S. and Lemeshow, S. (1997). A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 16, 965–980. Hosmer, D.W. and Lemeshow, S. (2000). Applied Logistic Regression.
Wiley, New York. INSS (2014). Informe Estadístico 2013. Secretaría de Estado de Seguridad Social. Ministerio de Empleo y Seguridad Social, MESS. Madrid, Spain. Keeling, K.B. and Pavur, R.J. (2011). Statistical accuracy of spreadsheet software. The American Statistician, 65, 265–273. Khan, H.A. (2003). A visual basic software for computing Fisher's exact probability. Journal of Statistical Software, 8, 1–7. Kroonenberg, P.M. and Verbeek, A. (2018). The tale of Cochran's rule: my contingency table has so many expected values smaller than 5, what am I to do? The American Statistician, 72, 175–183. Kruskal, W. and Mosteller, F. (1979a). Representative sampling, I. International Statistical Review, 47, 13–24. Kruskal, W. and Mosteller, F. (1979b). Representative sampling, II: scientific literature, excluding statistics. International Statistical Review, 47, 111–127. Kruskal, W. and Mosteller, F. (1979c). Representative sampling, III: the current statistical literature. International Statistical Review, 47, 245–265. Kruskal, W. and Mosteller, F. (1980). Representative sampling, IV: The History of the Concept in Statistics, 1895-1939. International Statistical Review, 48, 169–195. Larose, D.T. and Larose, C.D. (2014). Discovering Knowledge in Data: An Introduction to Data Mining. Wiley, New York. Lazarsfeld, P.F. and Henry, N.W. (1968). Latent Structure Analysis. Houghton Mifflin, Boston. Lewis, D. and Burke, C.J. (1949). The use and misuse of chi-square. Psychological Bulletin, 46, 433–489. Lin, J.J., Chang, C.H. and Pal, N. (2015). A revisit to contingency table and tests of independence: bootstrap is preferred to chi-square approximations as well as Fisher’s exact test. Journal of Biopharmaceutical Statistics, 25, 438–458. Lydersen, S., Fagerland, M.W. and Laake, P. (2009). Tutorial in biostatistics. Recommended tests for association in 2x2 tables. Statistics in Medicine, 28, 1159–1175. Marsaglia, G. (2003). Random number generators.
Journal of Modern Applied Statistical Methods, 2, 2–13. McCullough, B.D. (2000). The accuracy of Mathematica 4 as a statistical package. Computational Statistics, 15, 279–299. McCullough, B.D. (2008). Special section on Microsoft Excel 2007. Computational Statistics and Data Analysis, 52, 4568–4569. Mehta, C.R. and Patel, N.R. (1983). A network algorithm for performing Fisher’s exact test in r×c contingency tables. Journal of the American Statistical Association, 78, 427–434. MESS (2017). La Muestra Continua de Vidas Laborales. Guía del contenido. Estadísticas, Presupuestos y Estudios. Estadísticas. Secretaría de Estado de Seguridad Social. Ministerio de Empleo y Seguridad Social, MESS. Madrid, Spain. Moore, D.S. (1986). Tests of chi-squared type. In Goodness-of-fit Techniques (R. D’Agostino and M. Stephens, eds.). Marcel Dekker, New York, 63–95. Okeniyi, J.O. and Okeniyi, E.T. (2012). Implementation of Kolmogorov Smirnov p-value computation in Visual Basic: implication for Microsoft Excel library function. Journal of Statistical Computation and Simulation, 82, 1727–1741. Omair, A. (2014). Sample size estimation and sampling techniques for selecting a representative sample. Journal of Health Specialties, 2, 142–147. Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50, 157–175. Pérez-Salamero González, J.M. (2015). La Muestra Continua de Vidas Laborales (MCVL) como fuente generadora de datos para el estudio del sistema de pensiones. Unpublished Ph.D. Thesis. Universitat de Valencia, Spain. Pérez-Salamero González, J.M., Regúlez-Castillo, M. and Vidal-Meliá, C. (2016). Análisis de la representatividad de la MCVL: el caso de las prestaciones del sistema público de pensiones.
Hacienda Pública Española (Review of Public Economics), 217, 67–130. Pérez-Salamero González, J.M., Regúlez-Castillo, M. and Vidal-Meliá, C. (2017). The continuous sample of working lives: improving its representativeness. SERIEs. Journal of the Spanish Economic Association, 8, 43–95. Quintela-del-Río, A. and Francisco-Fernández, M. (2017). Excel templates: a helpful tool for teaching statistics. The American Statistician, 71, 317–325. Ramsey, C.A. and Hewitt, A.D. (2005). A methodology for assessing sample representativeness. Environmental Forensics, 6, 71–75. Ripley, B.D. (2002). Statistical methods need software: a view of statistical computing. Opening lecture - Royal Statistical Society, Plymouth. Ross, A. (2015). Probability or statistics-performing a chi-square goodness-of-fit test. Mathematics Stack Exchange. Tollenaar, N. and Mooijaart, A. (2003). Type I errors and power of the parametric bootstrap goodness-of-fit test: Full and limited information. British Journal of Mathematical and Statistical Psychology, 56, 271–288. Tsang, W.W. and Cheng, K.H. (2006). The chi-square test when the expected frequencies are less than 5. In COMPSTAT 2006 - Proceedings in Computational Statistics (A. Rizzi and M. Vichi, eds.). Physica Verlag - Springer, Heidelberg, 1583–1589. Wickens, T.D. (1989). Multiway Contingency Tables Analysis for the Social Sciences. Hillsdale, NJ: Erlbaum. Wilkinson, L. (1994). Practical guidelines for testing statistical software. In Computational Statistics: Papers Collected on the Occasion of the 25th Conference on Statistical Computing at Schloss Reisensburg (P. Dirschedl and R. Ostermann, eds.). Physica Verlag - Springer, Heidelberg, 1–16. Yates, F. (1934). Contingency tables involving small numbers and the χ² test. Supplement to the Journal of the Royal Statistical Society, 1, 217–235.
NO Ministerio de Economía y Competitividad (MINECO)/FEDER
NO Universidad del País Vasco
DS Docta Complutense
RD 30 Apr 2024