Publication:
Improving the representativeness of a simple random sample: an optimization model and its application to the Continuous Sample of Working Lives

Publication Date
2019
Publisher
Facultad de Ciencias Económicas y Empresariales. Instituto Complutense de Análisis Económico (ICAE)
Abstract
This paper develops an optimization model for selecting a large subsample that improves the representativeness of a simple random sample previously drawn from a population larger than the population of interest. The problem is formulated as a convex mixed-integer nonlinear program (convex MINLP) and is therefore NP-hard. The solution is nevertheless found by maximizing the “constant of proportionality” (in other words, maximizing the size of the subsample taken from a stratified random sample with proportional allocation), subject to a p-value high enough to achieve a good fit to the population of interest under Pearson’s chi-square goodness-of-fit test. The model lets the user choose between a larger subsample with a poorer fit and a smaller subsample with a better fit. The paper also applies the model to a real case: the Continuous Sample of Working Lives (CSWL), a set of anonymized microdata containing information on individuals from Spanish Social Security records. Several waves (2005-2017) are first examined without using the model, and the conclusion is that they are not representative of the target population, which in this case is people receiving a pension income. The model is then applied, and the results show that it is possible to obtain a large dataset from the CSWL that represents the pensioner population far better for each of the waves analysed.
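The trade-off described in the abstract (a larger subsample with a poorer fit versus a smaller one with a better fit) can be illustrated with a small sketch. This is not the paper's convex MINLP formulation: it is a simple grid scan over a candidate constant of proportionality k, where each stratum contributes at most min(n_h, floor(k * N_h)) units and the Pearson chi-square p-value must stay above a chosen threshold. The function names, the capping rule, and the stratum counts below are all hypothetical and chosen only for illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution.

    Uses the power series for the regularized lower incomplete gamma
    function; adequate for the small degrees of freedom in this sketch.
    """
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    if z > 500:  # tail is numerically zero here; also avoids overflow
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15 and n < 100000:
        term *= z / (a + n)
        total += term
        n += 1
    log_p = a * math.log(z) - z - math.lgamma(a) + math.log(total)
    return max(0.0, 1.0 - math.exp(log_p))


def pearson_p_value(sub_counts, pop_counts):
    """P-value of Pearson's goodness-of-fit test of a subsample against
    the population's stratum proportions."""
    m_total = sum(sub_counts)
    n_total = sum(pop_counts)
    stat = 0.0
    for m_h, pop_h in zip(sub_counts, pop_counts):
        expected = m_total * pop_h / n_total
        stat += (m_h - expected) ** 2 / expected
    return chi2_sf(stat, len(pop_counts) - 1)


def largest_subsample(sample_counts, pop_counts, min_p, steps=2000):
    """Scan candidate values of k and keep the largest subsample whose
    fit satisfies p-value >= min_p. Returns (k, sub_counts, p) or None."""
    ratios = [n_h / pop_h for n_h, pop_h in zip(sample_counts, pop_counts)]
    k_lo, k_hi = min(ratios), max(ratios)  # strict proportionality .. full sample
    best = None
    for i in range(steps + 1):
        k = k_lo + (k_hi - k_lo) * i / steps
        sub = [min(n_h, int(k * pop_h))
               for n_h, pop_h in zip(sample_counts, pop_counts)]
        if min(sub) == 0:
            continue
        p = pearson_p_value(sub, pop_counts)
        if p >= min_p and (best is None or sum(sub) > sum(best[1])):
            best = (k, sub, p)
    return best


# Hypothetical stratum counts (not taken from the CSWL).
population = [5000, 3000, 2000]
sample = [300, 120, 100]

loose = largest_subsample(sample, population, min_p=0.05)
strict = largest_subsample(sample, population, min_p=0.90)
print("min_p=0.05:", loose)
print("min_p=0.90:", strict)
```

Raising `min_p` shrinks the set of admissible subsamples, so the stricter threshold can never return a larger subsample than the looser one; this mirrors the size-versus-fit choice the abstract leaves to the user.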