Improving the representativeness of a simple random sample: an optimization model and its application to the Continuous Sample of Working Lives
dc.contributor.author | Núñez Antón, Vicente | |
dc.contributor.author | Pérez Salamero González, Juan Manuel | |
dc.contributor.author | Regúlez Castillo, Marta | |
dc.contributor.author | Vidal-Meliá, Carlos | |
dc.date.accessioned | 2023-06-17T17:54:22Z | |
dc.date.available | 2023-06-17T17:54:22Z | |
dc.date.issued | 2019 | |
dc.description.abstract | This paper develops an optimization model for selecting a large subsample that improves the representativeness of a simple random sample previously obtained from a population larger than the population of interest. The problem is formulated as a convex mixed-integer nonlinear program (convex MINLP) and is therefore NP-hard. Nevertheless, a solution is found by maximizing the “constant of proportionality” – in other words, maximizing the size of the subsample taken from a stratified random sample with proportional allocation – subject to a p-value high enough to achieve a good fit to the population of interest under Pearson’s chi-square goodness-of-fit test. The model gives the user the freedom to choose between a larger subsample with a poorer fit and a smaller subsample with a better fit. The paper also applies the model to a real case: the Continuous Sample of Working Lives (CSWL), a set of anonymized microdata containing information on individuals from Spanish Social Security records. Several waves (2005-2017) are first examined without the model, and the conclusion is that they are not representative of the target population, which in this case is people receiving a pension income. The model is then applied, and the results show that it is possible to obtain a large dataset from the CSWL that represents the pensioner population far better for each of the waves analysed. | |
dc.description.faculty | Fac. de Ciencias Económicas y Empresariales | |
dc.description.faculty | Instituto Complutense de Análisis Económico (ICAE) | |
dc.description.refereed | TRUE | |
dc.description.sponsorship | Ministerio de Economía y Competitividad (MINECO)/FEDER | |
dc.description.sponsorship | Gobierno Vasco | |
dc.description.status | pub | |
dc.eprint.id | https://eprints.ucm.es/id/eprint/55423 | |
dc.identifier.issn | 2341-2356 | |
dc.identifier.relatedurl | https://www.ucm.es/icae | |
dc.identifier.uri | https://hdl.handle.net/20.500.14352/17484 | |
dc.issue.number | 20 | |
dc.language.iso | eng | |
dc.page.total | 30 | |
dc.publisher | Facultad de Ciencias Económicas y Empresariales. Instituto Complutense de Análisis Económico (ICAE) | |
dc.relation.ispartofseries | Documentos de Trabajo del Instituto Complutense de Análisis Económico (ICAE) | |
dc.relation.projectID | (ECO2015-65826-P; MTM2016-74931-P) | |
dc.relation.projectID | (IT 793-13; IT-642-13; UFI11/03) | |
dc.rights.accessRights | open access | |
dc.subject.jel | C61 | |
dc.subject.jel | C81 | |
dc.subject.jel | C12 | |
dc.subject.jel | H55 | |
dc.subject.jel | J26 | |
dc.subject.keyword | Optimization | |
dc.subject.keyword | Subsampling | |
dc.subject.keyword | Chi-square test | |
dc.subject.keyword | P-value | |
dc.subject.keyword | Continuous Sample of Working Lives | |
dc.subject.ucm | Optimización matemática | |
dc.subject.ucm | Economía pública | |
dc.title | Improving the representativeness of a simple random sample: an optimization model and its application to the Continuous Sample of Working Lives | |
dc.type | technical report | |
dspace.entity.type | Publication |