Tratamiento de variables categóricas en modelos de Machine Learning
dc.contributor.advisor | Gregorio Rodríguez, Carlos | |
dc.contributor.author | Barragán, Rodrigo Kraus | |
dc.date.accessioned | 2023-06-22T21:23:36Z | |
dc.date.available | 2023-06-22T21:23:36Z | |
dc.date.defense | 2022 | |
dc.date.issued | 2022 | |
dc.description.abstract | The main purpose of this Master's Thesis is to study different ways of encoding categorical variables, distinguishing between ordinal and nominal ones, presenting the theory behind each method, detailing their advantages and disadvantages, and discussing in which situations one encoder or another is preferable. A clear distinction is also drawn between classic methods for handling these variables and supervised encoders, which rely on the target variable to replace each category with a value representing its influence on it. A real dataset is also used to support the theory with examples, and finally the study is applied to this dataset in order to assess how the different encoders perform on these data with several different models. | |
dc.description.abstract | The aim of this Master's Thesis is to study different ways to encode categorical variables, including ordinal and nominal variables. Different encoders are studied, showing their advantages and disadvantages and the situations in which each of them is appropriate. A distinction is made between classic and supervised encoders; the latter replace each category with the influence it has on the target variable. A real dataset is used to provide examples of each encoder, and finally the study is applied to this dataset, showing which encoder works best in this case. | |
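As a toy illustration of the distinction the abstract draws (not code from the thesis itself), the sketch below contrasts a classic encoder (one-hot indicators) with a simple supervised encoder (target-mean encoding, which replaces each category with the mean of the target over its rows). The feature values and targets are hypothetical:

```python
from collections import defaultdict

# Hypothetical nominal feature and binary target.
colors = ["red", "blue", "red", "green", "blue", "red"]
target = [1, 0, 1, 0, 1, 1]

# Classic encoder: one-hot, one indicator column per category.
categories = sorted(set(colors))
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Supervised encoder: replace each category with the mean of the
# target over the rows where that category appears.
sums = defaultdict(float)
counts = defaultdict(int)
for c, y in zip(colors, target):
    sums[c] += y
    counts[c] += 1
target_encoding = {c: sums[c] / counts[c] for c in sums}
encoded = [target_encoding[c] for c in colors]
# one_hot[0] -> [0, 0, 1]; encoded -> [1.0, 0.5, 1.0, 0.0, 0.5, 1.0]
```

One-hot adds a column per category (costly at high cardinality), while the supervised encoder yields a single numeric column but, as the thesis discusses, leaks target information unless smoothed or cross-validated.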
dc.description.department | Sección Deptal. de Sistemas Informáticos y Computación | |
dc.description.faculty | Fac. de Ciencias Matemáticas | |
dc.description.refereed | TRUE | |
dc.description.status | submitted | |
dc.eprint.id | https://eprints.ucm.es/id/eprint/75311 | |
dc.identifier.uri | https://hdl.handle.net/20.500.14352/74002 | |
dc.language.iso | spa | |
dc.master.title | Tratamiento estadístico computacional de la información | |
dc.rights.accessRights | open access | |
dc.subject.keyword | Variables categóricas | |
dc.subject.keyword | Minería de datos | |
dc.subject.keyword | Machine Learning | |
dc.subject.keyword | Codificadores clásicos | |
dc.subject.keyword | Codificadores supervisados | |
dc.subject.keyword | Categorical Variables | |
dc.subject.keyword | Data Mining | |
dc.subject.keyword | Classic Encoders | |
dc.subject.keyword | Supervised Encoders | |
dc.subject.ucm | Inteligencia artificial (Informática) | |
dc.subject.ucm | Investigación operativa (Matemáticas) | |
dc.subject.unesco | 1203.04 Inteligencia Artificial | |
dc.subject.unesco | 1207 Investigación Operativa | |
dc.title | Tratamiento de variables categóricas en modelos de Machine Learning | |
dc.type | master thesis | |
dcterms.references |
[1] Subin An. "11 Categorical Encoders and Benchmark" (2019). url: https://www.kaggle.com/code/subinium/11-categorical-encoders-and-benchmark/notebook.
[2] Deepanshu Bhalla. "Weight of Evidence (WOE) and Information Value (IV) explained" (2015). url: https://www.listendata.com/2015/03/weight-of-evidence-woeand-information.html.
[3] Huy Bui. "How to Encode Categorical Data" (2020). url: https://towardsdatascience.com/how-to-encode-categorical-data-d44dde313131#bbad.
[4] Bojan Cestnik. "Estimating Probabilities: A Crucial Task In Machine Learning". In: Proceedings of ECAI 90, Stockholm, August (1990).
[5] Bojan Cestnik and Ivan Bratko. "On estimating probabilities in tree pruning". In: European Working Session on Learning. Springer, 1991, pp. 138-150. doi: 10.1007/BFb0017010.
[6] Anna Veronika Dorogush, Vasily Ershov and Andrey Gulin. "CatBoost: gradient boosting with categorical features support". In: arXiv preprint arXiv:1810.11363 (2018). doi: 10.48550/arXiv.1810.11363.
[7] Bradley Efron and Carl Morris. "Stein's Paradox in Statistics". In: Scientific American 236.5 (1977), pp. 119-127. issn: 00368733, 19467087. url: http://www.jstor.org/stable/24954030.
[8] Python Software Foundation. "hashlib — Secure hashes and message digests". url: https://docs.python.org/3/library/hashlib.html.
[9] M.A. Gómez Villegas. "Inferencia estadística". Editorial Díaz de Santos, S.A., 2005. isbn: 9788479781224. url: https://books.google.es/books?id=YOuODwAAQBAJ.
[10] John T. Hancock and Taghi M. Khoshgoftaar. "Survey on categorical data for neural networks". In: Journal of Big Data 7.1 (2020), pp. 1-41. doi: 10.1186/s40537-020-00305-w.
[11] William James and Charles Stein. "Estimation with quadratic loss". In: Breakthroughs in Statistics. Springer, 1992, pp. 443-460. doi: 10.1007/978-1-4612-0919-5_30.
[12] Manu Joseph. "The Gradient Boosters V: CatBoost" (2020). url: https://deep-andshallow.com/2020/02/29/the-gradient-boosters-v-catboost/.
[13] M. Kuhn and K. Johnson. "Feature Engineering and Selection: A Practical Approach for Predictive Models". Chapman & Hall/CRC Data Science Series. CRC Press, 2019. isbn: 9781351609470. url: https://books.google.es/books?id=xy73DwAAQBAJ.
[14] Will McGinnis. "Category Encoders" (2022). url: https://contrib.scikit-learn.org/category_encoders/index.html.
[15] Daniele Micci-Barreca. "A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems". In: ACM SIGKDD Explorations Newsletter 3.1 (2001), pp. 27-32. doi: 10.1145/507533.507538.
[16] Carlos Mougan et al. "Quantile encoder: Tackling high cardinality categorical features in regression problems". In: International Conference on Modeling Decisions for Artificial Intelligence. Springer, 2021, pp. 168-180.
[17] Möbius. "An Overview of Categorical Encoding Methods" (2020). url: https://www.kaggle.com/code/arashnic/an-overview-of-categorical-encoding-methods/notebook.
[18] F. Pedregosa et al. "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12 (2011), pp. 2825-2830.
[19] Liudmila Prokhorenkova et al. "CatBoost: unbiased boosting with categorical features". In: Advances in Neural Information Processing Systems 31 (2018). url: https://proceedings.neurips.cc/paper/2018/hash/14491b756b3a51daac41c24863285549-Abstract.html.
[20] Adrián Rocha Íñigo. "Codificación de variables categóricas en aprendizaje automático". Universidad de Sevilla, 2020. url: https://idus.us.es/handle/11441/108887.
[21] Chris Said. "Empirical Bayes for multiple sample sizes" (2017). url: https://chrissaid.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/.
[22] Cedric Seger. "An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing" (2018). url: https://www.divaportal.org/smash/record.jsf?dswid=-2643&pid=diva2%3A1259073.
[23] Charles Stein. "Inadmissibility of the usual estimator for the mean of a multivariate normal distribution". In: Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability: Contributions to the Theory of Statistics. Vol. 1. University of California Press, 1956, p. 197.
[24] Paul Westenthanner. "Unique levels, smoothing, and QuantileEncoder" (2022). url: https://github.com/scikit-learn-contrib/category_encoders/issues/327. | |
dspace.entity.type | Publication | |
relation.isAdvisorOfPublication | 05a01c46-aac8-42b2-a6bc-4b95860cf5bf | |
relation.isAdvisorOfPublication.latestForDiscovery | 05a01c46-aac8-42b2-a6bc-4b95860cf5bf |
Original bundle
- Name: TFM_Rodrigo_Kraus_Barragan.pdf
- Size: 987.96 KB
- Format: Adobe Portable Document Format