Selección óptima de variables mediante Computación Evolutiva para algoritmos de clasificación. Aplicación a la identificación de individuos en riesgo de desarrollar sobrepeso
Publication date
2021
Abstract
In this master's thesis, a feature selection system for classifiers has been designed, based on Evolutionary Computation. Specifically, several configurations of a genetic algorithm were investigated, and a particular structure of the selection process is proposed that provides interesting results. The goal of the algorithm is to select the most suitable set of variables (features) for a classification algorithm. It uses a direct binary encoding, in which an individual marks with a 1 each variable that will be used in the classifier, and which allows individuals to be evaluated efficiently. To identify these variables, each individual is evaluated by the accuracy that the target classifier obtains on a reduced data set.
The system has been applied to data from the Genobia-CM project, although its design allows it to be applied to any other problem whose input data follow the appropriate format, which is the usual one in classification problems. Genobia is a project run by a consortium of 20 institutions, hospitals and companies, funded by the European Social Fund and the Community of Madrid (genobia.es). The project seeks to design, using artificial intelligence, predictive algorithms that identify people at risk of developing overweight, obesity and their associated pathologies. This work uses a database of 1179 individuals, provided by the consortium, that records information on lifestyle habits and adherence to the Mediterranean diet. The work focuses on selecting the variables that provide the most information for correctly classifying users into two groups: those whose data indicate that they will not become overweight, and those with a higher probability of doing so. This required understanding both the data being handled and the tools used for the selection.
Our evolutionary selection algorithm has been applied successfully to Gradient Boosting and decision tree classifiers, increasing accuracy by up to 8 % and reaching values of up to 75 %. The design allows it to be applied to the data that the consortium provides in the future, which will include genetic information for each individual as well as a larger number of cases.
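The binary encoding and accuracy-based evaluation described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the abstract does not specify the genetic operators, so tournament selection, one-point crossover, bit-flip mutation and elitism are assumptions, as are all function names and parameter values. A simple nearest-centroid classifier and synthetic data stand in for Gradient Boosting, decision trees and the Genobia-CM data, and accuracy is measured on a held-out split.

```python
import random


def make_toy_data(rng, n_samples=200, n_features=8, informative=(0, 1)):
    """Synthetic two-class data (hypothetical): only `informative` columns carry signal."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2  # alternate labels so both classes are always present
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in informative:
            row[j] += 2.0 * label  # shift informative features for class 1
        X.append(row)
        y.append(label)
    return X, y


def nearest_centroid_accuracy(mask, X_train, y_train, X_test, y_test):
    """Fitness: accuracy of a stand-in classifier using only features where mask[j] == 1."""
    cols = [j for j, bit in enumerate(mask) if bit == 1]
    if not cols:
        return 0.0  # an individual that selects no feature cannot classify
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X_test, y_test):
        dist = {
            label: sum((x[j] - c) ** 2 for j, c in zip(cols, cen))
            for label, cen in centroids.items()
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y_test)


def select_features(X_train, y_train, X_test, y_test, rng,
                    pop_size=20, generations=15, p_mut=0.1):
    """Genetic algorithm over binary masks; the operators are assumed, not from the thesis."""
    n = len(X_train[0])

    def fitness(mask):
        return nearest_centroid_accuracy(mask, X_train, y_train, X_test, y_test)

    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scored = [(fitness(ind), ind) for ind in population]

        def tournament():
            a, b = rng.sample(scored, 2)  # binary tournament
            return a[1] if a[0] >= b[0] else b[1]

        children = [best[:]]  # elitism: the best mask always survives
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, n - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            child = [1 - g if rng.random() < p_mut else g for g in child]  # bit flip
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best, fitness(best)


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_toy_data(rng)
    mask, acc = select_features(X[:150], y[:150], X[150:], y[150:], rng)
    print("selected mask:", mask, "accuracy:", acc)
```

Because the fitness function only receives a mask and returns an accuracy, the stand-in classifier could in principle be swapped for any other one without touching the evolutionary loop, which matches the classifier-agnostic design the abstract describes.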
Description
Master's Thesis (Trabajo de Fin de Máster) in Computer Engineering, Facultad de Informática UCM, Departamento de Arquitectura de Computadores y Automática, academic year 2020/2021.