Eficiencia y Equidad en Problemas de Clasificación de Datos con Aplicaciones Empresariales (Efficiency and Fairness in Data Classification Problems with Business Applications)
Publication date
2023
Defense date
07/10/2022
Publisher
Universidad Complutense de Madrid
Abstract
In recent years, the need to prevent classification biases based on race, gender, sex, religion, and other attributes has increased interest in designing fair clustering algorithms. The main idea is to ensure that the output of a clustering algorithm is not biased towards or against specific subgroups of the population. A growing specialized literature addresses this problem for the clustering of numerical datasets (Chierichetti et al., 2017; Luong et al., 2011; Hardt et al., 2016; Dwork et al., 2011).

This PhD thesis proposes a methodology for clustering purely categorical and/or mixed data containing sensitive or protected attributes. It combines clustering accuracy with fairness so that the final clusters are fair and equitable, while ensuring transparency, reliability, accuracy and fairness when the groups are formed.

There is, of course, a trade-off between fairness and efficiency: increasing the fairness objective usually entails a loss of classification efficiency. However, a reasonable compromise between these objectives can be reached, since the methodology proposed in this thesis (Santos & Heras, 2020; 2021) can be easily adapted to obtain homogeneous and fair clusters.

The R statistical software (R Core Team, 2018) is widely used in the scientific community: it includes tools both for data analysis and for generating many kinds of graphics, it is free software, and it runs on several operating systems, including Windows, macOS and Linux (https://www.r-project.org/). For all these reasons, an R package that offers an alternative to existing methods, combining classification and fairness for datasets with business applications, should be of interest to the scientific community.
Description
Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Ciencias Económicas y Empresariales, defended on 07-10-2022