Detección de URLs maliciosas mediante aprendizaje automático (Malicious URL detection using machine learning)
Publication date
2024
Abstract
Cyberattacks have increased significantly, and terms such as phishing, hacking, and computer fraud have become part of everyday life. While technological advances have improved quality of life, they have also brought more sophisticated techniques for attacking and defrauding users. One such technique is the propagation of malicious URLs: simply clicking on the wrong URL can have serious consequences for the user. This master's thesis addresses the detection of malicious URLs using machine learning, an approach that has proven effective against them. In the initial phase, a dataset was compiled from different sources, covering four categories of URL: phishing, malware, defacement, and benign. Features were then extracted from the URLs, 16 in total, including URL length, the number of days the domain has been active, and entropy. Data preprocessing techniques were also applied, such as removing duplicate values and scaling the data to reduce the influence of outliers. In the experimentation phase, five machine learning models were selected: random forests, logistic regression, support vector machines, k-nearest neighbors, and neural networks. This phase concluded with an evaluation of each model's metrics and performance in identifying and classifying URLs. Finally, as part of the project, a graphical interface was developed that allows direct interaction with the trained models, making the results easier to understand.
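The feature-extraction and preprocessing steps mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the choice of lexical features beyond URL length and entropy, and the use of min-max scaling are all assumptions made for illustration. Domain age is omitted because it would require an external WHOIS lookup.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Hypothetical feature set: lexical attributes only, no network lookups."""
    parsed = urlparse(url)
    return {
        "url_length": len(url),
        "hostname_length": len(parsed.hostname or ""),
        "digit_count": sum(ch.isdigit() for ch in url),
        "entropy": shannon_entropy(url),
    }


def deduplicate(urls):
    """Drop exact duplicate URLs while preserving first-seen order."""
    return list(dict.fromkeys(urls))


def min_max_scale(values):
    """Rescale numbers to [0, 1] (assumed scaling method); a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

In a pipeline of this kind, `deduplicate` would run on the raw URL list first, `extract_features` would then be applied to each remaining URL, and each numeric column would be scaled before being passed to a classifier.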
Description
Master's thesis (Trabajo de Fin de Máster) in Internet of Things, Facultad de Informática UCM, Departamento de Ingeniería de Software e Inteligencia Artificial, academic year 2023/2024.