Enhancing DGA-based botnet detection beyond 5G with on-Edge machine learning

Zago, Mattia

Supervised by:

Manuel Gil Pérez (Supervisor)
Gregorio Martínez Pérez (Supervisor)

University of defence: Universidad de Murcia

Date of defence: 22 July 2021

Examination committee:

Victor Abraham Villagrá González (Chair)
Lorenzo Fernández Maimó (Secretary)
Michele Carminati (Member)

Department:

Information and Communications Engineering (Ingeniería de la Información y las Comunicaciones)

Type: Thesis

Teseo: 729434

Abstract

Notwithstanding the scientific community's efforts and results, malware is still wreaking havoc on computer networks. However, regardless of the malware's purpose, botnets share a common point of failure: the communication channel. Infected devices need to reach out to the Command and Control (C&C) servers to download second-stage infections, perform malicious actions or await further commands. Domain Generation Algorithms (DGAs) have become a conventional way to elude detection by generating pseudo-random rendezvous points for the C&C servers. Although many machine-learning (ML)-oriented frameworks have been proposed to identify and intercept DGAs, the problem is yet to be solved. As such, the scope of this PhD thesis is to analyse the outputs of DGAs, known as algorithmically generated domain names (AGDs), and to provide a set of ML tools and privacy-aware methodologies that help identify these evasive patterns. More precisely, the objectives achieved throughout this research are twofold. Firstly, this thesis aims to characterise the relevant aspects of DGAs, including a comprehensive survey of previous contributions in the literature, data sources and ML-based approaches. Secondly, it aims to integrate and improve the state of the art by providing methods, strategies and technologies to enable detection at scale. Specifically, signature patterns are identified in malicious AGDs using natural language processing (NLP) techniques, and the resulting learning models are designed as services to be dynamically deployed anywhere on the network. As a result, this research encompasses a literature survey, theory and framework crafting, the design and evaluation of experiments, and the identification and discussion of knowledge gaps. Under the compendium modality, the three chapters composing this PhD dissertation are outlined as follows.
• Firstly, a state-of-the-art survey on ML approaches to DGA-based botnet detection; the first chapter reports on supervised and unsupervised algorithms, their feature sets, the definition of use cases and experiments and, ultimately, the outline of multiple research challenges to guide the thesis. The experimental findings lay the foundations for the formal and verifiable study of AGDs.
• Secondly, a comparative analysis of the data sources that power ML frameworks; the second chapter reports on the published datasets by providing a formal comparison and discussion of multiple orthogonal properties. In the same article, the UMUDGA dataset is introduced as a complete, balanced and up-to-date collection of DGA-related data, featuring 50 DGAs and 30+ million FQDNs. The analysis reported in the article suggests that ML solutions based on AGD pattern recognition are feasible.
• Thirdly, a proof-of-concept framework in which the detection of DGA-based botnets is deployed as a security service at the edge; the third chapter examines architectural Edge AI approaches to enable scalable detection in 5G networks and beyond. In the article, the experiments demonstrate that AGD detection is not only reasonable and achievable, but that it is also plausible to expect such detection capabilities to be deployed at the network edge and, eventually, on the user equipment (UE).

In summary, the chapters composing this PhD dissertation form a cohesive body of research that explores, analyses and, ultimately, tackles DGA-based botnets. Following this Ariadne's thread, each chapter is self-contained and provides critical insights on the research challenges from a different perspective; together, these contributions give a clear description of the research niche summarised in the thesis. However, although the thesis is conclusive on the subjects it explores, some questions raised by this research remain open. Prime among them is whether it will be feasible to provide anonymous, exchangeable and trustworthy profiles of AGDs to enable collaborative and federated detection models without harming users' privacy.
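
To make the idea of pseudo-random rendezvous points and lexical pattern analysis more concrete, the following is a minimal Python sketch. It is not the method used in the thesis, nor the algorithm of any real malware family: `toy_dga` is a hypothetical generator that derives domain names from a hashed seed, and `lexical_features` computes a few simple character-level statistics (length, Shannon entropy, vowel ratio, distinct bigrams) of the kind that an NLP-oriented detector might plausibly use as input. All function names and the choice of features are assumptions made for illustration.

```python
import hashlib
import math
from collections import Counter


def toy_dga(seed: str, count: int, length: int = 12, tld: str = ".com") -> list[str]:
    """Hypothetical DGA: derive `count` domains deterministically from `seed`."""
    domains = []
    for i in range(count):
        digest = hashlib.sha256(f"{seed}:{i}".encode()).hexdigest()
        # Map each hex digit (0-15) onto a lowercase letter in the range a-p.
        label = "".join(chr(ord("a") + int(c, 16)) for c in digest[:length])
        domains.append(label + tld)
    return domains


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def char_ngrams(text: str, n: int = 2) -> Counter:
    """Count the overlapping character n-grams in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def lexical_features(fqdn: str) -> dict:
    """Illustrative (assumed) feature set computed on the leftmost label."""
    label = fqdn.split(".")[0].lower()
    if not label:
        raise ValueError("empty domain label")
    vowels = sum(ch in "aeiou" for ch in label)
    return {
        "length": len(label),
        "entropy": shannon_entropy(label),
        "vowel_ratio": vowels / len(label),
        "distinct_bigrams": len(char_ngrams(label, 2)),
    }


if __name__ == "__main__":
    for domain in toy_dga("2021-07-22", 3) + ["google.com"]:
        print(domain, lexical_features(domain))
```

Because the generator is deterministic, an infected device and its C&C operator that share the same seed would compute the same candidate domains, which is what makes such names usable as rendezvous points; the features merely hint at how generated names might differ statistically from human-chosen ones.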