On the application of Machine Learning to Model-Driven Engineering

  1. Hernández López, José Antonio
Supervised by:
  1. Jesús Sánchez Cuadrado Supervisor

Defended at: Universidad de Murcia

Date of defense: 22 June 2023

Committee:
  1. Juan de Lara Jaramillo President
  2. Catalina Martínez Costa Secretary
  3. Artur Boronat Committee member
Department:
  1. Informática y Sistemas

Type: Thesis

Abstract

Model-Driven Engineering (MDE) is a Software Engineering methodology that raises models to first-class artifacts of the software engineering process. Models are thus no longer used only to document software; they are also employed to describe and simulate systems and to generate code. This paradigm has proven successful in many scenarios where the systems are too complex to be tackled by traditional software engineering practices. However, the quality and ease of use of modelling tools remain a limiting factor that prevents MDE from being used more extensively in practice.

Machine Learning (ML), in turn, is an Artificial Intelligence paradigm that has been applied to solve complex tasks in a wide variety of application domains. ML algorithms learn a mathematical function that maps a set of inputs to a set of outputs. This function is usually learnt from data and, depending on the form of that data, these algorithms can be roughly classified into supervised, unsupervised, and reinforcement learning. ML has been successfully applied in the Software Engineering field. In particular, it has been used to improve integrated development environments (IDEs) for code by incorporating ML models trained for tasks such as code auto-completion, documentation generation, defect detection, and test-case generation.

Unlike code IDEs, modelling tools have not yet been able to take advantage of ML techniques. One reason is the absence of extensive, high-quality curated datasets, either labelled or unlabelled, with which to train ML models. This scarcity limits the generalization of such ML models and prevents the application of deep learning technologies. Moreover, ML models cannot take a raw modelling artifact as input; the artifact must first be transformed into a representation suitable for the ML model (e.g., numeric vectors).
Several proposals address this transformation, but it has not been systematically studied for a given MDE task and ML model. Finally, few MDE tasks have been addressed with ML, and the approaches used are simplistic.

This thesis aims to overcome these issues by bridging the gap between ML and MDE. The first result of the thesis is the MAR search engine, which collects software artifacts of different types, analyses them, and stores them in a centralized database, providing query facilities to retrieve relevant models. From an ML point of view, this result can be seen as an extensive unlabelled dataset that can be used for unsupervised learning. In addition, MAR is currently the model search engine with the largest collection of models.

To address supervised learning scenarios, a set of UML and Ecore models was taken from MAR and labelled with their main category and secondary labels of interest. The result is the ModelSet dataset, the largest corpus of labelled models in the context of ML for MDE, which enables interesting applications of supervised learning to MDE.

Using ModelSet as the target dataset, a comparative study was carried out on different model encodings and ML algorithms in the context of model classification. The study shows that, for model classification, simple text-based representations (e.g., TF-IDF) and simple models (e.g., SVMs or feed-forward neural networks) are good choices. It also shows that semantic embeddings can improve the performance of these models in the case of UML.

Finally, deep learning architectures have been employed to solve a complex MDE problem: the generation of realistic models. In particular, a model generator based on autoregressive deep neural networks is proposed, which achieves state-of-the-art performance in terms of realism.
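
To make the idea of a text-based encoding such as TF-IDF more concrete, the following is a minimal sketch and not the thesis's implementation. It assumes each model is reduced to the list of its element names; the three toy models and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are hypothetical and chosen only for illustration. The IDF uses a smoothed variant, ln((1 + N) / (1 + df)) + 1, which is one common choice among several. The resulting vectors are what a classifier such as an SVM would take as input; no classifier is trained here.

```python
import math
import re
from collections import Counter


def tokenize(names):
    """Split element names such as 'PetriNet' or 'src_place' into lowercase words."""
    tokens = []
    for name in names:
        # Insert a space at camelCase boundaries, then split on non-alphanumerics.
        spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
        tokens.extend(t.lower() for t in re.split(r"[^A-Za-z0-9]+", spaced) if t)
    return tokens


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): L2-normalised TF-IDF with smoothed IDF."""
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many models each token appears at least once.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocabulary}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocabulary, vectors


def cosine(u, v):
    """Cosine similarity of two vectors that are already L2-normalised."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical toy models, each given as the names of its elements.
models = {
    "petrinet_a": ["PetriNet", "Place", "Transition", "Arc", "tokens"],
    "petrinet_b": ["Net", "Place", "Transition", "inputArc", "outputArc"],
    "library": ["Library", "Book", "Author", "Member", "Loan"],
}

vocabulary, vectors = tfidf_vectors(list(models.values()))
sim_same_domain = cosine(vectors[0], vectors[1])
sim_other_domain = cosine(vectors[0], vectors[2])
print(f"petrinet_a vs petrinet_b: {sim_same_domain:.3f}")
print(f"petrinet_a vs library:    {sim_other_domain:.3f}")
```

In this toy setting the two Petri net models share several tokens and therefore end up closer to each other than to the library model, which is the property a downstream classifier relies on.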