A unified data metamodel for relational and NoSQL databases: schema extraction and query

  1. Fernández Candel, Carlos Javier
Supervised by:
  1. Jesús Joaquín García Molina Director
  2. Diego Sevilla Ruiz Director

Defence university: Universidad de Murcia

Defence date: 30 June 2022

Committee:
  1. Antonio Vallecillo Moreno Chair
  2. José Ramón Hoyos Barceló Secretary
  3. Pablo Javier Tuya González Committee member
Department:
  1. Computer Science and Systems Engineering

Type: Thesis

Abstract

Goals. This thesis addresses (i) the definition of a unified data model that integrates the relational model with the data models of the four most common NoSQL paradigms: columnar, document, key-value, and graph; (ii) the definition of bidirectional mappings between the unified data model and the data model of each database system; (iii) the implementation of a common strategy for extracting schemas from different types of databases by implementing the defined canonical mappings; (iv) the development of a Model-Driven Engineering process that analyzes application code to obtain the database schema and perform a database refactoring; (v) the design and implementation of a generic schema query language for querying the schemas represented in the unified data model; (vi) the creation of a graphical notation to visualize the schemas; and (vii) a study exploring the possibility of using the unified data model to define a database query language.

Methodology. To accomplish the objectives of the thesis, we followed the Design Science Research Methodology (DSRM). This methodology proposes an iterative research process organized in several stages or activities to achieve an objective. The activities that normally constitute such a process are: (i) problem identification and motivation, (ii) definition of the objectives of the solution, (iii) design and development, (iv) demonstration, (v) evaluation, and (vi) conclusions and communication. In a DSRM process, the knowledge produced in each iteration is used as feedback to better design and implement the final artifact.

Results. This thesis addresses the main problems that arise in the development of generic database tools that integrate the most relevant data models, namely relational and NoSQL models. Firstly, we defined a unified metamodel (U-Schema) that integrates relational and NoSQL data models.
Secondly, we built logical schema extractors for each data model considered. Because most schema extraction approaches rely on data analysis, we investigated static code analysis as an alternative. Thirdly, around the unified metamodel and the set of extractors, we built a generic schema management tool that includes a schema query language and a graphical schema viewer. In tackling these issues, we faced the challenges posed by a notion of NoSQL logical schema that includes structural variations and the most common relationships between database entities.

Contributions. The contributions of this thesis are: (i) the first unified logical metamodel that integrates the most widely used database paradigms, relational and NoSQL, which also entailed the definition of two logical data models for NoSQL systems: one for aggregate-based systems (columnar, document, and key-value) and one for graph systems; (ii) the formal specification of the bidirectional mappings between the unified metamodel and the individual data models; (iii) the definition of an architecture with reusable components for creating schema extractors for any NoSQL system; (iv) a Model-Driven Engineering approach for logical schema extraction from code analysis and for database refactoring; (v) SkiQL, a generic query language designed for U-Schema that allows developers to express queries on logical schemas represented as U-Schema models; (vi) a graphical notation for visualizing schemas, including their structural variations; (vii) a study on the usefulness of U-Schema for creating a generic language to query NoSQL stores of any kind; and a comparison of the different generic metamodels proposed to represent database schemas or data formats.
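
Illustrative sketch. The abstract refers to logical schemas that include "structural variations", meaning that objects of the same entity type in a schemaless store may carry different sets of fields. The following minimal Python sketch shows one way such variations could be inferred by data analysis. It is not the U-Schema extractor described in the thesis: the function name `infer_variations`, the `_type` discriminator field, and the sample documents are all assumptions made purely for illustration.

```python
from collections import defaultdict


def infer_variations(documents):
    """Group documents by entity type and collect each distinct
    set of (field name, value type) pairs as one structural variation.

    Assumes every document carries a hypothetical "_type" field
    naming its entity type.
    """
    variations_by_entity = defaultdict(set)
    for doc in documents:
        entity = doc["_type"]
        variation = frozenset(
            (name, type(value).__name__)
            for name, value in doc.items()
            if name != "_type"
        )
        variations_by_entity[entity].add(variation)
    # Convert sets to sorted lists so the result is deterministic.
    return {
        entity: sorted(sorted(v) for v in variations)
        for entity, variations in variations_by_entity.items()
    }


# Hypothetical sample data: three User documents, one Movie document.
docs = [
    {"_type": "User", "name": "Ana", "age": 31},
    {"_type": "User", "name": "Luis", "email": "luis@example.org"},
    {"_type": "User", "name": "Eva", "age": 27},
    {"_type": "Movie", "title": "Alien", "year": 1979},
]

schema = infer_variations(docs)
for entity, variations in schema.items():
    print(entity, variations)
```

In this toy data, the `User` entity type ends up with two variations (one with `age`, one with `email`) and `Movie` with a single one. A real extractor would also need to handle nested objects, references between entities, and stores without an explicit type field, none of which this sketch attempts.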