Modelos de análisis semántico de información y conocimiento genético y genómico para el estudio de enfermedades genéticas y cáncer

Almagro Hernández, Ginés

Supervised by:

Jesualdo Tomás Fernández Breis (Director)

Defense university: Universidad de Murcia

Defense date: 17 November 2020

Examination committee:

Manuel Franco Nicolás (Chair)
Horacio E. Pérez Sánchez (Secretary)
M. Carme Camps Febrer (Member)

Department:

Informática y Sistemas

Type: Thesis

Teseo: 153759

Abstract

Models of semantic analysis of genetic and genomic information and knowledge for the study of genetic diseases and cancer
Author: Ginés Almagro Hernández
Director: Dr. Jesualdo Tomás Fernández Breis

A ChIP-seq experiment studies the behavior of a specific DNA-binding protein in a specific cell line under a given biological condition. It consists of an immunoprecipitation stage of chromatin fragments (ChIP), followed by their identification by sequencing (seq) with Next Generation Sequencing (NGS) techniques. The methods implemented so far for analyzing the results of this type of experiment (enriched regions, or peaks) share two main characteristics:

(i) They treat the uncertainty involved in these results with statistical methods based on dichotomous models.
(ii) Their output relates a functional element (gene, Gene Ontology term, metabolic pathway) to a p-value calculated by means of an enrichment test.

Objectives

The main objective of this thesis is the design, implementation and evaluation of a multi-level analytical framework for evaluating the behavior of the protein under study on a genomic scale. The framework is intended to be scalable and flexible, to have a solid n-dimensional statistical basis and mathematical interpretation, and to rest on knowledge models that provide the semantics and structure needed to deal with the numerous, heterogeneous and complex genomic and biological information available. To achieve this, the secondary objectives are:

(i) To address the uncertainty that accompanies the results of this type of experiment through statistical methods based on a multivariate hypergeometric distribution, not used until now.
(ii) To create standards of performance for the necessary analyses and models, in order to generate reference profiles that describe a specific triple (protein, cell line, biological condition).
(iii) To allow the comparison, sharing, evaluation and integration of data obtained from these types of experiments, regardless of where they were conducted.

Methodology

Design, development and implementation of knowledge models:

(i) The Genome Model, which houses information on the genome under study, both on its structure (chromosomes, gaps, autosomal regions...) and on the functional entities that compose it (genes of various biotypes, functional sequences such as enhancers, insulators...).
(ii) The Gene Model, which houses information about the functional entities that encode a functional product, whether it is a protein, tRNA, rRNA, etc.
(iii) The Functional Model, which houses information on functional resources, such as metabolic pathways, functional terms, etc.

Conversion of these knowledge models into probabilistic models, each representing a finite population of possible binding sites of the protein to the genome of the organism under study.

Design of an analytical framework that relates these probabilistic models to the peaks of the experiment through a standardized mathematical analysis, which determines the behavior of the protein under study at different levels of resolution: the Region level, the Gene level and the Functional level.

Validation of the multi-level analytical framework developed in this thesis, taking the human genome as a model. For this purpose, 19 ChIP-seq experiments were taken from the ReMap 2020 public database, grouped in 7 studies on the MYC protein in the P493-6 and U2OS cell lines.

Results and Conclusions

The results obtained verify the main hypothesis of this thesis: that the peaks obtained from a ChIP-seq experiment can be modeled as the result of a random experiment fitting a multivariate hypergeometric distribution. This provides a new framework for analyzing this type of experiment, one that minimizes the effects of the uncertainty that accompanies its results and generates new information and knowledge about the behavior of the protein under study, from innovative perspectives different from those used to date.
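To make the central hypothesis more concrete, the following is a minimal sketch of the probability mass function of a multivariate hypergeometric distribution, applied to peaks drawn without replacement from a finite population of possible binding sites. It is an illustration under stated assumptions, not the implementation used in the thesis: the function name, the three region categories and all counts are hypothetical and do not come from the abstract.

```python
from math import comb


def multivariate_hypergeom_pmf(population, observed):
    """Probability of observing `observed[i]` items from each category i
    when sum(observed) items are drawn without replacement from a finite
    population with `population[i]` items in category i."""
    if len(population) != len(observed):
        raise ValueError("population and observed must have the same length")
    # An outcome that asks for more items than a category holds is impossible.
    if any(k < 0 or k > size for size, k in zip(population, observed)):
        return 0.0
    total_sites = sum(population)
    total_peaks = sum(observed)
    # Number of ways to pick the observed count from each category.
    favourable = 1
    for size, k in zip(population, observed):
        favourable *= comb(size, k)
    # Divide by the number of ways to pick total_peaks sites overall.
    return favourable / comb(total_sites, total_peaks)


# Hypothetical toy population: possible binding sites per region category.
population = {"promoter": 5, "gene_body": 10, "intergenic": 15}
# Hypothetical peak counts observed in each category.
observed = {"promoter": 2, "gene_body": 2, "intergenic": 2}

p = multivariate_hypergeom_pmf(list(population.values()),
                               list(observed.values()))
print(f"P(observed peak distribution) = {p:.6f}")
```

In this toy setting, each category plays the role of one class of possible binding sites in a probabilistic model, and the peak counts are treated as a single draw from that finite population. A real analysis would need population sizes derived from the knowledge models described above, which the abstract does not specify.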