Análisis y tipificación de errores lingüísticos para una propuesta de mejora de informes médicos en español

  1. López Hernández, Jésica
Supervised by:
  1. Ángela Almela Sánchez-Lafuente Director
  2. Fernando Molina Molina Director
  3. Rafael Valencia García Director

Defence university: Universidad de Murcia

Defence date: 18 May 2022

Committee:
  1. Pascual Cantos Gómez Chair
  2. Gema Alcaraz Mármol Secretary
  3. Mario Andrés Paredes Valverde Committee member
Department:
  1. English Philology

Type: Thesis

Abstract

The main purpose of this research is the detection, analysis and classification of linguistic errors in medical reports written in Spanish. The most recent and powerful automatic correction systems, such as neural network-based architectures, require large training data sets to perform optimally. Because biomedical-domain corpora are scarce, error collection and artificial error generation for training such systems have gained importance in natural language processing. Developing an error typology from the empirical study of a corpus of medical reports will make it possible to add new patterns to error generation in a more exhaustive way and to build more robust models for data processing in medicine. A corpus of real medical reports from four specialties (emergency medicine, ICU, general surgery and psychiatry), with more than two million tokens, has been analyzed for error detection and classification. The methodological approach combined several automatic detection and correction techniques, including an n-gram language model, Word2Vec vector representations of the corpus words, and part-of-speech tagging of the corpus. An error calculation and classification method has been developed, and the results have been analyzed quantitatively and qualitatively. The results reveal similarities and differences between these specialties and show that emergency medicine has the highest error rate in its medical reports. Most of the erroneous words are within an edit distance of one from the corresponding correct word. A large part of the detected errors is concentrated in a small number of characters, and the most common type of error is omission.
Many of the errors follow consistent, recurring patterns that can be systematized, such as the substitution of phonetically similar characters, errors caused by unfamiliarity with the current orthographic norm, and errors arising from keyboard use. In summary, this doctoral thesis aims to contribute to the study of linguistic errors in medical reports and to provide a base of linguistic knowledge for existing detection and correction methods in this domain.
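
The abstract reports that most erroneous words are within one edit of the correct form and that omission is the most common error type. The following is a minimal sketch of how a single-edit error could be labelled, not the method used in the thesis. The function name `classify_single_edit`, the four labels, and the sample word pairs are illustrative assumptions. Only "omission" is named in the abstract, and counting an adjacent swap as one edit assumes a Damerau-style distance.

```python
from typing import Optional


def classify_single_edit(error: str, correct: str) -> Optional[str]:
    """Label the single edit that turns `correct` into `error`.

    Returns "omission", "insertion", "substitution" or "transposition",
    or None when the words are identical or differ by more than one edit.
    The label set is an assumption made for this illustration.
    """
    if error == correct:
        return None

    # Find the first position where the two words differ.
    i = 0
    while i < min(len(error), len(correct)) and error[i] == correct[i]:
        i += 1

    len_diff = len(error) - len(correct)
    if len_diff == -1:
        # A character of the correct word is missing from the error.
        return "omission" if error[i:] == correct[i + 1:] else None
    if len_diff == 1:
        # The error contains one extra character.
        return "insertion" if error[i + 1:] == correct[i:] else None
    if len_diff == 0:
        if error[i + 1:] == correct[i + 1:]:
            return "substitution"
        swapped = (
            i + 1 < len(error)
            and error[i] == correct[i + 1]
            and error[i + 1] == correct[i]
            and error[i + 2:] == correct[i + 2:]
        )
        if swapped:
            return "transposition"
    return None


# Hypothetical (error, correct) pairs, invented for illustration only.
pairs = [
    ("paciene", "paciente"),
    ("pacientte", "paciente"),
    ("pasiente", "paciente"),
    ("pacinete", "paciente"),
]
labels = [classify_single_edit(e, c) for e, c in pairs]
```

Tallying such labels over a list of detected error and correction pairs, for example with `collections.Counter`, would give the kind of per-type distribution the abstract describes.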