UMUCorpusClassifier: Compilation and evaluation of linguistic corpus for Natural Language Processing tasks

José Antonio García Díaz; Ángela Almela Sánchez-Lafuente; Gema Alcaraz Mármol; Rafael Valencia García

Journal:

Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2020

Issue: 65

Pages: 139-142

Type: Article

Abstract

Building an annotated corpus is a time-consuming task. Although some researchers have proposed automatic annotation based on heuristics, these are not always feasible. Moreover, even when annotation is carried out by people, there may be discrepancies between annotators, or within a single annotator's own judgements, that affect the quality of the corpus. A lack of supervision over the annotation process can therefore lead to low-quality corpora. In this work, we present a demonstration of UMUCorpusClassifier, an NLP tool that helps researchers compile corpora and also coordinate and supervise the annotation process. The tool facilitates daily monitoring and makes it possible to detect inconsistencies at early stages of the annotation process.

Funding information

This demonstration has been supported by the Spanish National Research Agency (AEI) and the European Regional Development Fund (FEDER/ERDF) through projects KBS4FIA (TIN2016-76323-R) and LaTe4PSP (PID2019-107652RB-I00). In addition, José Antonio García-Díaz has been supported by Banco Santander and the University of Murcia through the Doctorado industrial programme.

Funders

European Regional Development Fund, European Union
- TIN2016-76323-R
Agencia Estatal de Investigación, Spain

Data source: Dialnet