Spanish hate-speech detection in football
- Alcaraz Mármol, Gema
- Valencia García, Rafael
- Montesinos-Cánovas, Esteban
- García-Sánchez, Francisco
- García-Díaz, José Antonio
ISSN: 1135-5948
Year of publication: 2023
Issue: 71
Pages: 15-27
Type: Article
Other publications in: Procesamiento del lenguaje natural
Abstract
In recent years, Natural Language Processing (NLP) has been applied successfully to a range of tasks, such as author profiling, negation detection, and hate-speech detection. To identify hate in text, pretrained language models can be used to build high-performance classifiers through transfer learning. This work presents the results of training and evaluating state-of-the-art classifiers built on pretrained Transformer models. The models explored are fine-tuned on a Spanish corpus of hate speech in football that was compiled as part of this research. The corpus contains a total of 7,483 football-related tweets that were manually annotated with four categories: aggressive, racist, misogynistic, and safe. A multi-label approach was used, which allows the same tweet to be tagged with more than one class. The best results, with a macro F1-score of 88.713%, were obtained by combining the models using the Knowledge Integration strategy.
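As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of macro-averaged F1 under a multi-label scheme. It is not the authors' code: the English label names, the toy annotations, and the helper functions `f1_for_label` and `macro_f1` are assumptions made only for this example, and the record does not say how the original evaluation was implemented.

```python
# Macro F1 for multi-label classification, sketched in plain Python.
# Each tweet is a 0/1 vector over the four categories, so one tweet
# may carry more than one label (e.g. aggressive AND racist).
# Label names are hypothetical English renderings of the categories.
LABELS = ["aggressive", "racist", "misogynistic", "safe"]


def f1_for_label(y_true, y_pred, j):
    """F1 for label column j, treating it as a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t[j] == 1 and p[j] == 1)
    fp = sum(1 for t, p in pairs if t[j] == 0 and p[j] == 1)
    fn = sum(1 for t, p in pairs if t[j] == 1 and p[j] == 0)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0.0 when the label never occurs
    # in either the gold annotations or the predictions.
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-label F1 scores."""
    scores = [f1_for_label(y_true, y_pred, j) for j in range(len(LABELS))]
    return sum(scores) / len(scores)


# Toy gold annotations and predictions (invented for illustration).
y_true = [
    [1, 1, 0, 0],  # aggressive and racist
    [0, 0, 0, 1],  # safe
    [1, 0, 1, 0],  # aggressive and misogynistic
    [0, 0, 0, 1],  # safe
]
y_pred = [
    [1, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
]

score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.4f}")
```

Because every label contributes equally to the mean, a rare category such as "racist" in this toy data weighs as much as a frequent one, which is the usual reason for reporting the macro average on imbalanced corpora.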