Spanish hate-speech detection in football

  1. Alcaraz Mármol, Gema
  2. Valencia García, Rafael
  3. Montesinos-Cánovas, Esteban
  4. García-Sánchez, Francisco
  5. García-Díaz, José Antonio
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 71

Pages: 15-27

Type: Article


Abstract

In the last few years, Natural Language Processing (NLP) tools have been successfully applied to a number of different tasks, including author profiling, negation detection, and hate speech detection, to name but a few. For the identification of hate speech from text, pre-trained language models can be leveraged to build high-performing classifiers using a transfer learning approach. In this work, we train and evaluate state-of-the-art pre-trained classifiers based on Transformers. The explored models are fine-tuned using a hate speech corpus in Spanish that has been compiled as part of this research. The corpus contains a total of 7,483 football-related tweets that have been manually annotated under four categories: aggressive, racist, misogynist, and safe. A multi-label approach is used, allowing the same tweet to be labeled with more than one class. The best results, with a macro F1-score of 88.713%, have been obtained by a combination of the models using Knowledge Integration.
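The multi-label setup described in the abstract (one tweet may carry several of the four labels at once) and the macro F1 metric used to rank systems can be sketched in plain Python. The tweets' gold and predicted label sets below are invented for illustration and are not taken from the paper's corpus or models:

```python
# Hypothetical gold and predicted label sets for a handful of tweets.
# In a multi-label scheme, each item is a *set* of labels, not a single class.
LABELS = ["aggressive", "racist", "misogynist", "safe"]

gold = [{"aggressive", "racist"}, {"safe"}, {"misogynist"}, {"aggressive"}]
pred = [{"aggressive"}, {"safe"}, {"misogynist"}, {"aggressive"}]

def macro_f1(gold, pred, labels):
    """Macro F1: compute F1 per label, then average, weighting each class equally."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        # F1 = 2TP / (2TP + FP + FN); define F1 = 0 for a label never gold nor predicted
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

print(f"macro F1: {macro_f1(gold, pred, LABELS):.3f}")  # → 0.750
```

Because each class contributes equally to the average, macro F1 penalizes a system that ignores a rare label (here, "racist" is missed once and drags the score down) just as much as a frequent one, which is why it is a common choice for imbalanced hate-speech corpora.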

References

  • Ali, R., U. Farooq, U. Arshad, W. Shahzad, and M. O. Beg. 2022. Hate speech detection on Twitter using transfer learning. Computer Speech & Language, 74:101365
  • Alkomah, F. and X. Ma. 2022. A literature review of textual hate speech detection methods and datasets. Information, 13(6):273
  • Arango, A., J. Pérez, and B. Poblete. 2022. Hate speech detection is not as easy as you may think: A closer look at model validation (extended version). Information Systems, 105:101584
  • Bilal, M., A. Khan, S. Jan, and S. Musa. 2022. Context-aware deep learning model for detection of Roman Urdu hate speech on social media platform. IEEE Access, 10:121133–121151
  • Cañete, J., G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, and J. Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020
  • Cañete, J., S. Donoso, F. Bravo-Márquez, A. Carvallo, and V. Araujo. 2022. ALBETO and DistilBETO: Lightweight Spanish language models. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, and S. Piperidis, editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022, pages 4291–4298. European Language Resources Association
  • Chiril, P., E. W. Pamungkas, F. Benamara, V. Moriceau, and V. Patti. 2022. Emotionally informed hate speech detection: A multi-target perspective. Cognitive Computation, 14(1):322–352, Jan
  • Cleland, J. 2014. Racism, football fans, and online message boards: How social media has added a new dimension to racist discourse in English football. Journal of Sport and Social Issues, 38(5):415–431
  • Conneau, A., K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Association for Computational Linguistics
  • de la Rosa, J., E. G. Ponferrada, M. Romero, P. Villegas, P. González de Prado Salas, and M. Grandury. 2022. BERTIN: Efficient pre-training of a Spanish language model using perplexity sampling. Proces. del Leng. Natural, 68:13–23
  • Devlin, J., M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics
  • García-Díaz, J. A., F. García-Sánchez, and R. Valencia-García. 2023. Smart analysis of economics sentiment in Spanish based on linguistic features and transformers. IEEE Access, 11:14211–14224
  • García-Díaz, J. A., P. J. Vivancos-Vicente, Á. Almela, and R. Valencia-García. 2022. UMUTextStats: A linguistic feature extraction tool for Spanish. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, and S. Piperidis, editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022, pages 6035–6044. European Language Resources Association
  • García-Díaz, J. A., R. Colomo-Palacios, and R. Valencia-García. 2022. Psychographic traits identification based on political ideology: An author analysis study on Spanish politicians’ tweets posted in 2020. Future Generation Computer Systems, 130:59–74
  • García-Díaz, J. A., M. Cánovas-García, R. Colomo-Palacios, and R. Valencia-García. 2021. Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings. Future Generation Computer Systems, 114:506–518
  • Gutiérrez-Fandiño, A., J. Armengol-Estapé, M. Pamies, J. Llop-Palao, J. Silveira-Ocampo, C. P. Carrino, C. Armentano-Oller, C. R. Penagos, A. González-Agirre, and M. Villegas. 2022. MarIA: Spanish language models. Proces. del Leng. Natural, 68:39–60
  • He, P., J. Gao, and W. Chen. 2021. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. CoRR, abs/2111.09543
  • Husain, F. and O. Uzuner. 2022. Investigating the effect of preprocessing Arabic text on offensive language and hate speech detection. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 21(4), Jan
  • Mansur, Z., N. Omar, and S. Tiun. 2023. Twitter hate speech detection: A systematic review of methods, taxonomy analysis, challenges, and opportunities. IEEE Access, 11:16226–16249
  • Mathew, B., R. Dutt, P. Goyal, and A. Mukherjee. 2019. Spread of hate speech in online social media. In Proceedings of the 10th ACM Conference on Web Science, WebSci ’19, page 173–182, New York, NY, USA. Association for Computing Machinery
  • Mehta, H. and K. Passi. 2022. Social media hate speech detection using explainable artificial intelligence (XAI). Algorithms, 15(8):291
  • Min, B., H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth. 2021. Recent advances in natural language processing via large pre-trained language models: A survey. CoRR, abs/2111.01243
  • Mosca, E., F. Szigeti, S. Tragianni, D. Gallagher, and G. Groh. 2022. SHAP-based explanation methods: A review for NLP interpretability. In N. Calzolari, C. Huang, H. Kim, J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and S. Na, editors, Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 4593–4603. International Committee on Computational Linguistics
  • Mozafari, M., R. Farahbakhsh, and N. Crespi. 2022. Cross-lingual few-shot hate speech and offensive language detection using meta learning. IEEE Access, 10:14880–14896
  • Oliveira, L. and J. Azevedo. 2022. Using social media categorical reactions as a gateway to identify hate speech in covid-19 news. SN Computer Science, 4(1):11, Oct
  • Omar, M., S. Choi, D. Nyang, and D. Mohaisen. 2022. Robust natural language processing: Recent advances, challenges, and future directions. IEEE Access, 10:86038–86056
  • Paz, M. A., J. Montero-Díaz, and A. Moreno-Delgado. 2020. Hate speech: A systematized review. SAGE Open, 10(4):2158244020973022
  • Plaza del Arco, F. M., M. D. Molina-González, L. A. Ureña-López, and M. T. Martín-Valdivia. 2021. Comparing pre-trained language models for Spanish hate speech detection. Expert Syst. Appl., 166:114120
  • Poletto, F., V. Basile, M. Sanguinetti, C. Bosco, and V. Patti. 2021. Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, 55:477–523
  • Reimers, N. and I. Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3980–3990. Association for Computational Linguistics
  • Roy, P. K., S. Bhawal, and C. N. Subalalitha. 2022. Hate speech and offensive language detection in Dravidian languages using deep ensemble framework. Computer Speech & Language, 75:101386
  • Tausczik, Y. R. and J. W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54
  • Vasconcelos, M., J. Almeida, P. Cavalin, and C. Pinhanez. 2019. Live it up: Analyzing emotions and language use in tweets during the soccer world cup finals. In Proceedings of the 10th ACM Conference on Web Science, pages 293–294
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008
  • Wullach, T., A. Adler, and E. Minkov. 2022. Character-level hypernetworks for hate speech detection. Expert Syst. Appl., 205:117571
  • Xue, L., N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 483–498. Association for Computational Linguistics
  • Zhang, X., Y. Malkov, O. Florez, S. Park, B. McWilliams, J. Han, and A. El-Kishky. 2022. TwHIN-BERT: A socially-enriched pre-trained language model for multilingual tweet representations. CoRR, abs/2209.07562