Spanish hate-speech detection in football

  1. Alcaraz Mármol, Gema
  2. Valencia García, Rafael
  3. Montesinos-Cánovas, Esteban
  4. García-Sánchez, Francisco
  5. García-Díaz, José Antonio
Journal: Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 71

Pages: 15-27

Type: Article


Abstract

In the last few years, Natural Language Processing (NLP) tools have been successfully applied to a number of different tasks, including author profiling, negation detection, and hate speech detection, to name but a few. For the identification of hate speech from text, pre-trained language models can be leveraged to build high-performing classifiers using a transfer learning approach. In this work, we train and evaluate state-of-the-art pre-trained classifiers based on Transformers. The explored models are fine-tuned using a hate speech corpus in Spanish that has been compiled as part of this research. The corpus contains a total of 7,483 football-related tweets that have been manually annotated under four categories: aggressive, racist, misogynist, and safe. A multi-label approach is used, allowing the same tweet to be labeled with more than one class. The best results, with a macro F1-score of 88.713%, have been obtained by a combination of the models using Knowledge Integration.
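The multi-label setup described in the abstract (one tweet may carry several of the four labels at once) and the macro F1 metric used to rank systems can be sketched in plain Python. The tweets' gold and predicted label sets below are invented for illustration and are not taken from the paper's corpus or models:

```python
# Hypothetical gold and predicted label sets for a handful of tweets.
# In a multi-label scheme, each item is a *set* of labels, not a single class.
LABELS = ["aggressive", "racist", "misogynist", "safe"]

gold = [{"aggressive", "racist"}, {"safe"}, {"misogynist"}, {"aggressive"}]
pred = [{"aggressive"}, {"safe"}, {"misogynist"}, {"aggressive"}]

def macro_f1(gold, pred, labels):
    """Macro F1: compute F1 per label, then average, weighting each class equally."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        # F1 = 2TP / (2TP + FP + FN); define F1 = 0 for a label never gold nor predicted
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

print(f"macro F1: {macro_f1(gold, pred, LABELS):.3f}")  # → 0.750
```

Because each class contributes equally to the average, macro F1 penalizes a system that ignores a rare label (here, "racist" is missed once and drags the score down) just as much as a frequent one, which is why it is a common choice for imbalanced hate-speech corpora.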

References

  • Ali, R., U. Farooq, U. Arshad, W. Shahzad, and M. O. Beg. 2022. Hate speech detection on Twitter using transfer learning. Computer Speech & Language, 74:101365
  • Alkomah, F. and X. Ma. 2022. A literature review of textual hate speech detection methods and datasets. Information, 13(6):273
  • Arango, A., J. Pérez, and B. Poblete. 2022. Hate speech detection is not as easy as you may think: A closer look at model validation (extended version). Information Systems, 105:101584
  • Bilal, M., A. Khan, S. Jan, and S. Musa. 2022. Context-aware deep learning model for detection of Roman Urdu hate speech on social media platform. IEEE Access, 10:121133–121151
  • Cañete, J., G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, and J. Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020
  • Cañete, J., S. Donoso, F. Bravo-Márquez, A. Carvallo, and V. Araujo. 2022. ALBETO and DistilBETO: Lightweight Spanish language models. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, and S. Piperidis, editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022, pages 4291–4298. European Language Resources Association
  • Chiril, P., E. W. Pamungkas, F. Benamara, V. Moriceau, and V. Patti. 2022. Emotionally informed hate speech detection: A multi-target perspective. Cognitive Computation, 14(1):322–352, Jan
  • Cleland, J. 2014. Racism, football fans, and online message boards: How social media has added a new dimension to racist discourse in English football. Journal of Sport and Social Issues, 38(5):415–431
  • Conneau, A., K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In D. Jurafsky, J. Chai, N. Schluter, and J. R. Tetreault, editors, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 8440–8451. Association for Computational Linguistics
  • de la Rosa, J., E. G. Ponferrada, M. Romero, P. Villegas, P. González de Prado Salas, and M. Grandury. 2022. BERTIN: Efficient pre-training of a Spanish language model using perplexity sampling. Proces. del Leng. Natural, 68:13–23
  • Devlin, J., M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics
  • García-Díaz, J. A., F. García-Sánchez, and R. Valencia-García. 2023. Smart analysis of economics sentiment in Spanish based on linguistic features and transformers. IEEE Access, 11:14211–14224
  • García-Díaz, J. A., P. J. Vivancos-Vicente, Á. Almela, and R. Valencia-García. 2022. UMUTextStats: A linguistic feature extraction tool for Spanish. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, J. Odijk, and S. Piperidis, editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, LREC 2022, Marseille, France, 20-25 June 2022, pages 6035–6044. European Language Resources Association
  • García-Díaz, J. A., R. Colomo-Palacios, and R. Valencia-García. 2022. Psychographic traits identification based on political ideology: An author analysis study on Spanish politicians’ tweets posted in 2020. Future Generation Computer Systems, 130:59–74
  • García-Díaz, J. A., M. Cánovas-García, R. Colomo-Palacios, and R. Valencia-García. 2021. Detecting misogyny in Spanish tweets. An approach based on linguistics features and word embeddings. Future Generation Computer Systems, 114:506–518
  • Gutiérrez-Fandiño, A., J. Armengol-Estapé, M. Pamies, J. Llop-Palao, J. Silveira-Ocampo, C. P. Carrino, C. Armentano-Oller, C. R. Penagos, A. González-Agirre, and M. Villegas. 2022. MarIA: Spanish language models. Proces. del Leng. Natural, 68:39–60
  • He, P., J. Gao, and W. Chen. 2021. DeBERTaV3: Improving DeBERTa using ELECTRA-style pre-training with gradient-disentangled embedding sharing. CoRR, abs/2111.09543
  • Husain, F. and O. Uzuner. 2022. Investigating the effect of preprocessing Arabic text on offensive language and hate speech detection. ACM Trans. Asian Low-Resour. Lang. Inf. Process., 21(4), Jan
  • Mansur, Z., N. Omar, and S. Tiun. 2023. Twitter hate speech detection: A systematic review of methods, taxonomy analysis, challenges, and opportunities. IEEE Access, 11:16226–16249
  • Mathew, B., R. Dutt, P. Goyal, and A. Mukherjee. 2019. Spread of hate speech in online social media. In Proceedings of the 10th ACM Conference on Web Science, WebSci ’19, page 173–182, New York, NY, USA. Association for Computing Machinery
  • Mehta, H. and K. Passi. 2022. Social media hate speech detection using explainable artificial intelligence (XAI). Algorithms, 15(8):291
  • Min, B., H. Ross, E. Sulem, A. P. B. Veyseh, T. H. Nguyen, O. Sainz, E. Agirre, I. Heintz, and D. Roth. 2021. Recent advances in natural language processing via large pre-trained language models: A survey. CoRR, abs/2111.01243
  • Mosca, E., F. Szigeti, S. Tragianni, D. Gallagher, and G. Groh. 2022. SHAP-based explanation methods: A review for NLP interpretability. In N. Calzolari, C. Huang, H. Kim, J. Pustejovsky, L. Wanner, K. Choi, P. Ryu, H. Chen, L. Donatelli, H. Ji, S. Kurohashi, P. Paggio, N. Xue, S. Kim, Y. Hahm, Z. He, T. K. Lee, E. Santus, F. Bond, and S. Na, editors, Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 4593–4603. International Committee on Computational Linguistics
  • Mozafari, M., R. Farahbakhsh, and N. Crespi. 2022. Cross-lingual few-shot hate speech and offensive language detection using meta learning. IEEE Access, 10:14880–14896
  • Oliveira, L. and J. Azevedo. 2022. Using social media categorical reactions as a gateway to identify hate speech in covid-19 news. SN Computer Science, 4(1):11, Oct
  • Omar, M., S. Choi, D. Nyang, and D. Mohaisen. 2022. Robust natural language processing: Recent advances, challenges, and future directions. IEEE Access, 10:86038–86056
  • Paz, M. A., J. Montero-Díaz, and A. Moreno-Delgado. 2020. Hate speech: A systematized review. SAGE Open, 10(4):2158244020973022
  • Plaza del Arco, F. M., M. D. Molina-González, L. A. Ureña-López, and M. T. Martín-Valdivia. 2021. Comparing pre-trained language models for Spanish hate speech detection. Expert Syst. Appl., 166:114120
  • Poletto, F., V. Basile, M. Sanguinetti, C. Bosco, and V. Patti. 2021. Resources and benchmark corpora for hate speech detection: a systematic review. Language Resources and Evaluation, 55:477–523
  • Reimers, N. and I. Gurevych. 2019. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In K. Inui, J. Jiang, V. Ng, and X. Wan, editors, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 3980–3990. Association for Computational Linguistics
  • Roy, P. K., S. Bhawal, and C. N. Subalalitha. 2022. Hate speech and offensive language detection in Dravidian languages using deep ensemble framework. Computer Speech & Language, 75:101386
  • Tausczik, Y. R. and J. W. Pennebaker. 2010. The psychological meaning of words: LIWC and computerized text analysis methods. Journal of Language and Social Psychology, 29(1):24–54
  • Vasconcelos, M., J. Almeida, P. Cavalin, and C. Pinhanez. 2019. Live it up: Analyzing emotions and language use in tweets during the soccer world cup finals. In Proceedings of the 10th ACM Conference on Web Science, pages 293–294
  • Vaswani, A., N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. 2017. Attention is all you need. In I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pages 5998–6008
  • Wullach, T., A. Adler, and E. Minkov. 2022. Character-level hypernetworks for hate speech detection. Expert Syst. Appl., 205:117571
  • Xue, L., N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, and C. Raffel. 2021. mT5: A massively multilingual pre-trained text-to-text transformer. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, and Y. Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 483–498. Association for Computational Linguistics
  • Zhang, X., Y. Malkov, O. Florez, S. Park, B. McWilliams, J. Han, and A. El-Kishky. 2022. TwHIN-BERT: A socially-enriched pre-trained language model for multilingual tweet representations. CoRR, abs/2209.07562