Evaluation of transformer-based models for punctuation and capitalization restoration in Catalan and Galician

Authors:

  1. Vivancos Vicente, Pedro J.
  2. Valencia García, Rafael
  3. Pan, Ronghao
  4. García-Díaz, José Antonio
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 70

Pages: 27-38

Type: Article

Abstract

In recent years, the performance of Automatic Speech Recognition systems has improved considerably thanks to new deep learning methods. However, the raw output of these systems consists of sequences of words without capitalization or punctuation marks. Restoring this information improves readability and allows the output to be used by downstream NLP models. Most existing solutions focus exclusively on English, and although new punctuation restoration models for Spanish have recently appeared, none targets Galician or Catalan. We therefore propose a capitalization and punctuation restoration system based on Transformer models for these two languages. Both models perform very well: 90.2% for Galician and 90.86% for Catalan. In addition, they can identify proper nouns and the names of countries and organizations for capitalization restoration.
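The approach described treats restoration as a word-level tagging problem solved with a fine-tuned Transformer encoder. The sketch below illustrates how such a model could be queried through the Hugging Face transformers API; it is an illustration only, not the authors' released code. The checkpoint name is a placeholder, and the label inventory (e.g. COMMA, PERIOD, QUESTION, plus an uppercase flag) is assumed to be stored in the fine-tuned checkpoint's configuration.

```python
# Minimal sketch: punctuation and capitalization restoration framed as
# token classification. CHECKPOINT is a hypothetical fine-tuned encoder,
# not an artifact from the paper; its config is assumed to carry the
# task's label set (e.g. O, COMMA, PERIOD, QUESTION, UPPER, ...).
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

CHECKPOINT = "your-org/punct-cap-ca-gl"  # placeholder model name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForTokenClassification.from_pretrained(CHECKPOINT)
model.eval()

def restore(transcript: str) -> list[tuple[str, str]]:
    """Assign one restoration label to each word of a raw ASR transcript."""
    words = transcript.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        pred_ids = model(**enc).logits.argmax(dim=-1)[0].tolist()
    # A word may be split into several sub-tokens; a common convention is
    # to keep only the prediction of the first sub-token of each word.
    labelled, seen = [], set()
    for pos, word_id in enumerate(enc.word_ids(0)):
        if word_id is not None and word_id not in seen:
            seen.add(word_id)
            labelled.append((words[word_id], model.config.id2label[pred_ids[pos]]))
    return labelled

# Example: restore("ola como estas") might yield
# [("ola", "UPPER_COMMA"), ("como", "O"), ("estas", "QUESTION")]
```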

Bibliographic references

  • Alam, T., A. Khan, and F. Alam. 2020. Punctuation restoration using transformer models for high- and low-resource languages. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 132–142, Online, November. Association for Computational Linguistics.
  • Armengol-Estapé, J., C. P. Carrino, C. Rodríguez-Penagos, O. de Gibert Bonet, C. Armentano-Oller, A. González-Agirre, M. Melero, and M. Villegas. 2021. Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4933–4946, Online, August. Association for Computational Linguistics.
  • Bannard, C. and C. Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 597–604, Ann Arbor, Michigan, June. Association for Computational Linguistics.
  • Bañón, M., P. Chen, B. Haddow, K. Heafield, H. Hoang, M. Esplà-Gomis, M. L. Forcada, A. Kamran, F. Kirefu, P. Koehn, S. Ortiz Rojas, L. Pla Sempere, G. Ramírez-Sánchez, E. Sarrías, M. Strelec, B. Thompson, W. Waites, D. Wiggins, and J. Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567, Online, July. Association for Computational Linguistics.
  • Basili, R., C. Bosco, R. Delmonte, A. Moschitti, and M. Simi, editors. 2015. Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, volume 589 of Studies in Computational Intelligence. Springer.
  • Bostrom, K. and G. Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. CoRR, abs/2004.03720.
  • Cañete, J., G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, and J. Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.
  • Che, X., C. Wang, H. Yang, and C. Meinel. 2016. Punctuation prediction for unsegmented transcript based on word vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 654–658, Portorož, Slovenia, May. European Language Resources Association (ELRA).
  • Conneau, A., K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
  • Courtland, M., A. Faulkner, and G. McElvain. 2020. Efficient automatic punctuation restoration using bidirectional transformers with robust inference. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 272–279, Online, July. Association for Computational Linguistics.
  • Vilares, D., M. Garcia, and C. Gómez-Rodríguez. 2021. Bertinho: Galician BERT representations. Procesamiento del Lenguaje Natural, 66(0):13–26.
  • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.
  • Federico, M., M. Cettolo, L. Bentivogli, M. Paul, and S. Stüker. 2012. Overview of the IWSLT 2012 evaluation campaign. In Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 12–33, Hong Kong, December 6-7.
  • González-Docasal, A., A. García-Pablos, H. Arzelus, and A. Álvarez. 2021. AutoPunct: A BERT-based automatic punctuation and capitalisation system for Spanish and Basque. Procesamiento del Lenguaje Natural, 67(0):59–68.
  • Jones, D., F. Wolf, E. Gibson, E. Williams, E. Fedorenko, D. Reynolds, and M. Zissman. 2003. Measuring the readability of automatic speech-to-text transcripts. In 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4. ISCA.
  • Ljubešić, N. and A. Toral. 2014. caWaC - a web corpus of Catalan and its application to language modeling and machine translation. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).
  • Ortiz Suárez, P. J., L. Romary, and B. Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online, July. Association for Computational Linguistics.
  • Peitz, S., M. Freitag, A. Mauser, and H. Ney. 2011. Modeling punctuation prediction as machine translation. In Proceedings of the 8th International Workshop on Spoken Language Translation: Papers, pages 238–245, December.
  • Sanh, V., L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.
  • Tiedemann, J. 2012. Parallel data, tools and interfaces in OPUS. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May. European Language Resources Association (ELRA).
  • Tilk, O. and T. Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In INTERSPEECH.
  • Yi, J. and J. Tao. 2019. Self-attention based model for punctuation prediction using word and speech embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7270–7274.
  • Yi, J., J. Tao, Y. Bai, Z. Tian, and C. Fan. 2020. Adversarial transfer learning for punctuation restoration.
  • Zhu, X., S. Gardiner, D. Rossouw, T. Roldan, and S. Corston-Oliver. 2022. Punctuation restoration in Spanish customer support transcripts using transfer learning. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 80–89, Hybrid, July. Association for Computational Linguistics.