Evaluation of transformer-based models for punctuation and capitalization restoration in Catalan and Galician

Authors:

  1. Vivancos Vicente, Pedro J.
  2. Valencia García, Rafael
  3. Pan, Ronghao
  4. García-Díaz, José Antonio
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2023

Issue: 70

Pages: 27-38

Type: Article


Abstract

In recent years, the performance of Automatic Speech Recognition (ASR) systems has increased considerably thanks to new deep learning methods. However, the raw output of an ASR system consists of a sequence of words without capital letters or punctuation marks. A capitalization and punctuation restoration system is therefore one of the most important post-processing steps for ASR, both to improve readability and to enable the subsequent use of its output in other NLP models. Most models focus solely on English punctuation restoration, and models for Spanish punctuation restoration have recently emerged; however, none address capitalization and punctuation restoration in Galician and Catalan. In this sense, we propose a system for capitalization and punctuation restoration based on Transformer models for Catalan and Galician. Both models perform very well, with an overall performance of 90.2% for Galician and 90.86% for Catalan, and are able to identify proper names, country names, and organizations for capitalization restoration.
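Restoration systems of the kind described in the abstract are commonly framed as token classification: a Transformer predicts, for each lowercase ASR token, which punctuation mark (if any) follows it and how it should be cased. The sketch below shows only the final, model-independent step of such a pipeline: applying per-token predictions to rebuild readable text. The label set and example tokens are illustrative assumptions, not the labels or models released by the authors.

```python
# Hypothetical post-processing step for a punctuation/capitalization
# restoration pipeline. Each token gets a (punct, case) label pair:
#   punct: "" / "," / "." / "?"  -- mark appended after the token
#   case:  "lower" / "title" / "upper"  -- predicted casing

def restore(tokens, labels):
    """Rebuild readable text from lowercase ASR tokens and predicted labels."""
    out = []
    capitalize_next = True  # the first word starts a sentence
    for token, (punct, case) in zip(tokens, labels):
        word = token
        if case == "title" or capitalize_next:
            word = word.capitalize()
        elif case == "upper":
            word = word.upper()
        out.append(word + punct)
        # a sentence-final mark forces capitalization of the next word
        capitalize_next = punct in {".", "?"}
    return " ".join(out)

# Illustrative Catalan-like input (tokens and labels are made up):
tokens = ["bon", "dia", "com", "estas"]
labels = [("", "lower"), (",", "lower"), ("", "lower"), ("?", "lower")]
print(restore(tokens, labels))  # → Bon dia, com estas?
```

In a full system the `labels` sequence would come from the classification head of a fine-tuned Transformer; this function only illustrates how joint punctuation and casing labels compose into the final transcript.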

Bibliographic References

  • Alam, T., A. Khan, and F. Alam. 2020. Punctuation restoration using transformer models for high- and low-resource languages. In Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020), pages 132–142, Online, November. Association for Computational Linguistics.
  • Armengol-Estapé, J., C. P. Carrino, C. Rodríguez-Penagos, O. de Gibert Bonet, C. Armentano-Oller, A. González-Agirre, M. Melero, and M. Villegas. 2021. Are multilingual models the best choice for moderately under-resourced languages? A comprehensive assessment for Catalan. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4933–4946, Online, August. Association for Computational Linguistics.
  • Bannard, C. and C. Callison-Burch. 2005. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 597–604, Ann Arbor, Michigan, June. Association for Computational Linguistics.
  • Bañón, M., P. Chen, B. Haddow, K. Heafield, H. Hoang, M. Esplà-Gomis, M. L. Forcada, A. Kamran, F. Kirefu, P. Koehn, S. Ortiz Rojas, L. Pla Sempere, G. Ramírez-Sánchez, E. Sarrías, M. Strelec, B. Thompson, W. Waites, D. Wiggins, and J. Zaragoza. 2020. ParaCrawl: Web-scale acquisition of parallel corpora. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4555–4567, Online, July. Association for Computational Linguistics.
  • Basili, R., C. Bosco, R. Delmonte, A. Moschitti, and M. Simi, editors. 2015. Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project, volume 589 of Studies in Computational Intelligence. Springer.
  • Bostrom, K. and G. Durrett. 2020. Byte pair encoding is suboptimal for language model pretraining. CoRR, abs/2004.03720.
  • Cañete, J., G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, and J. Pérez. 2020. Spanish pre-trained BERT model and evaluation data. In PML4DC at ICLR 2020.
  • Che, X., C. Wang, H. Yang, and C. Meinel. 2016. Punctuation prediction for unsegmented transcript based on word vector. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 654–658, Portorož, Slovenia, May. European Language Resources Association (ELRA).
  • Conneau, A., K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. CoRR, abs/1911.02116.
  • Courtland, M., A. Faulkner, and G. McElvain. 2020. Efficient automatic punctuation restoration using bidirectional transformers with robust inference. In Proceedings of the 17th International Conference on Spoken Language Translation, pages 272–279, Online, July. Association for Computational Linguistics.
  • Vilares, D., M. Garcia, and C. Gómez-Rodríguez. 2021. Bertinho: Galician BERT representations. Procesamiento del Lenguaje Natural, 66(0):13–26.
  • Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June. Association for Computational Linguistics.
  • Federico, M., M. Cettolo, L. Bentivogli, M. Paul, and S. Stüker. 2012. Overview of the IWSLT 2012 evaluation campaign. In Proceedings of the 9th International Workshop on Spoken Language Translation: Evaluation Campaign, pages 12–33, Hong Kong, December 6-7.
  • González-Docasal, A., A. García-Pablos, H. Arzelus, and A. Álvarez. 2021. AutoPunct: A BERT-based automatic punctuation and capitalisation system for Spanish and Basque. Procesamiento del Lenguaje Natural, 67(0):59–68.
  • Jones, D., F. Wolf, E. Gibson, E. Williams, E. Fedorenko, D. Reynolds, and M. Zissman. 2003. Measuring the readability of automatic speech-to-text transcripts. In 8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003. ISCA.
  • Ljubešić, N. and A. Toral. 2014. caWaC - a web corpus of Catalan and its application to language modeling and machine translation. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, and S. Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, May. European Language Resources Association (ELRA).
  • Ortiz Suárez, P. J., L. Romary, and B. Sagot. 2020. A monolingual approach to contextualized word embeddings for mid-resource languages. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1703–1714, Online, July. Association for Computational Linguistics.
  • Peitz, S., M. Freitag, A. Mauser, and H. Ney. 2011. Modeling punctuation prediction as machine translation. In Proceedings of the 8th International Workshop on Spoken Language Translation: Papers, pages 238–245, December.
  • Sanh, V., L. Debut, J. Chaumond, and T. Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv, abs/1910.01108.
  • Tiedemann, J. 2012. Parallel data, tools and interfaces in OPUS. In N. Calzolari (Conference Chair), K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, and S. Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey, May. European Language Resources Association (ELRA).
  • Tilk, O. and T. Alumäe. 2016. Bidirectional recurrent neural network with attention mechanism for punctuation restoration. In INTERSPEECH.
  • Yi, J. and J. Tao. 2019. Self-attention based model for punctuation prediction using word and speech embeddings. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7270–7274.
  • Yi, J., J. Tao, Y. Bai, Z. Tian, and C. Fan. 2020. Adversarial transfer learning for punctuation restoration.
  • Zhu, X., S. Gardiner, D. Rossouw, T. Roldan, and S. Corston-Oliver. 2022. Punctuation restoration in Spanish customer support transcripts using transfer learning. In Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing, pages 80–89, Hybrid, July. Association for Computational Linguistics.