Better Quality Pretraining Data and T5 Models for African Languages

Oladipo, A; Adeyemi, M; Ahia, O; Ogundepo, O; Owodunni, AT; Adelani, DI; Lin, J; (2023) Better Quality Pretraining Data and T5 Models for African Languages. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. (pp. 158-168). Association for Computational Linguistics. Green open access.

PDF: 2023.emnlp-main.11.pdf - Published Version (261kB)

Abstract

In this study, we highlight the importance of enhancing the quality of pretraining data in multilingual language models. Existing web crawls have demonstrated quality issues, particularly in the context of low-resource languages. Consequently, we introduce a new multilingual pretraining corpus for 16 African languages, designed by carefully auditing existing pretraining corpora to understand and rectify prevalent quality issues. To compile this dataset, we undertake a rigorous examination of current data sources for thirteen languages within one of the most extensive multilingual web crawls, mC4, and extract cleaner data through meticulous auditing and improved web crawling strategies. Subsequently, we pretrain a new T5-based model on this dataset and evaluate its performance on multiple downstream tasks. Our model demonstrates better downstream effectiveness over existing pretrained models across four NLP tasks, underscoring the critical role data quality plays in pretraining language models in low-resource scenarios. Specifically, on cross-lingual QA evaluation, our new model is more than twice as effective as multilingual T5. All code, data and model are publicly available at https://github.com/castorini/AfriTeVa-keji.

Type: Proceedings paper
Title: Better Quality Pretraining Data and T5 Models for African Languages
Event: Conference on Empirical Methods in Natural Language Processing 2023
ISBN-13: 9798891760608
Open access status: An open access version is available from UCL Discovery
Publisher version: https://aclanthology.org/2023.emnlp-main.11.pdf
Language: English
Additional information: This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
UCL classification: UCL
UCL > Provost and Vice Provost Offices > UCL BEAMS
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science
UCL > Provost and Vice Provost Offices > UCL BEAMS > Faculty of Engineering Science > Dept of Computer Science
URI: https://discovery-pp.ucl.ac.uk/id/eprint/10188837