BOTS-LM: Training Large Language Models for Setswana (2024)

Nathan Brown
School of Computing
Clemson University
nbrown9@clemson.edu
Vukosi Marivate
Department of Computer Science
University of Pretoria
vukosi.marivate@cs.up.ac.za

Abstract

In this work we present BOTS-LM, a series of bilingual language models proficient in both Setswana and English. Leveraging recent advancements in data availability and efficient fine-tuning, BOTS-LM achieves performance similar to models significantly larger than itself while maintaining computational efficiency. Our initial release features an 8 billion parameter generative large language model, with upcoming 0.5 billion and 1 billion parameter large language models and a 278 million parameter encoder-only model soon to be released. We find the 8 billion parameter model significantly outperforms Llama-3-70B and Aya 23 on English-Setswana translation tasks, approaching the performance of dedicated machine translation models, while approaching 70B parameter performance on Setswana reasoning as measured by a machine translated subset of the MMLU benchmark. To accompany the BOTS-LM series of language models, we release the largest Setswana web dataset, SetsText, totalling over 267 million tokens. In addition, we release the largest machine translated Setswana dataset, the first and largest synthetic Setswana dataset, training and evaluation code, training logs, and MMLU-tsn, a machine translated subset of MMLU.

1 Introduction

Setswana, also known as Tswana, is a Bantu language spoken by an estimated five to ten million people worldwide [Bennett etal., 2016]. Closely related to Northern Sotho and Southern Sotho, Setswana holds official status in Botswana [Government of Botswana, 2024], South Africa [The Republic of South Africa, 1996], and Zimbabwe [The Parliament of Zimbabwe, 2013], and is also used in countries like Namibia, often interchangeably with English. Despite its significance in millions of lives, Setswana has been largely overlooked in traditional natural language processing research. This work aims to bridge the gap between Setswana and other high-resource languages by making generative large language models capable of high-quality Setswana available to the open research community for the first time, significantly increasing data availability, and laying the groundwork for future Setswana-centric research.

Large Language Models (LLMs) have demonstrated remarkable capabilities across various domains after training on web and synthetic data [OpenAI etal., 2023, Dubey etal., 2024, Anthropic, 2024, Gunasekar etal., 2023], excelling in areas such as mathematics [Mistral AI, 2024], programming [Guo etal., 2024], creative writing [Wang etal., 2024], and translation tasks [Vaswani etal., 2017]. However, many developments surrounding this technology remain English-centric, with most non-English languages targeted for training and evaluation being "high-resource languages" with abundant available data. While this approach yields impressive capabilities, it results in models that may lack knowledge of certain cultures, limits production use-cases outside majority demographics, and prevents a significant portion of the global population from utilizing language models effectively. African languages like Setswana, with comparatively little webtext data available, subsequently suffer in performance and are underutilized in research.

Recent progress has been made to address the lack of language diversity in language models. Releases such as mBART [Liu etal., 2020], XLM-RoBERTa [Conneau etal., 2019], and BLOOM [BigScience Workshop etal., 2022] were among the earliest and most influential advancements in multilingual NLP. Building upon these technologies, newer models like GPT-4 [OpenAI etal., 2023], Claude [Anthropic, 2024], Gemini [Gemini Team etal., 2023], Llama [Dubey etal., 2024], and Gemma [Gemma Team etal., 2024] have also found success in multilingual domains, often demonstrating reasoning and translation capabilities in languages not officially supported. Releases such as Aya 101 [Üstün etal., 2024] and Aya 23 [Aryabumi etal., 2024] have continued to improve language coverage while maintaining high performance in generative models, and open corpora like ROOTS [Laurençon etal., 2023], OSCAR [Suárez etal., 2019], and mc4 [Caswell etal., 2021] have made multilingual data readily available. However, Setswana comprises only a small fraction of these datasets, with just 0.0002% of the ROOTS corpus being written in Setswana. Moreover, much of the available Setswana text in open multilingual corpora is of lower quality, resulting in existing generative models displaying significantly worse conversational, translation, and reasoning capabilities compared to other languages.

To help address this issue, we introduce the Bilingual Open Tswana Suite of Language Models (BOTS-LM). This suite consists of Quantized LoRA (QLoRA) fine-tuned versions of Qwen2 0.5B Yang etal. [2024], DCLM 1B Li etal. [2024], Llama 3 8B Dubey etal. [2024], and AfroXLMR-base (279M) [Alabi etal., 2022]. By training a range of models across different parameter counts, we aim to provide the research community and millions of Setswana speakers with models that are not only highly performant, but also capable of running on various hardware configurations ranging from data centers to consumer laptops and mobile phones. Following the approach of projects like OMLo [Groeneveld etal., 2024] and Dolma [Soldaini etal., 2024], we release all training and evaluation data, associated code, and training logs. We release a preliminary training checkpoint of our 8B model, with additional models being released soon. The 8B model and all future BOTS-LM models can be accessed here, training data can be accessed here, and the training Weights & Biases logs can be accessed here. Additionally, we utilize GPT-4o to generate over 1.5 million tokens of synthetic Setswana text, which we release here. Last, we release the code used for dataset curation, model training, and model evaluation here.

2 Related Work

Despite the focus on other languages in research, there have been recent significant advancements in Setswana-centric NLP. TswanaBERT [Motsoehli, 2020] represents one of the earliest examples, trained on over ten thousand Setswana sentences from the Leipzig Corpora Collection [Goldhahn etal., 2012], SABC news headlines [Marivate etal., 2020], and various blogs and websites. More recently, the NCHLT Setswana RoBERTa model [Eiselen etal., 2023] was released, having been trained on over fourteen million tokens of Setswana text from the NCHLT [Eiselen and Puttkammer, 2014], Autshumato [McKellar etal., 2016], Leipzig [Goldhahn etal., 2012], and Common Crawl corpora, and an internal CTexT corpus. PuoBERTa [Marivate etal., 2023] marked a significant step forward in masked language modeling, achieving state of the art performance while being the first model targeted at Setswana trained from scratch with a custom tokenizer developed for Setswana. PuoBERTa was released alongside PuoData, the largest collection of primarily Setswana hand-curated text at the time, totaling four and a half million PuoBERTa tokens excluding the JW300 Setswana dataset [Agić and Vulić, 2019].

Much of the recent work on African languages has targeted multilingual developments with a focus on machine translation and transfer learning. Corpora such as OPUS [Tiedemann, 2012] and MADLAD-400 [Kudugunta etal., 2023] provide access to large volumes of parallel web text data, while datasets like MAFAND-MT [Adelani etal., 2022] have enabled improved translation performance across many languages through additional human-generated data. Meta’s No Language Left Behind (NLLB) [NLLB Team etal., 2022] has facilitated high-quality machine translation between over 200 languages, including Setswana, although it is noted as being one of the 21 languages with the lowest accuracy on FLORES-200 [Goyal etal., 2022]. MADLAD-400 [Kudugunta etal., 2023] has allowed for increased multilinguality but suffers from worse performance on Setswana-English translations, as seen in Table 3. Furthermore, models such as AfriBERTa [Ogueji etal., 2021] and AfroXLMR [Alabi etal., 2022] have found success in training across multiple African and low-resource languages in the masked language modeling regime, and LlamaX [Lu etal., 2024] has demonstrated high degrees of translation performance across over 100 languages as a generative LLM while retaining generalization capabilities.

Thanks to the cumulative improvements made by the research community in Setswana-centric NLP and increased levels of multilinguality, we believe there are significant opportunities to enhance existing Setswana NLP systems. Several smaller datasets of Setswana text are often underutilized in the literature, potentially due to their differing formats and distribution methods. For example, the South African Center for Digital Language Resources (SADiLaR)111https://sadilar.org[SADiLaR, 2024] has made publicly available several datasets of Setswana text and audio. However, as the format for this distributed text is non-standardized and data is typically not available on commonly used external platforms such as Hugging Face, much of this data is left out of massive public datasets. We also find many websites and corresponding PDF contents written in Setswana are excluded from existing datasets, meaning a significant portion of the available high-quality Setswana text is not being utilized. Additionally, we find the advent of improved machine translation and large language models capable of generating text in Setswana allows for the creation of synthetic datasets and translated benchmarks to further improve and better gauge downstream performance on various tasks.

3 Data

A key focus in the design of our selected training data is the languages to be targeted. While several parallel corpora are available for Setswana, with some containing Setswana text alongside multiple other languages, there is significantly more English-Setswana parallel text available. For example, OPUS contains over four times more parallel Setswana-English sentences than it does Setswana-French sentences222https://opus.nlpl.eu/results/tn&en/corpus-result-table. In addition, the majority of publicly available, high-quality, educational, and instruction-tuning datasets are written in English. Consequently, many state-of-the-art large language models excel in English tasks, including the models we select for continued pre-training. Additionally, South Africa and Botswana, the two countries with the largest Setswana-speaking populations, often utilize English in legal, official, and government documents, with English sometimes being spoken interchangeably with Setswana. We observe this trend in web text documents, where many data sources are written in both English and Setswana, either incorporating code-switching or providing direct translations. As such, and to reduce computational demands, we focus our efforts primarily on curating a high-quality dataset consisting of Setswana and English texts, though other languages may be present. We refer to this training dataset as SetsText.

SetsText builds upon several previous works in the African NLP research community. For consistency, when measuring token counts, we utilize the Llama 3 tokenizer throughout this paper. The most significant source of high-quality Setswana data in SetsText is PuoData [Marivate etal., 2023], which consists of over eight million tokens sourced from various government documents, books, educational materials, and other online content, including the NCHLT [Eiselen and Puttkammer, 2014], Leipzig [Goldhahn etal., 2012], and SABC [Marivate etal., 2020] corpora. However, PuoData was originally curated for training PuoData on masked language modeling, resulting in training sequences of individual sentences instead of entire documents. To ensure our data is in a format most useful for LLM training, we reconstruct the Wikipedia333https://huggingface.co/datasets/wikimedia/wikipedia, Nalibali444https://nalibali.org/, and Setswana Bible555https://www.bible.com/versions/185-tsw70-setswana-1970 subsets of PuoData to retain the original documents in their entirety. The second largest source of Setswana text we utilize is the OPUS corpus [Tiedemann, 2012], which includes over six million parallel English-Setswana sentences. This subset of OPUS comprises the NLLB [Schwenk etal., 2021, Fan etal., 2021], CCAligned [El-Kishky etal., 2020], XLEnt [El-Kishky etal., 2021], Wikimedia666https://opus.nlpl.eu/wikimedia/tn&en/v20230407/wikimedia, and Tatoeba777https://tatoeba.org/en/[Tatoeba, 2024] corpora. However, we find that training on the entire English-Setswana subset of OPUS often yields models which suffer from catastrophic forgetting, likely due to the sheer number of sequences present in the dataset as well as lower overall quality. To mitigate this issue, we include only one-third of this data in SetsText, totaling 210M tokens.

To further improve BOTS-LM’s generalization capabilities and to make large swaths of Setswana data more easily available to researchers, we expand some existing data sources through additional data collection and hand-curate data from new publicly available sources. We begin by obtaining PDF data from the Parliament of Botswana’s Hansard888https://www.parliament.gov.bw/index.php?option=com_documents&view=files&catid=87&Itemid=438[Parliament of Botswana, 2022], a collection of documents written in English and Setswana containing the "printed verbatim record of all Parliamentary debates and proceedings which take place in the Chamber of the National Assembly." Additionally, we source all texts from the Rare Tswana Books collection999https://wiredspace-extra.wits.ac.za/collections/286d8e41-cb6e-4679-9ed7-97a4e3d0d3de[Rahlao etal., 2021], which consists of rare African books written in Setswana dating back to the 1800s. We also obtain additional PDF data from Nalibali that was not present in PuoData, nearly doubling the number of available tokens in that subset. Furthermore, we scrape the contents of certain websites containing predominantly Setswana text, such as iAfrika101010https://iafrika.org/[iAfrika, 2024], Setswana Mo Botswana111111https://setswanabwa.com/[TRK, 2021], Tlhalefang Communications121212https://tlhalefang.com/setswana/[Tlhalefang, 2009], and Unisa131313https://www.unisa.ac.za/sites/corporate/default/Unisa-Open/OER-@-Unisa/Learn-to-speak-an-African-Language[Unisa, 2023]. We also include a small corpus of parallel text consisting of monolingual English mathematical text translated into code-mixed English and Setswana [Mokoka, 2024]. 
Finally, we include miscellaneous individual documents such as the United States Peace Corps’ Intro to Spoken Setswana141414https://files.eric.ed.gov/fulltext/ED283381.pdf[Mistry and Gare, 1987], Peace Corps Setswana language lessons on Live Lingua151515https://www.livelingua.com/courses/setswana[Lingua, 2024], and the Setswana Universal Declaration of Human Rights161616https://www.ohchr.org/en/human-rights/universal-declaration/translations/western-sothotswanasetswana[Nations, 1998]. For the majority of PDFs we extract text using the pypdfium2 Python library. However, while the Peace Corps’ Intro to Spoken Setswana is one of the most comprehensive public teaching materials on the Setswana language, the respective text is not embedded into the published PDF, nor is the text available in any other form to our knowledge. Due to the lower quality scan and the occasionallly complex formatting, we also found existing OCR systems performed rather poorly at properly extracting the text. To better improve the quality of this text prior to its inclusion in SetsText, we convert each page to an image, perform image processing to remove noise, then utilize Florence-2 [Xiao etal., 2023] Large to extract the text from each image. This text is then sent to Llama-3-70B-Instruct, which adjusts the text in an attempt to restore the original formatting.

To further increase the size of our collective dataset, we employ machine translation to translate existing high-quality English datasets into Setswana. Specifically, we use Meta’s 3.3B parameter No Language Left Behind translation model (NLLB-200-3.3B) [NLLB Team etal., 2022] to translate portions of two English datasets to Setswana. We translate 20,000 sequences from the TinyStories dataset [Eldan and Li, 2023], which has shown success in training small language models and which we found via round-trip translation evaluations [Moon etal., 2020] to often yield high-quality translations. To ensure our models can effectively follow instructions and interact as assistants, we translate 15,201 conversations from OpenHermes-2.5 [Teknium, 2023], a corpus consisting of several open instruction-tuning datasets. However, as most of these conversations are single-turn, we also translate 5,201 conversations from Wildchat-1M [Zhao etal., 2024], a collection of one million conversations between human users and ChatGPT. We filter for conversations longer than three utterances and exclude conversations labeled as toxic or in any language other than English. To further augment our instruction tuning dataset, we utilize the translations in the training split of the MAFAND-MT dataset [Adelani etal., 2022], which consists of several human-translated parallel English-Setswana sentences from news publications. We format each translation into a user-assistant interaction, randomizing source language, target language, query language, and formatting of the corresponding question and response to increase prompt diversity.

While we find specialized machine translation models such as NLLB-200 [NLLB Team etal., 2022] and MADLAD-400 Kudugunta etal. [2023] useful for translating simpler texts, we note two problems with this methodology. First, while these models can achieve state-of-the-art translations, they typically do not preserve the formatting of the original text. Additionally, when provided with text that may be outside the training distribution, which we find to be the case with especially noisy text or text with increased technical jargon, these models may repeat the input text verbatim rather than generating a proper translation. To obtain additional high-quality, well-formatted diverse Setswana text while avoiding the problem of verbatim repetition, we utilize gpt-4o-2024-05-13 [OpenAI, 2024] to generate synthetic Setswana text. To achieve data diversity between samples, we use existing web data as a seed. First, a random subset of FineWeb and FineWeb-edu [Penedo etal., 2024] is selected. Then, given a sequence from one of these datasets, Llama-3-70B is prompted to determine five new pieces of writing that cover similar topics. For example, given a web document discussing the mathematics behind gravity, Llama 3 may suggest "A children’s book explaining the story of an apple landing on Isaac Newton’s head." Additionally, Llama 3 is prompted to select the 5 most unique words present in the provided sequence. We then cross-reference the Setswana-English split of Google Research’s GATITOS171717https://github.com/google-research/url-nlp/tree/main/gatitosGoogle Research [2024], a human-translated lexicon, and filter to only include instances where Llama 3 selected at least one English word which is present in GATITOS. 
Finally, prompts are constructed to be provided to GPT-4 consisting of: 1) A system prompt to instruct the model to write in Setswana, 2) One of the five pieces of writing, randomly selected, and 3) Each English word selected by Llama 3 which is present in GATITOS as well as its Setswana translation. We find that utilizing this methodology allows GPT-4o to generate high-quality Setswana texts while maintaining a high degree of diversity between texts.

We release a preliminary version of SetsText in its entirety on Hugging Face181818https://huggingface.co/datasets/OxxoCodes/SetsText. To assist researchers in future potential analysis and deduplication work, we provide alongside each sequence information regarding its source, including the original corpus or other category and an exact source URL where available. A full breakdown of the SetsText dataset, including data sources and token counts, is made available in Table 1.

SourceTokensSourceTokens
OPUS (NLLB)208,344,324BW 2018 Leipzig434,011
TinyStories-tsn13,328,020OPUS (Wikimedia)429,552
MADLAD-40010,952,546SABC Dikgang358,844
WildChat-tsn10,533,874MAFAND-MT328,281
Rare Tswana Books4,971,441Tlhalefang268,115
BW Hansard4,223,058Vukuzenzele232,012
NCHLT 20191,990,720Nalibali205,657
Bible1,742,663Miscellaneous193,597
OPUS (CCAligned)1,664,478Setswanabwa180,042
Department Basic Education1,628,799ZA Constitution140,479
Wikipedia1,519,214iAfrika136,673
ZA Gov Cabinet Speeches1,311,903OPUS (XLEnt)109,625
GlotCC-v1941,585SABC Motsweding66,459
ZA 2020 Leipzig471,773Math Code-Switch4,710
Leipzig Wiki454,660OPUS (Tatoeba)930

4 Training

Training is performed utilizing the Hugging Face transformers, trl, and peft libraries. Our approach involves performing continued pre-training for each model to ensure robust learning without encountering catastrophic forgetting. As opposed to performing continued fine-tuning on a base model, then performing instruction fine-tuning, we train models which have already been finetuned on instruction-following data. By utilizing a mixture of raw web text and instruction data in our training dataset, we yield models which are capable of instruction-following without the additional compute required for a separate instruction fine-tuning stage. In addition, to retain English generalization capabilities, we include a 15% misture of English FineWeb Penedo etal. [2024], OpenHermes-2.5 Teknium [2023], and Glaive Code191919https://huggingface.co/datasets/glaiveai/glaive-code-assistant[Glaive AI, 2023].

Furthermore, we train using Quantized Low Rank Adaptation (QLoRA) across the Query, Key, Value, Output, Gate, Up, and Down projection matrices. This technique not only allows for effective learning without significant computational and memory requirements, but also mitigates the risk of overfitting and catastrophic forgetting. To ensure the models can effectively adapt to the Setswana langauge, we train the embedding layers and the language modeling head in full precision.

All models are trained using DeepSpeed [Rasley etal., 2020] and ZeRO Stage 3 Rajbhandari etal. [2020] on two NVIDIA A100 80GB GPUs. We provide the hyperparameters utilized during training in Table 2.

ParameterValue
Max Seq Length2048 tokens
LoRA Alpha32
LoRA Dropout0.05
LoRA Rank64
BiasNone
Precisionbf16
OptimizerAdamW 8bit
Weight Decay0.0
Learning Rate6e-05
LR SchedulerCosine
Epochs3
Packing
Per-Device Batch Size2
Gradient Accumulation Steps8
Effective Batch Size32 (64k tokens)

5 Evaluation

In this section we discuss the steps taken to evaluate the BOTS-LM suite of models. We place a large emphasis on translation performance in this initial release, as this is a common area of focus in the literature. In future releases we will integrate additional evaluations such as news and document classification.

To evaluate translation performance, we evaluate on the MAFAND-MT [Adelani etal., 2022], Lego-MT [Yuan etal., 2023], and Flores-200 [NLLB Team etal., 2022] benchmarks. These benchmarks cover a variety of translation domains, ranging from news articles to miscellaneous crawled web documents. To measure translation performance numerically, we utilize the BLEU [Papineni etal., 2002] and CHRF Popović [2015] measurements. We indicate translating from English to Setswana with en-tn, and translating from Setswana to English with tn-en. Our results are presented in Table 3.

To further evaluate the general performance of BOTS-LM, as well as other language models, on general reasoning in Setswana, we develop MMLU-tsn. This is a machine translated NLLB Team etal. [2022] subset of approximately 1,000 questions and answers from the Massive Multitask Language Understanding (MMLU) [Hendrycks etal., 2021b, a] benchmark validation split. A translation of the test split is in development. This benchmark measures multiple-choice answering capabilities of language models on various topics such as humanities, social science, and STEM. We acknowledge this method of relying on machine translation systems to translate this text, especially given its technical nature, is likely to suffer from "translationese" and other translation errors which can impact a model’s performance on the benchmark [Plaza etal., 2024]. As such, many sequences may be incorrectly translated, biased, or subsequently impossible to solve. However, we find this translated benchmark to still be a useful proxy for a model’s performance in downstream reasoning performance in Setswana. As open models become more capable of Setswana and minute differences in scores become more important in evaluating performance, it will be necessary for benchmarks to be either human-translated or designed from the ground-up for Setswana.

ModelMMLUMAFAND-MTLego-MTFLORES-200en-tntn-enen-tntn-enen-tntn-enCHRFBLEUCHRFBLEUCHRFBLEUCHRFBLEUCHRFBLEUCHRFBLEULlama 3 Instruct (8B)35.0422.902.6029.525.7311.021.6911.611.7020.782.3727.044.15Llama-3 Instruct (70B)47.5434.977.2637.336.4211.370.5811.930.7531.475.5033.105.07Aya 23 (8B)28.4114.850.7417.171.828.490.949.521.2514.190.8616.561.33Aya 23 (35B)36.7414.950.885.9628.888.370.8910.161.1913.050.6327.095.13LlamaX3 (8B)34.0923.392.0027.555.6211.571.2913.211.9821.511.6425.124.14NLLB-200 (3.3B)57.6428.1546.9920.6623.7613.5317.104.3650.1521.7341.3016.16MADLAD-400 MT (10B)22.066.0734.8614.0616.859.6622.3314.6019.393.5131.4011.14BOTS-LM (8B)42.4254.1026.1022.943.4314.962.3011.131.5145.8018.4221.302.73

We find our 8B BOTS-LM model excels in particular at English-Setswana translations. It exceeds the translation performance of the larger Aya 23 35B and the massively multilingual LlamaX3, even exceeding Llama models nearly nine times its size. When translating from English to Setswana, it reaches performance levels competitive with dedicated machine translation systems, on the MAFAND-MT benchmark exceeding MADLAD-400 MT and approaching the performance of NLLB-200. However, we do note worsened performance when translating from Setswna back to English, likely due to imbalances in the training dataset. When evaluated on multiple-choice Setswana questions in the MMLU-tsn benchmark, BOTS-LM significantly outperforms the original Llama-3-8B-Instruct model, only being beaten by the Llama-3-70B-Instruct model. This indicates that, while the 70B model may be suffer from worse Setswana translation and writing skills, it still retains a degree of understanding and reasoning capabilities.

Our 8B BOTS-LM model demonstrates exceptional performance in English-Setswana translations, surpassing larger multilingual models such as Aya 23 35B and the massively multilingual LlamaX3, even outperforming Llama models nearly nine times its size. In English to Setswana translation, BOTS-LM achieves results competitive with specialized machine translation systems, surpassing MADLAD-400 MT on the MAFAND-MT benchmark. However, we observe performance degredation in Setswana to English ttranslations, likely due to training dataset imabalances. On the MMLU-tsn benchmark of multiple choice Setswana questions, BOTS-LM significantly outperforms the original Llama-3-8B-Instruct model, only falling short of its 70B variant. This suggests that, while the 70B model may have infereior Setswana translation and writing skills, it retains superior understanding and reasoning capabilities thanks to its size. These results highlight the effectiveness of our fine-tuning approach on the Settswana corpus, particularly in enhancing translation and language understanding tasks, and demonstrates the potential for additional training to yield significant improvements.

6 Conclusion

In this work we introduce the first release in the BOTS-LM series of bilingual language models tailored for Setswana and English. Our models demonstrate significantly improved performance in English-Setswana translation tasks, rivaling models with significantly larger parameter counts. We introduce SetsText, the largest curated dataset of primarily Setswana text to date. By leveraging efficient fine-tuning techniques, we help to bridge the gap between Setswana and other high-resource languages, laying the groundwork for future enhanced Setswana-centric NLP. Our results indicate that BOTS-LM not only excels in translation but also demonstrates promising capabilities in reasoning and understanding tasks in Setswana. Future work will focus on expanding the suite with additional models and evaluations, ultimately aiming to bring small, truly open, and high-performance language models to the research community and the millions of Setswana speakers globally.

7 Acknowledgements

We would like to extend our gratitude to the OpenAI team for their invaluable support and for granting us the opportunity to utilize their models. Our appreciation also goes to Trelis Research for their generous financial backing. Additionally, we are deeply thankful to Dr. Jacob Sorber and Professor Carrie Russell of Clemson University, and Dr. Srinath Doss of Botho University. This work would not be possible without your guidance and support.

References

References

  • Adelani etal. [2022]David Adelani, Jesujoba Alabi, Angela Fan, Julia Kreutzer, Xiaoyu Shen, Machel Reid, Dana Ruiter, Dietrich Klakow, Peter Nabende, Ernie Chang, Tajuddeen Gwadabe, Freshia Sackey, Bonaventure F.P. Dossou, Chris Emezue, Colin Leong, Michael Beukman, Shamsuddeen Muhammad, Guyo Jarso, Oreen Yousuf, Andre NiyongaboRubungo, Gilles Hacheme, EricPeter Wairagala, MuhammadUmair Nasir, Benjamin Ajibade, Tunde Ajayi, Yvonne Gitau, Jade Abbott, Mohamed Ahmed, Millicent Ochieng, Anuoluwapo Aremu, Perez Ogayo, Jonathan Mukiibi, Fatoumata OuobaKabore, Godson Kalipe, Derguene Mbaye, AllahseraAuguste Tapo, Victoire MemdjokamKoagne, Edwin Munkoh-Buabeng, Valencia Wagner, Idris Abdulmumin, Ayodele Awokoya, Happy Buzaaba, Blessing Sibanda, Andiswa Bukula, and Sam Manthalu.A few thousand translations go a long way! leveraging pre-trained models for African news translation.In Marine Carpuat, Marie-Catherine deMarneffe, and IvanVladimir MezaRuiz, editors, Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3053–3070, Seattle, United States, July 2022. Association for Computational Linguistics.doi: 10.18653/v1/2022.naacl-main.223.URL https://aclanthology.org/2022.naacl-main.223.
  • Agić and Vulić [2019]Željko Agić and Ivan Vulić.JW300: A wide-coverage parallel corpus for low-resource languages.In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3204–3210, Florence, Italy, July 2019. Association for Computational Linguistics.doi: 10.18653/v1/P19-1310.URL https://aclanthology.org/P19-1310.
  • Alabi etal. [2022]JesujobaO. Alabi, DavidIfeoluwa Adelani, Marius Mosbach, and Dietrich Klakow.Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning.In Proceedings of the 29th International Conference on Computational Linguistics, pages 4336–4349, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics.URL https://aclanthology.org/2022.coling-1.382.
  • Anthropic [2024]Anthropic.Claude 3.5 sonnet, 2024.URL https://www.anthropic.com/news/claude-3-5-sonnet.
  • Aryabumi etal. [2024]Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, JonAnder Campos, YiChern Tan, Kelly Marchisio, Max Bartolo, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Nick Frosst, Aidan Gomez, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker.Aya 23: Open weight releases to further multilingual progress, 2024.
  • Bennett et al. [2016] Wm. G. Bennett, Maxine Diemer, Justine Kerford, Tracy Probert, and Tsholofelo Wesi. Setswana (South African). Journal of the International Phonetic Association, 46(2):235–246, 2016. doi: 10.1017/S0025100316000050.
  • BigScience Workshop etal. [2022]BigScience Workshop, :, TevenLe Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, AlexandraSasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, AlexanderM. Rush, Stella Biderman, Albert Webson, PawanSasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, AlbertVillanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, IzBeltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, PedroOrtiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, AlhamFikri Aji, Amit Alfassy, Anna Rogers, ArielKreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, DavidIfeoluwa Adelani, Dragomir Radev, EduardoGonzález Ponferrada, Efrat Levkovizh, Ethan Kim, EyalBar Natan, FrancescoDe Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran,Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier dela Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, LeandroVon Werra, Leon Weber, Long Phan, LoubnaBen allal, Ludovic Tanguy, Manan Dey, ManuelRomero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, MinhChien Vu, MohammadA. 
Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. 
Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, HyungWon Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick vonPlaten, Pierre Cornette, PierreFrançois Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, GentaIndra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, JessicaZosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar vander Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, AntigonaUnldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, CarlosMuñoz Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis David, Douwe Kiela, DuongA. 
Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, JulioBonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, AbhinavRamesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, BoWang, Caio Brito,Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, DanielLeón Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, HelenaU. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, JoseDavid Posada, KarthikRangasai Sivaraman, Lokesh Bulchandani, LuLiu, Luisa Shinzato, MadeleineHahn deBykhovetz, Maiko Takeuchi, Marc Pàmies, MariaA Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, MichielDe Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, NicholasMichio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, ShlokS Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, TomoyaKainuma, Wojciech Kusa, Yanis Labrak, YashShailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, YuXu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf.Bloom: A 176b-parameter open-access 
multilingual language model, 2022.
  • Caswell et al. [2021] Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Javier Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Q. Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, and Mofetoluwa Adeyemi. Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. arXiv e-prints, art. arXiv:2103.12028, March 2021.
  • Conneau et al. [2019] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale, 2019.
  • Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell, Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin, Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang, Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Gregoire Mialon, Guan Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov, Imanol Arrieta Ibarra, Isabel Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock, Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu, Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia, Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, Khalid El-Arini, Krithika Iyer, Kshitiz Malik, Kuenley Chiu, Kunal Bhalla, Lauren Rantala-Yeary, Laurens van der Maaten, Lawrence Chen, Liang Tan, Liz Jenkins, Louis Martin, Lovish Madaan, Lubo Malo, Lukas Blecher, Lukas Landzaat, Luke de Oliveira, Madeline Muzzi, Mahesh Pasupuleti, Mannat Singh, Manohar Paluri, Marcin Kardas, Mathew Oldham, Mathieu Rita, Maya Pavlova, Melanie Kambadur, Mike Lewis, Min Si, Mitesh Kumar Singh, Mona Hassan, Naman Goyal, Narjes Torabi, Nikolay 
Bashlykov, Nikolay Bogoychev, Niladri Chatterji, Olivier Duchenne, Onur Çelebi, Patrick Alrassy, Pengchuan Zhang, Pengwei Li, Petar Vasic, Peter Weng, Prajjwal Bhargava, Pratik Dubal, Praveen Krishnan, Punit Singh Koura, Puxin Xu, Qing He, Qingxiao Dong, Ragavan Srinivasan, Raj Ganapathy, Ramon Calderer, Ricardo Silveira Cabral, Robert Stojnic, Roberta Raileanu, Rohit Girdhar, Rohit Patel, Romain Sauvestre, Ronnie Polidoro, Roshan Sumbaly, Ross Taylor, Ruan Silva, Rui Hou, Rui Wang, Saghar Hosseini, Sahana Chennabasappa, Sanjay Singh, Sean Bell, Seohyun Sonia Kim, Sergey Edunov, Shaoliang Nie, Sharan Narang, Sharath Raparthy, Sheng Shen, Shengye Wan, Shruti Bhosale, Shun Zhang, Simon Vandenhende, Soumya Batra, Spencer Whitman, Sten Sootla, Stephane Collot, Suchin Gururangan, Sydney Borodinsky, Tamar Herman, Tara Fowler, Tarek Sheasha, Thomas Georgiou, Thomas Scialom, Tobias Speckbacher, Todor Mihaylov, Tong Xiao, Ujjwal Karn, Vedanuj Goswami, Vibhor Gupta, Vignesh Ramanathan, Viktor Kerkez, Vincent Gonguet, Virginie Do, Vish Vogeti, Vladan Petrovic, Weiwei Chu, Wenhan Xiong, Wenyin Fu, Whitney Meers, Xavier Martinet, Xiaodong Wang, Xiaoqing Ellen Tan, Xinfeng Xie, Xuchao Jia, Xuewei Wang, Yaelle Goldschlag, Yashesh Gaur, Yasmine Babaei, Yi Wen, Yiwen Song, Yuchen Zhang, Yue Li, Yuning Mao, Zacharie Delpierre Coudert, Zheng Yan, Zhengxing Chen, Zoe Papakipos, Aaditya Singh, Aaron Grattafiori, Abha Jain, Adam Kelsey, Adam Shajnfeld, Adithya Gangidi, Adolfo Victoria, Ahuva Goldstand, Ajay Menon, Ajay Sharma, Alex Boesenberg, Alex Vaughan, Alexei Baevski, Allie Feinstein, Amanda Kallet, Amit Sangani, Anam Yunus, Andrei Lupu, Andres Alvarado, Andrew Caples, Andrew Gu, Andrew Ho, Andrew Poulton, Andrew Ryan, Ankit Ramchandani, Annie Franco, Aparajita Saraf, Arkabandhu Chowdhury, Ashley Gabriel, Ashwin Bharambe, Assaf Eisenman, Azadeh Yazdan, Beau James, Ben Maurer, Benjamin Leonhardi, Bernie Huang, Beth Loyd, Beto De Paola, Bhargavi Paranjape, Bing Liu, Bo Wu, Boyu Ni, Braden 
Hancock, Bram Wasti, Brandon Spence, Brani Stojkovic, Brian Gamido, Britt Montalvo, Carl Parker, Carly Burton, Catalina Mejia, Changhan Wang, Changkyu Kim, Chao Zhou, Chester Hu, Ching-Hsiang Chu, Chris Cai, Chris Tindal, Christoph Feichtenhofer, Damon Civin, Dana Beaty, Daniel Kreymer, Daniel Li, Danny Wyatt, David Adkins, David Xu, Davide Testuggine, Delia David, Devi Parikh, Diana Liskovich, Didem Foss, Dingkang Wang, Duc Le, Dustin Holland, Edward Dowling, Eissa Jamil, Elaine Montgomery, Eleonora Presani, Emily Hahn, Emily Wood, Erik Brinkman, Esteban Arcaute, Evan Dunbar, Evan Smothers, Fei Sun, Felix Kreuk, Feng Tian, Firat Ozgenel, Francesco Caggioni, Francisco Guzmán, Frank Kanayet, Frank Seide, Gabriela Medina Florez, Gabriella Schwarz, Gada Badeer, Georgia Swee, Gil Halpern, Govind Thattai, Grant Herman, Grigory Sizov, Guangyi Zhang, Guna Lakshminarayanan, Hamid Shojanazeri, Han Zou, Hannah Wang, Hanwen Zha, Haroun Habeeb, Harrison Rudolph, Helen Suk, Henry Aspegren, Hunter Goldman, Igor Molybog, Igor Tufanov, Irina-Elena Veliche, Itai Gat, Jake Weissman, James Geboski, James Kohli, Japhet Asher, Jean-Baptiste Gaya, Jeff Marcus, Jeff Tang, Jennifer Chan, Jenny Zhen, Jeremy Reizenstein, Jeremy Teboul, Jessica Zhong, Jian Jin, Jingyi Yang, Joe Cummings, Jon Carvill, Jon Shepard, Jonathan McPhie, Jonathan Torres, Josh Ginsburg, Junjie Wang, Kai Wu, Kam Hou U, Karan Saxena, Karthik Prasad, Kartikay Khandelwal, Katayoun Zand, Kathy Matosich, Kaushik Veeraraghavan, Kelly Michelena, Keqian Li, Kun Huang, Kunal Chawla, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, Lavender A, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, Licheng Yu, Liron Moshkovich, Luca Wehrstedt, Madian Khabsa, Manav Avalani, Manish Bhatt, Maria Tsimpoukelli, Martynas Mankus, Matan Hasson, Matthew Lennie, Matthias Reso, Maxim Groshev, Maxim Naumov, Maya Lathi, Meghan Keneally, Michael L. 
Seltzer, Michal Valko, Michelle Restrepo, Mihir Patel, Mik Vyatskov, Mikayel Samvelyan, Mike Clark, Mike Macey, Mike Wang, MiquelJubert Hermoso, MoMetanat, Mohammad Rastegari, Munish Bansal, Nandhini Santhanam, NataschaParks, Natasha White, Navyata Bawa, Nayan Singhal, Nick Egebo, Nicolas Usunier, NikolayPavlovich Laptev, Ning Dong, Ning Zhang, Norman Cheng, Oleg Chernoguz, Olivia Hart, Omkar Salpekar, Ozlem Kalinli, Parkin Kent, Parth Parekh, Paul Saab, Pavan Balaji, Pedro Rittner, Philip Bontrager, Pierre Roux, Piotr Dollar, Polina Zvyagina, Prashant Ratanchandani, Pritish Yuvraj, Qian Liang, Rachad Alao, Rachel Rodriguez, Rafi Ayub, Raghotham Murthy, Raghu Nayani, Rahul Mitra, Raymond Li, Rebekkah Hogan, Robin Battey, Rocky Wang, Rohan Maheswari, Russ Howes, Ruty Rinott, SaiJayesh Bondu, Samyak Datta, Sara Chugh, Sara Hunt, Sargun Dhillon, Sasha Sidorov, Satadru Pan, Saurabh Verma, Seiji Yamamoto, Sharadh Ramaswamy, Shaun Lindsay, Shaun Lindsay, Sheng Feng, Shenghao Lin, ShengxinCindy Zha, Shiva Shankar, Shuqiang Zhang, Shuqiang Zhang, Sinong Wang, Sneha Agarwal, Soji Sajuyigbe, Soumith Chintala, Stephanie Max, Stephen Chen, Steve Kehoe, Steve Satterfield, Sudarshan Govindaprasad, Sumit Gupta, Sungmin Cho, SunnyVirk, Suraj Subramanian, SyChoudhury, Sydney Goldman, Tal Remez, Tamar Glaser, Tamara Best, Thilo Kohler, Thomas Robinson, Tianhe Li, Tianjun Zhang, Tim Matthews, Timothy Chou, Tzook Shaked, Varun Vontimitta, Victoria Ajayi, Victoria Montanez, Vijai Mohan, VinaySatish Kumar, Vishal Mangla, Vlad Ionescu, Vlad Poenaru, VladTiberiu Mihailescu, Vladimir Ivanov, Wei Li, Wenchen Wang, Wenwen Jiang, Wes Bouaziz, Will Constable, Xiaocheng Tang, Xiaofang Wang, Xiaojian Wu, Xiaolan Wang, Xide Xia, Xilun Wu, Xinbo Gao, Yanjun Chen, YeHu, YeJia, YeQi, Yenda Li, Yilin Zhang, Ying Zhang, Yossi Adi, Youngjin Nam, Yu, Wang, Yuchen Hao, Yundi Qian, Yuzi He, Zach Rait, Zachary DeVito, Zef Rosnbrick, Zhaoduo Wen, Zhenyu Yang, and Zhiwei Zhao.The llama 3 herd of models, 2024.
  • Eiselen and Puttkammer [2014] Roald Eiselen and Martin Puttkammer. Developing text resources for ten South African languages. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), pages 3698–3703, Reykjavik, Iceland, May 2014. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2014/pdf/1151_Paper.pdf.
  • Eiselen et al. [2023] Roald Eiselen, Rico Koen, Albertus Kruger, and Jacques van Heerden. NCHLT Setswana RoBERTa language model, 2023. URL https://hdl.handle.net/20.500.12185/641.
  • El-Kishky et al. [2020] Ahmed El-Kishky, Vishrav Chaudhary, Francisco Guzmán, and Philipp Koehn. CCAligned: A massive collection of cross-lingual web-document pairs. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP 2020), pages 5960–5969, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.480. URL https://www.aclweb.org/anthology/2020.emnlp-main.480.
  • El-Kishky et al. [2021] Ahmed El-Kishky, Adithya Renduchintala, James Cross, Francisco Guzmán, and Philipp Koehn. XLEnt: Mining a large cross-lingual entity dataset with lexical-semantic-phonetic word alignment. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10424–10430, 2021.
  • Eldan and Li [2023] Ronen Eldan and Yuanzhi Li. TinyStories: How small can language models be and still speak coherent English?, 2023.
  • Fan et al. [2021] Angela Fan, Shruti Bhosale, Holger Schwenk, Zhiyi Ma, Ahmed El-Kishky, Siddharth Goyal, Mandeep Baines, Onur Celebi, Guillaume Wenzek, Vishrav Chaudhary, Naman Goyal, Tom Birch, Vitaliy Liptchinsky, Sergey Edunov, Michael Auli, and Armand Joulin. Beyond English-centric multilingual machine translation. Journal of Machine Learning Research, 22(107):1–48, 2021. URL http://jmlr.org/papers/v22/20-1307.html.
  • Gemini Team etal. [2023]Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, AndrewM. Dai, Anja Hauth, Katie Millican, David Silver, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, PaulR. Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, Jack Krawczyk, Cosmo Du, EdChi, Heng-Tze Cheng, Eric Ni, Purvi Shah, Patrick Kane, Betty Chan, Manaal Faruqui, Aliaksei Severyn, Hanzhao Lin, YaGuang Li, Yong Cheng, Abe Ittycheriah, Mahdis Mahdieh, Mia Chen, Pei Sun, Dustin Tran, Sumit Bagri, Balaji Lakshminarayanan, Jeremiah Liu, Andras Orban, Fabian Güra, Hao Zhou, Xinying Song, Aurelien Boffy, Harish Ganapathy, Steven Zheng, HyunJeong Choe, Ágoston Weisz, Tao Zhu, Yifeng Lu, Siddharth Gopal, Jarrod Kahn, Maciej Kula, JeffPitman, Rushin Shah, Emanuel Taropa, MajdAl Merey, Martin Baeuml, Zhifeng Chen, LaurentEl Shafey, Yujing Zhang, Olcan Sercinoglu, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, Alexandre Frechette, Charlotte Smith, Laura Culp, Lev Proleev, YiLuan, XiChen, James Lottes, Nathan Schucher, Federico Lebron, Alban Rrustemi, Natalie Clay, Phil Crone, Tomas Kocisky, Jeffrey Zhao, Bartek Perz, Dian Yu, Heidi Howard, Adam Bloniarz, JackW. 
Rae, Han Lu, Laurent Sifre, Marcello Maggioni, Fred Alcober, Dan Garrette, Megan Barnes, Shantanu Thakoor, Jacob Austin, Gabriel Barth-Maron, William Wong, Rishabh Joshi, Rahma Chaabouni, Deeni Fatiha, Arun Ahuja, GauravSingh Tomar, Evan Senter, Martin Chadwick, Ilya Kornakov, Nithya Attaluri, Iñaki Iturrate, Ruibo Liu, Yunxuan Li, Sarah Cogan, Jeremy Chen, Chao Jia, Chenjie Gu, Qiao Zhang,Jordan Grimstad, AleJakse Hartman, Xavier Garcia, ThanumalayanSankaranarayana Pillai, Jacob Devlin, Michael Laskin, Diego deLasCasas, Dasha Valter, Connie Tao, Lorenzo Blanco, AdriàPuigdomènech Badia, David Reitter, Mianna Chen, Jenny Brennan, Clara Rivera, Sergey Brin, Shariq Iqbal, Gabriela Surita, Jane Labanowski, Abhi Rao, Stephanie Winkler, Emilio Parisotto, Yiming Gu, Kate Olszewska, Ravi Addanki, Antoine Miech, Annie Louis, Denis Teplyashin, Geoff Brown, Elliot Catt, Jan Balaguer, Jackie Xiang, Pidong Wang, Zoe Ashwood, Anton Briukhov, Albert Webson, Sanjay Ganapathy, Smit Sanghavi, Ajay Kannan, Ming-Wei Chang, Axel Stjerngren, Josip Djolonga, Yuting Sun, Ankur Bapna, Matthew Aitchison, Pedram Pejman, Henryk Michalewski, Tianhe Yu, Cindy Wang, Juliette Love, Junwhan Ahn, Dawn Bloxwich, Kehang Han, Peter Humphreys, Thibault Sellam, James Bradbury, Varun Godbole, Sina Samangooei, Bogdan Damoc, Alex Kaskasoli, Sébastien M.R. 
Arnold, Vijay Vasudevan, Shubham Agrawal, Jason Riesa, DmitryLepikhin, Richard Tanburn, Srivatsan Srinivasan, Hyeontaek Lim, Sarah Hodkinson, Pranav Shyam, Johan Ferret, Steven Hand, Ankush Garg, TomLe Paine, Jian Li, Yujia Li, Minh Giang, Alexander Neitz, Zaheer Abbas, Sarah York, Machel Reid, Elizabeth Cole, Aakanksha Chowdhery, Dipanjan Das, Dominika Rogozińska, Vitaliy Nikolaev, Pablo Sprechmann, Zachary Nado, Lukas Zilka, Flavien Prost, Luheng He, Marianne Monteiro, Gaurav Mishra, Chris Welty, Josh Newlan, Dawei Jia, Miltiadis Allamanis, ClaraHuiyi Hu, Raoul deLiedekerke, Justin Gilmer, Carl Saroufim, Shruti Rijhwani, Shaobo Hou, Disha Shrivastava, Anirudh Baddepudi, Alex Goldin, Adnan Ozturel, Albin Cassirer, Yunhan Xu, Daniel Sohn, Devendra Sachan, ReinaldKim Amplayo, Craig Swanson, Dessie Petrova, Shashi Narayan, Arthur Guez, Siddhartha Brahma, Jessica Landon, Miteyan Patel, Ruizhe Zhao, Kevin Villela, Luyu Wang, Wenhao Jia, Matthew Rahtz, Mai Giménez, Legg Yeung, James Keeling, Petko Georgiev, Diana Mincu, Boxi Wu, Salem Haykal, Rachel Saputro, KiranVodrahalli, James Qin, Zeynep Cankara, Abhanshu Sharma, Nick Fernando, Will Hawkins, Behnam Neyshabur, Solomon Kim, Adrian Hutter, Priyanka Agrawal, Alex Castro-Ros, George vanden Driessche, Tao Wang, Fan Yang, Shuo yiin Chang, Paul Komarek, Ross McIlroy, Mario Lučić, Guodong Zhang, Wael Farhan, Michael Sharman, Paul Natsev, Paul Michel, Yamini Bansal, Siyuan Qiao, Kris Cao, Siamak Shakeri, Christina Butterfield, Justin Chung, PaulKishan Rubenstein, Shivani Agrawal, Arthur Mensch, Kedar Soparkar, Karel Lenc, Timothy Chung, Aedan Pope, Loren Maggiore, Jackie Kay, Priya Jhakra, Shibo Wang, Joshua Maynez, Mary Phuong, Taylor Tobin, Andrea Tacchetti, Maja Trebacz, Kevin Robinson, Yash Katariya, Sebastian Riedel, Paige Bailey, Kefan Xiao, Nimesh Ghelani, Lora Aroyo, Ambrose Slone, Neil Houlsby, Xuehan Xiong, Zhen Yang, Elena Gribovskaya, Jonas Adler, Mateo Wirth, Lisa Lee, Music Li, Thais Kagohara, Jay Pavagadhi, 
Sophie Bridgers, Anna Bortsova, Sanjay Ghemawat, Zafarali Ahmed, Tianqi Liu, Richard Powell,Vijay Bolina, Mariko Iinuma, Polina Zablotskaia, James Besley, Da-Woon Chung, Timothy Dozat, Ramona Comanescu, Xiance Si, Jeremy Greer, Guolong Su, Martin Polacek, RaphaëlLopez Kaufman, Simon Tokumine, Hexiang Hu, Elena Buchatskaya, Yingjie Miao, Mohamed Elhawaty, Aditya Siddhant, Nenad Tomasev, Jinwei Xing, Christina Greer, Helen Miller, Shereen Ashraf, Aurko Roy, Zizhao Zhang, Ada Ma, Angelos Filos, Milos Besta, Rory Blevins, Ted Klimenko, Chih-Kuan Yeh, Soravit Changpinyo, Jiaqi Mu, Oscar Chang, Mantas Pajarskas, Carrie Muir, Vered Cohen, CharlineLe Lan, Krishna Haridasan, Amit Marathe, Steven Hansen, Sholto Douglas, Rajkumar Samuel, Mingqiu Wang, Sophia Austin, Chang Lan, Jiepu Jiang, Justin Chiu, JaimeAlonso Lorenzo, LarsLowe Sjösund, Sébastien Cevey, Zach Gleicher, Thi Avrahami, Anudhyan Boral, Hansa Srinivasan, Vittorio Selo, Rhys May, Konstantinos Aisopos, Léonard Hussenot, LivioBaldini Soares, Kate Baumli, MichaelB. 
Chang, Adrià Recasens, Ben Caine, Alexander Pritzel, Filip Pavetic,Fabio Pardo, Anita Gergely, Justin Frye, Vinay Ramasesh, Dan Horgan, Kartikeya Badola, Nora Kassner, Subhrajit Roy, Ethan Dyer, VíctorCampos Campos, Alex Tomala, Yunhao Tang, DaliaEl Badawy, Elspeth White, Basil Mustafa, Oran Lang, Abhishek Jindal, Sharad Vikram, Zhitao Gong, Sergi Caelles, Ross Hemsley, Gregory Thornton, Fangxiaoyu Feng, Wojciech Stokowiec, CeZheng, Phoebe Thacker, Çağlar Ünlü, Zhishuai Zhang, Mohammad Saleh, James Svensson, Max Bileschi, Piyush Patil, Ankesh Anand, Roman Ring, Katerina Tsihlas, Arpi Vezer, Marco Selvi, Toby Shevlane, Mikel Rodriguez, Tom Kwiatkowski, Samira Daruki, Keran Rong, Allan Dafoe, Nicholas FitzGerald, Keren Gu-Lemberg, Mina Khan, LisaAnne Hendricks, Marie Pellat, Vladimir Feinberg, James Cobon-Kerr, Tara Sainath, Maribeth Rauh, SayedHadi Hashemi, Richard Ives, Yana Hasson, Eric Noland, Yuan Cao, Nathan Byrd, LeHou, Qingze Wang, Thibault Sottiaux, Michela Paganini, Jean-Baptiste Lespiau, Alexandre Moufarek, Samer Hassan, Kaushik Shivakumar, Joost vanAmersfoort, Amol Mandhane, Pratik Joshi, Anirudh Goyal, Matthew Tung, Andrew Brock, Hannah Sheahan, Vedant Misra, Cheng Li, Nemanja Rakićević, Mostafa Dehghani, Fangyu Liu, Sid Mittal, Junhyuk Oh, Seb Noury, Eren Sezener, Fantine Huot, Matthew Lamm, NicolaDe Cao, Charlie Chen, Sidharth Mudgal, Romina Stella, Kevin Brooks, Gautam Vasudevan, Chenxi Liu, Mainak Chain, Nivedita Melinkeri, Aaron Cohen, Venus Wang, Kristie Seymore, Sergey Zubkov, Rahul Goel, Summer Yue, Sai Krishnakumaran, Brian Albert, Nate Hurley, Motoki Sano, Anhad Mohananey, Jonah Joughin, Egor Filonov, Tomasz Kępa, Yomna Eldawy, Jiawern Lim, Rahul Rishi, Shirin Badiezadegan, Taylor Bos, Jerry Chang, Sanil Jain, Sri GayatriSundara Padmanabhan, Subha Puttagunta, Kalpesh Krishna, Leslie Baker, Norbert Kalb, Vamsi Bedapudi, Adam Kurzrok, Shuntong Lei, Anthony Yu, Oren Litvin, Xiang Zhou, Zhichun Wu, Sam Sobell, Andrea Siciliano, Alan Papir, Robby 
Neale, Jonas Bragagnolo, Tej Toor, Tina Chen, Valentin Anklin, Feiran Wang, Richie Feng, MiladGholami, Kevin Ling, Lijuan Liu, Jules Walter, Hamid Moghaddam, Arun Kishore, Jakub Adamek, Tyler Mercado, Jonathan Mallinson, Siddhinita Wandekar, Stephen Cagle, Eran Ofek, Guillermo Garrido, Clemens Lombriser, Maksim Mukha, Botu Sun, HafeezulRahman Mohammad, Josip Matak, Yadi Qian, Vikas Peswani, Pawel Janus, Quan Yuan, Leif Schelin, Oana David, Ankur Garg, Yifan He, Oleksii Duzhyi, Anton Älgmyr, Timothée Lottaz, QiLi, Vikas Yadav, Luyao Xu, Alex Chinien, Rakesh Shivanna, Aleksandr Chuklin, Josie Li, Carrie Spadine, Travis Wolfe, Kareem Mohamed, Subhabrata Das, Zihang Dai, Kyle He, Daniel von Dincklage, Shyam Upadhyay, Akanksha Maurya, Luyan Chi, Sebastian Krause, Khalid Salama, PamG Rabinovitch, Pavan KumarReddy M, Aarush Selvan, Mikhail Dektiarev, Golnaz Ghiasi, Erdem Guven, Himanshu Gupta, Boyi Liu, Deepak Sharma, IdanHeimlich Shtacher, Shachi Paul, Oscar Akerlund, François-Xavier Aubet, Terry Huang, Chen Zhu, Eric Zhu, Elico Teixeira, Matthew Fritze, Francesco Bertolini, Liana-EleonoraMarinescu, Martin Bölle, Dominik Paulus, Khyatti Gupta, Tejasi Latkar, Max Chang, Jason Sanders, Roopa Wilson, Xuewei Wu, Yi-Xuan Tan, LamNguyen Thiet, Tulsee Doshi, Sid Lall, Swaroop Mishra, Wanming Chen, Thang Luong, Seth Benjamin, Jasmine Lee, Ewa Andrejczuk, Dominik Rabiej, Vipul Ranjan, Krzysztof Styrc, Pengcheng Yin, Jon Simon, MalcolmRose Harriott, Mudit Bansal, Alexei Robsky, Geoff Bacon, David Greene, Daniil Mirylenka, Chen Zhou, Obaid Sarvana, Abhimanyu Goyal, Samuel Andermatt, Patrick Siegler, Ben Horn, Assaf Israel, Francesco Pongetti, Chih-Wei"Louis" Chen, Marco Selvatici, Pedro Silva, Kathie Wang, Jackson Tolins, Kelvin Guu, Roey Yogev, Xiaochen Cai, Alessandro Agostini, Maulik Shah, Hung Nguyen, NoahÓ Donnaile, Sébastien Pereira, Linda Friso, Adam Stambler, Adam Kurzrok, Chenkai Kuang, Yan Romanikhin, Mark Geller, ZJYan, Kane Jang, Cheng-Chun Lee, Wojciech Fica, Eric 
Malmi, Qijun Tan, Dan Banica, Daniel Balle, Ryan Pham, Yanping Huang, Diana Avram, Hongzhi Shi, Jasjot Singh, ChrisHidey, Niharika Ahuja, Pranab Saxena, Dan Dooley, SrividyaPranavi Potharaju, Eileen O’Neill, Anand Gokulchandran, Ryan Foley, Kai Zhao, Mike Dusenberry, Yuan Liu, Pulkit Mehta, Ragha Kotikalapudi, Chalence Safranek-Shrader, Andrew Goodman, Joshua Kessinger, Eran Globen, Prateek Kolhar, Chris Gorgolewski, Ali Ibrahim, Yang Song, Ali Eichenbaum, Thomas Brovelli, Sahitya Potluri, Preethi Lahoti, Cip Baetu, Ali Ghorbani, Charles Chen, Andy Crawford, Shalini Pal, Mukund Sridhar, Petru Gurita, Asier Mujika, Igor Petrovski, Pierre-Louis Cedoz, Chenmei Li, Shiyuan Chen, NiccolòDal Santo, Siddharth Goyal, Jitesh Punjabi, Karthik Kappaganthu, Chester Kwak, Pallavi LV, Sarmishta Velury, Himadri Choudhury, Jamie Hall, Premal Shah, Ricardo Figueira, Matt Thomas, Minjie Lu, Ting Zhou, Chintu Kumar, Thomas Jurdi, Sharat Chikkerur, Yenai Ma, Adams Yu, Soo Kwak, Victor Ähdel, Sujeevan Rajayogam, Travis Choma, Fei Liu, Aditya Barua, Colin Ji, JiHo Park, Vincent Hellendoorn, Alex Bailey, Taylan Bilal, Huanjie Zhou,Mehrdad Khatir, Charles Sutton, Wojciech Rzadkowski, Fiona Macintosh, Konstantin Shagin, Paul Medina, Chen Liang, Jinjing Zhou, Pararth Shah, Yingying Bi, Attila Dankovics, Shipra Banga, Sabine Lehmann, Marissa Bredesen, Zifan Lin, JohnEric Hoffmann, Jonathan Lai, Raynald Chung, Kai Yang, Nihal Balani, Arthur Bražinskas, Andrei Sozanschi, Matthew Hayes, HéctorFernández Alcalde, Peter Makarov, Will Chen, Antonio Stella, Liselotte Snijders, Michael Mandl, Ante Kärrman, Paweł Nowak, Xinyi Wu, Alex Dyck, Krishnan Vaidyanathan, Raghavender R, Jessica Mallet, Mitch Rudominer, Eric Johnston, Sushil Mittal, Akhil Udathu, Janara Christensen, Vishal Verma, Zach Irving, Andreas Santucci, Gamaleldin Elsayed, Elnaz Davoodi, Marin Georgiev, Ian Tenney, Nan Hua, Geoffrey Cideron, Edouard Leurent, Mahmoud Alnahlawi, Ionut Georgescu, Nan Wei, Ivy Zheng, Dylan Scandinaro, 
Heinrich Jiang, Jasper Snoek, Mukund Sundararajan, Xuezhi Wang, Zack Ontiveros, Itay Karo, Jeremy Cole, Vinu Rajashekhar, Lara Tumeh, Eyal Ben-David, Rishub Jain, Jonathan Uesato, Romina Datta, Oskar Bunyan, Shimu Wu, John Zhang, Piotr Stanczyk, Ye Zhang, David Steiner, Subhajit Naskar, Michael Azzam, Matthew Johnson, Adam Paszke, Chung-Cheng Chiu, Jaume Sanchez Elias, Afroz Mohiuddin, Faizan Muhammad, Jin Miao, Andrew Lee, Nino Vieillard, Jane Park, Jiageng Zhang, Jeff Stanway, Drew Garmon, Abhijit Karmarkar, Zhe Dong, Jong Lee, Aviral Kumar, Luowei Zhou, Jonathan Evens, William Isaac, Geoffrey Irving, Edward Loper, Michael Fink, Isha Arkatkar, Nanxin Chen, Izhak Shafran, Ivan Petrychenko, Zhe Chen, Johnson Jia, Anselm Levskaya, Zhenkai Zhu, Peter Grabowski, Yu Mao, Alberto Magni, Kaisheng Yao, Javier Snaider, Norman Casagrande, Evan Palmer, Paul Suganthan, Alfonso Castaño, Irene Giannoumis, Wooyeol Kim, Mikołaj Rybiński, Ashwin Sreevatsa, Jennifer Prendki, David Soergel, Adrian Goedeckemeyer, Willi Gierke, Mohsen Jafari, Meenu Gaba, Jeremy Wiesner, Diana Gage Wright, Yawen Wei, Harsha Vashisht, Yana Kulizhskaya, Jay Hoover, Maigo Le, Lu Li, Chimezie Iwuanyanwu, Lu Liu, Kevin Ramirez, Andrey Khorlin, Albert Cui, Tian Lin, Marcus Wu, Ricardo Aguilar, Keith Pallo, Abhishek Chakladar, Ginger Perng, Elena Allica Abellan, Mingyang Zhang, Ishita Dasgupta, Nate Kushman, Ivo Penchev, Alena Repina, Xihui Wu, Tom van der Weide, Priya Ponnapalli, Caroline Kaplan, Jiri Simsa, Shuangfeng Li, Olivier Dousse, Fan Yang, Jeff Piper, Nathan Ie, Rama Pasumarthi, Nathan Lintz, Anitha Vijayakumar, Daniel Andor, Pedro Valenzuela, Minnie Lui, Cosmin Paduraru, Daiyi Peng, Katherine Lee, Shuyuan Zhang, Somer Greene, Duc Dung Nguyen, Paula Kurylowicz, Cassidy Hardin, Lucas Dixon, Lili Janzer, Kiam Choo, Ziqiang Feng, Biao Zhang, Achintya Singhal, Dayou Du, Dan McKinnon, Natasha Antropova, Tolga Bolukbasi, Orgad Keller, David Reid, Daniel Finchelstein, Maria Abi Raad, Remi Crocker, Peter Hawkins, 
Robert Dadashi, Colin Gaffney, Ken Franko, Anna Bulanova, Rémi Leblond, Shirley Chung, Harry Askham, Luis C. Cobo, Kelvin Xu, Felix Fischer, Jun Xu, Christina Sorokin, Chris Alberti, Chu-Cheng Lin, Colin Evans, Alek Dimitriev, Hannah Forbes, Dylan Banarse, Zora Tung, Mark Omernick, Colton Bishop, Rachel Sterneck, Rohan Jain, Jiawei Xia, Ehsan Amid, Francesco Piccinno, Xingyu Wang, Praseem Banzal, Daniel J. Mankowitz, Alex Polozov, Victoria Krakovna, Sasha Brown, Mohammad Hossein Bateni, Dennis Duan, Vlad Firoiu, Meghana Thotakuri, Tom Natan, Matthieu Geist, Sertan Girgin, Hui Li, Jiayu Ye, Ofir Roval, Reiko Tojo, Michael Kwong, James Lee-Thorp, Christopher Yew, Danila Sinopalnikov, Sabela Ramos, John Mellor, Abhishek Sharma, Kathy Wu, David Miller, Nicolas Sonnerat, Denis Vnukov, Rory Greig, Jennifer Beattie, Emily Caveness, Libin Bai, Julian Eisenschlos, Alex Korchemniy, Tomy Tsai, Mimi Jasarevic, Weize Kong, Phuong Dao, Zeyu Zheng, Frederick Liu, Fan Yang, Rui Zhu, Tian Huey Teh, Jason Sanmiya, Evgeny Gladchenko, Nejc Trdin, Daniel Toyama, Evan Rosen, Sasan Tavakkol, Linting Xue, Chen Elkind, Oliver Woodman, John Carpenter, George Papamakarios, Rupert Kemp, Sushant Kafle, Tanya Grunina, Rishika Sinha, Alice Talbert, Diane Wu, Denese Owusu-Afriyie, Cosmo Du, Chloe Thornton, Jordi Pont-Tuset, Pradyumna Narayana, Jing Li, Saaber Fatehi, John Wieting, Omar Ajmeri, Benigno Uria, Yeongil Ko, Laura Knight, Amélie Héliou, Ning Niu, Shane Gu, Chenxi Pang, Yeqing Li, Nir Levine, Ariel Stolovich, Rebeca Santamaria-Fernandez, Sonam Goenka, Wenny Yustalim, Robin Strudel, Ali Elqursh, Charlie Deck, Hyo Lee, Zonglin Li, Kyle Levin, Raphael Hoffmann, Dan Holtmann-Rice, Olivier Bachem, Sho Arora, Christy Koh, Soheil Hassas Yeganeh, Siim Põder, Mukarram Tariq, Yanhua Sun, Lucian Ionita, Mojtaba Seyedhosseini, Pouya Tafti, Zhiyu Liu, Anmol Gulati, Jasmine Liu, Xinyu Ye, Bart Chrzaszcz, Lily Wang, Nikhil Sethi, Tianrun Li, Ben Brown, Shreya Singh, Wei Fan, Aaron Parisi, Joe Stanton, Vinod Koverkathu, Christopher A. Choquette-Choo, Yunjie Li, TJ Lu, Abe Ittycheriah, Prakash Shroff, Mani Varadarajan, Sanaz Bahargam, Rob Willoughby, David Gaddy, Guillaume Desjardins, Marco Cornero, Brona Robenek, Bhavishya Mittal, Ben Albrecht, Ashish Shenoy, Fedor Moiseev, Henrik Jacobsson, Alireza Ghaffarkhah, Morgane Rivière, Alanna Walton, Clément Crepy, Alicia Parrish, Zongwei Zhou, Clement Farabet, Carey Radebaugh, Praveen Srinivasan, Claudia van der Salm, Andreas Fidjeland, Salvatore Scellato, Eri Latorre-Chimoto, Hanna Klimczak-Plucińska, David Bridson, Dario de Cesare, Tom Hudson, Piermaria Mendolicchio, Lexi Walker, Alex Morris, Matthew Mauger, Alexey Guseynov, Alison Reid, Seth Odoom, Lucia Loher, Victor Cotruta, Madhavi Yenugula, Dominik Grewe, Anastasia Petrushkina, Tom Duerig, Antonio Sanchez, Steve Yadlowsky, Amy Shen, Amir Globerson, Lynette Webb, Sahil Dua, Dong Li, Surya Bhupatiraju, Dan Hurt, Haroon Qureshi, Ananth Agarwal, Tomer Shani, Matan Eyal, Anuj Khare, Shreyas Rammohan Belle, Lei Wang, Chetan Tekur, Mihir Sanjay Kale, Jinliang Wei, Ruoxin Sang, Brennan Saeta, Tyler Liechty, Yi Sun, Yao Zhao, Stephan Lee, Pandu Nayak, Doug Fritz, Manish Reddy Vuyyuru, John Aslanides, Nidhi Vyas, Martin Wicke, Xiao Ma, Evgenii Eltyshev, Nina Martin, Hardie Cate, James Manyika, Keyvan Amiri, Yelin Kim, Xi Xiong, Kai Kang, Florian Luisier, Nilesh Tripuraneni, David Madras, Mandy Guo, Austin Waters, Oliver Wang, Joshua Ainslie, Jason Baldridge, Han Zhang, Garima Pruthi, Jakob Bauer, Feng Yang, Riham Mansour, Jason Gelman, Yang Xu, George Polovets, Ji Liu, Honglong Cai, Warren Chen, Xiang Hai Sheng, Emily Xue, Sherjil Ozair, Christof Angermueller, Xiaowei Li, Anoop Sinha, Weiren Wang, Julia Wiesinger, Emmanouil Koukoumidis, Yuan Tian, Anand Iyer, Madhu Gurumurthy, Mark Goldenson, Parashar Shah, MK Blake, Hongkun Yu, Anthony Urbanowicz, Jennimaria Palomaki, Chrisantha Fernando, Ken Durden, Harsh Mehta, Nikola Momchev, Elahe Rahimtoroghi, Maria Georgaki, Amit Raul, Sebastian Ruder, Morgan Redshaw, Jinhyuk Lee, Denny Zhou, Komal Jalan, Dinghua Li, Blake Hechtman, Parker Schuh, Milad Nasr, Kieran Milan, Vladimir Mikulik, Juliana Franco, Tim Green, Nam Nguyen, Joe Kelley, Aroma Mahendru, Andrea Hu, Joshua Howland, Ben Vargas, Jeffrey Hui, Kshitij Bansal, Vikram Rao, Rakesh Ghiya, Emma Wang, Ke Ye, Jean Michel Sarr, Melanie Moranski Preston, Madeleine Elish, Steve Li, Aakash Kaku, Jigar Gupta, Ice Pasupat, Da-Cheng Juan, Milan Someswar, Tejvi M., Xinyun Chen, Aida Amini, Alex Fabrikant, Eric Chu, Xuanyi Dong, Amruta Muthal, Senaka Buthpitiya, Sarthak Jauhari, Nan Hua, Urvashi Khandelwal, Ayal Hitron, Jie Ren, Larissa Rinaldi, Shahar Drath, Avigail Dabush, Nan-Jiang Jiang, Harshal Godhia, Uli Sachs, Anthony Chen, Yicheng Fan, Hagai Taitelbaum, Hila Noga, Zhuyun Dai, James Wang, Chen Liang, Jenny Hamer, Chun-Sung Ferng, Chenel Elkind, Aviel Atias, Paulina Lee, Vít Listík, Mathias Carlen, Jan van de Kerkhof, Marcin Pikus, Krunoslav Zaher, Paul Müller, Sasha Zykova, Richard Stefanec, Vitaly Gatsko, Christoph Hirnschall, Ashwin Sethi, Xingyu Federico Xu, Chetan Ahuja, Beth Tsai, Anca Stefanoiu, Bo Feng, Keshav Dhandhania, Manish Katyal, Akshay Gupta, Atharva Parulekar, Divya Pitta, Jing Zhao, Vivaan Bhatia, Yashodha Bhavnani, Omar Alhadlaq, Xiaolin Li, Peter Danenberg, Dennis Tu, Alex Pine, Vera Filippova, Abhipso Ghosh, Ben Limonchik, Bhargava Urala, Chaitanya Krishna Lanka, Derik Clive, Yi Sun, Edward Li, Hao Wu, Kevin Hongtongsak, Ianna Li, Kalind Thakkar, Kuanysh Omarov, Kushal Majmundar, Michael Alverson, Michael Kucharski, Mohak Patel, Mudit Jain, Maksim Zabelin, Paolo Pelagatti, Rohan Kohli, Saurabh Kumar, Joseph Kim, Swetha Sankar, Vineet Shah, Lakshmi Ramachandruni, Xiangkai Zeng, Ben Bariach, Laura Weidinger, Tu Vu, Alek Andreev, Antoine He, Kevin Hui, Sheleem Kashem, Amar Subramanya, Sissie Hsiao, Demis Hassabis, Koray Kavukcuoglu, Adam Sadovsky, Quoc Le, Trevor Strohman, Yonghui Wu, Slav Petrov, Jeffrey Dean, and Oriol Vinyals. Gemini: A family of highly capable multimodal models, 2023.
  • Gemma Team et al. [2024] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yuhui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. Gemma: Open models based on Gemini research and technology, 2024.
  • Glaive AI [2023] Glaive AI. Glaive code assistant, 2023. URL https://huggingface.co/datasets/glaiveai/glaive-code-assistant-v3.
  • Goldhahn et al. [2012] Dirk Goldhahn, Thomas Eckart, and Uwe Quasthoff. Building large monolingual dictionaries at the Leipzig corpora collection: From 100 to 200 languages. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 759–765, Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/327_Paper.pdf.
  • Google Research [2024] Google Research. GATITOS, 2024. URL https://github.com/google-research/url-nlp/tree/main/gatitos.
  • Government of Botswana [2024] Government of Botswana. About our country. https://www.gov.bw/about-our-country, 2024. Accessed: 2024-07-30.
  • Goyal et al. [2022] Naman Goyal, Cynthia Gao, Vishrav Chaudhary, Peng-Jen Chen, Guillaume Wenzek, Da Ju, Sanjana Krishnan, Marc’Aurelio Ranzato, Francisco Guzmán, and Angela Fan. The Flores-101 evaluation benchmark for low-resource and multilingual machine translation. Transactions of the Association for Computational Linguistics, 10:522–538, 2022. doi: 10.1162/tacl_a_00474. URL https://aclanthology.org/2022.tacl-1.30.
  • Groeneveld et al. [2024] Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. OLMo: Accelerating the science of language models, 2024.
  • Gunasekar et al. [2023] Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee, and Yuanzhi Li. Textbooks are all you need, 2023.
  • Guo et al. [2024] Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guanting Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang. DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence, 2024.
  • Hendrycks et al. [2021a] Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning AI with shared human values. Proceedings of the International Conference on Learning Representations (ICLR), 2021a.
  • Hendrycks et al. [2021b] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021b.
  • iAfrika [2024] iAfrika. iAfrika, 2024. URL https://iafrika.org/.
  • Kudugunta et al. [2023] Sneha Kudugunta, Isaac Caswell, Biao Zhang, Xavier Garcia, Derrick Xin, Aditya Kusupati, Romi Stella, Ankur Bapna, and Orhan Firat. MADLAD-400: A multilingual and document-level large audited dataset. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 67284–67296. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/d49042a5d49818711c401d34172f9900-Paper-Datasets_and_Benchmarks.pdf.
  • Laurençon et al. [2023] Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gerard Dupont, Stella Biderman, Anna Rogers, Loubna Ben Allal, Francesco De Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, Paulo Villegas, Tristan Thrush, Shayne Longpre, Sebastian Nagel, Leon Weber, Manuel Muñoz, Jian Zhu, Daniel Van Strien, Zaid Alyafeai, Khalid Almubarak, Minh Chien Vu, Itziar Gonzalez-Dios, Aitor Soroa, Kyle Lo, Manan Dey, Pedro Ortiz Suarez, Aaron Gokaslan, Shamik Bose, David Adelani, Long Phan, Hieu Tran, Ian Yu, Suhas Pai, Jenny Chim, Violette Lepercq, Suzana Ilic, Margaret Mitchell, Sasha Alexandra Luccioni, and Yacine Jernite. The BigScience ROOTS corpus: A 1.6TB composite multilingual dataset, 2023.
  • Li et al. [2024] Jeffrey Li, Alex Fang, Georgios Smyrnis, Maor Ivgi, Matt Jordan, Samir Gadre, Hritik Bansal, Etash Guha, Sedrick Keh, Kushal Arora, Saurabh Garg, Rui Xin, Niklas Muennighoff, Reinhard Heckel, Jean Mercat, Mayee Chen, Suchin Gururangan, Mitchell Wortsman, Alon Albalak, Yonatan Bitton, Marianna Nezhurina, Amro Abbas, Cheng-Yu Hsieh, Dhruba Ghosh, Josh Gardner, Maciej Kilian, Hanlin Zhang, Rulin Shao, Sarah Pratt, Sunny Sanyal, Gabriel Ilharco, Giannis Daras, Kalyani Marathe, Aaron Gokaslan, Jieyu Zhang, Khyathi Chandu, Thao Nguyen, Igor Vasiljevic, Sham Kakade, Shuran Song, Sujay Sanghavi, Fartash Faghri, Sewoong Oh, Luke Zettlemoyer, Kyle Lo, Alaaeldin El-Nouby, Hadi Pouransari, Alexander Toshev, Stephanie Wang, Dirk Groeneveld, Luca Soldaini, Pang Wei Koh, Jenia Jitsev, Thomas Kollar, Alexandros G. Dimakis, Yair Carmon, Achal Dave, Ludwig Schmidt, and Vaishaal Shankar. DataComp-LM: In search of the next generation of training sets for language models, 2024.
  • Lingua [2024] Live Lingua. Learn Setswana, 2024. URL https://www.livelingua.com/courses/setswana.
  • Liu et al. [2020] Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, and Luke Zettlemoyer. Multilingual denoising pre-training for neural machine translation, 2020.
  • Lu et al. [2024] Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, and Fei Yuan. LLaMAX: Scaling linguistic horizons of LLM by enhancing translation capabilities beyond 100 languages. arXiv preprint arXiv:2407.05975, 2024.
  • Marivate et al. [2020] Vukosi Marivate, Tshephisho Sefara, Vongani Chabalala, Keamogetswe Makhaya, Tumisho Mokgonyane, Rethabile Mokoena, and Abiodun Modupe. Investigating an approach for low resource language dataset creation, curation and classification: Setswana and Sepedi. In Proceedings of the First Workshop on Resources for African Indigenous Languages, pages 15–20, Marseille, France, 2020. European Language Resources Association (ELRA). ISBN 979-10-95546-60-3. URL https://aclanthology.org/2020.rail-1.3.
  • Marivate et al. [2023] Vukosi Marivate, Moseli Mots’Oehli, Valencia Wagner, Richard Lastrucci, and Isheanesu Dzingirai. PuoBERTa: Training and evaluation of a curated language model for Setswana. In SACAIR 2023 (to appear), 2023.
  • McKellar et al. [2016] Cindy McKellar, Roald Eiselen, and Wikus Pienaar. Autshumato English-Setswana parallel corpora, 2016. URL https://hdl.handle.net/20.500.12185/404.
  • Mistral AI [2024] Mistral AI. Mathstral, 2024. URL https://mistral.ai/news/mathstral/.
  • Mistry and Gare [1987] Karen S. Mistry and Grace Gare. An introduction to spoken Setswana, 1987. URL https://eric.ed.gov/?id=ED283381.
  • Mokoka [2024] Keneilwe Mokoka. Exploring machine translation for code-switching between English and Setswana in South African classrooms, 2024. URL https://openreview.net/pdf?id=UoI4pdB2pY.
  • Moon et al. [2020] Jihyung Moon, Hyunchang Cho, and Eunjeong L. Park. Revisiting round-trip translation for quality estimation. In André Martins, Helena Moniz, Sara Fumega, Bruno Martins, Fernando Batista, Luisa Coheur, Carla Parra, Isabel Trancoso, Marco Turchi, Arianna Bisazza, Joss Moorkens, Ana Guerberof, Mary Nurminen, Lena Marg, and Mikel L. Forcada, editors, Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 91–104, Lisboa, Portugal, November 2020. European Association for Machine Translation. URL https://aclanthology.org/2020.eamt-1.11.
  • Motsoehli [2020] Moseli Motsoehli. TswanaBERT, 2020. URL https://huggingface.co/MoseliMotsoehli/TswanaBert.
  • Nations [1998] United Nations. Universal Declaration of Human Rights - Western Sotho/Tswana/Setswana, 1998. URL https://www.ohchr.org/en/human-rights/universal-declaration/translations/western-sothotswanasetswana.
  • NLLB Team et al. [2022] NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield, Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang, Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit, Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. No language left behind: Scaling human-centered machine translation, 2022.
  • Ogueji et al. [2021] Kelechi Ogueji, Yuxin Zhu, and Jimmy Lin. Small data? No problem! Exploring the viability of pretrained multilingual language models for low-resourced languages. In Duygu Ataman, Alexandra Birch, Alexis Conneau, Orhan Firat, Sebastian Ruder, and Gozde Gul Sahin, editors, Proceedings of the 1st Workshop on Multilingual Representation Learning, pages 116–126, Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.mrl-1.11. URL https://aclanthology.org/2021.mrl-1.11.
  • OpenAI [2024] OpenAI. Hello GPT-4o, 2024. URL https://openai.com/index/hello-gpt-4o/.
  • OpenAI et al. [2023] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Jan Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael Pokorny, Michelle Pokrass, Vitchyr H. Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine B. Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. GPT-4 technical report, 2023.
  • Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • Parliament of Botswana [2022] Parliament of Botswana. Hansard, 2022. URL https://www.parliament.gov.bw/index.php?option=com_documents&view=files&catid=87&Itemid=438.
  • Penedo et al. [2024] Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale, 2024. URL https://arxiv.org/abs/2406.17557.
  • Plaza et al. [2024] Irene Plaza, Nina Melero, Cristina del Pozo, Javier Conde, Pedro Reviriego, Marina Mayor-Rocher, and María Grandury. Spanish and LLM benchmarks: Is MMLU lost in translation?, 2024.
  • Popović [2015] Maja Popović. chrF: character n-gram F-score for automatic MT evaluation. In Ondřej Bojar, Rajan Chatterjee, Christian Federmann, Barry Haddow, Chris Hokamp, Matthias Huck, Varvara Logacheva, and Pavel Pecina, editors, Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/W15-3049. URL https://aclanthology.org/W15-3049.
  • Rahlao et al. [2021] Malebogo Rahlao, Nina Lewin, and Taariq Surtee. New uses for old books: Description of digitised corpora based on the Setswana language collection in the Wits Cullen Africana collection. In Proceedings of the International Conference of the Digital Humanities Association of Southern Africa (DHASA), Johannesburg, South Africa, 2021. The University of the Witwatersrand, Johannesburg.
  • Rajbhandari et al. [2020] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’20. IEEE Press, 2020. ISBN 9781728199986.
  • Rasley et al. [2020] Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. DeepSpeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, pages 3505–3506, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450379984. doi: 10.1145/3394486.3406703. URL https://doi.org/10.1145/3394486.3406703.
  • SADiLaR [2024] SADiLaR. South African Centre for Digital Language Resources (SADiLaR), 2024. URL https://sadilar.org/en/.
  • Schwenk et al. [2021] Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6490–6500, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.507. URL https://aclanthology.org/2021.acl-long.507.
  • Soldaini et al. [2024] Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, and Kyle Lo. Dolma: An open corpus of three trillion tokens for language model pretraining research, 2024.
  • Suárez et al. [2019] Pedro Javier Ortiz Suárez, Benoît Sagot, and Laurent Romary. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. In Piotr Bański, Adrien Barbaresi, Hanno Biber, Evelyn Breiteneder, Simon Clematide, Marc Kupietz, Harald Lüngen, and Caroline Iliadi, editors, Proceedings of the Workshop on Challenges in the Management of Large Corpora (CMLC-7) 2019, Cardiff, 22nd July 2019, pages 9–16, Mannheim, 2019. Leibniz-Institut für Deutsche Sprache. doi: 10.14618/ids-pub-9021. URL https://nbn-resolving.org/urn:nbn:de:bsz:mh39-90215.
  • Tatoeba [2024] Tatoeba. Tatoeba, 2024. URL https://tatoeba.org/en/.
  • Teknium [2023] Teknium. OpenHermes 2.5: An open dataset of synthetic data for generalist LLM assistants, 2023. URL https://huggingface.co/datasets/teknium/OpenHermes-2.5.
  • The Parliament of Zimbabwe [2013] The Parliament of Zimbabwe. Constitution of Zimbabwe. https://parlzim.gov.zw/download/constitution-of-zimbabwe-amendment-no-20-14-05-2013/, 2013. Accessed: 2024-07-30.
  • The Republic of South Africa [1996] The Republic of South Africa. The South African Constitution. https://www.justice.gov.za/constitution/SAConstitution-web-eng.pdf, 1996. Accessed: 2024-07-30.
  • Tiedemann [2012] Jörg Tiedemann. Parallel data, tools and interfaces in OPUS. In Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Mehmet Uğur Doğan, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), pages 2214–2218, Istanbul, Turkey, May 2012. European Language Resources Association (ELRA). URL http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf.
  • Tlhalefang [2009] Tlhalefang. Tlhalefang communications, 2009. URL https://tlhalefang.com/setswana/.
  • TRK [2021] TRK. Setswana mo Botswana, 2021. URL https://setswanabwa.com/.
  • Unisa [2023] Unisa. Learn to speak an African language, 2023. URL https://www.unisa.ac.za/sites/corporate/default/Unisa-Open/OER-@-Unisa/Learn-to-speak-an-African-Language.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2017.
  • Wang et al. [2024] Tiannan Wang, Jiamin Chen, Qingrui Jia, Shuai Wang, Ruoyu Fang, Huilin Wang, Zhaowei Gao, Chunzhao Xie, Chuou Xu, Jihong Dai, Yibin Liu, Jialong Wu, Shengwei Ding, Long Li, Zhiwei Huang, Xinle Deng, Teng Yu, Gangan Ma, Han Xiao, Zixin Chen, Danjun Xiang, Yunxia Wang, Yuanyuan Zhu, Yi Xiao, Jing Wang, Yiru Wang, Siran Ding, Jiayang Huang, Jiayi Xu, Yilihamu Tayier, Zhenyu Hu, Yuan Gao, Chengfeng Zheng, Yueshu Ye, Yihang Li, Lei Wan, Xinyue Jiang, Yujie Wang, Siyu Cheng, Zhule Song, Xiangru Tang, Xiaohua Xu, Ningyu Zhang, Huajun Chen, Yuchen Eleanor Jiang, and Wangchunshu Zhou. Weaver: Foundation models for creative writing, 2024.
  • Xiao et al. [2023] Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks, 2023.
  • Yang et al. [2024] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report, 2024.
  • Yuan et al. [2023] Fei Yuan, Yinquan Lu, Wenhao Zhu, Lingpeng Kong, Lei Li, Yu Qiao, and Jingjing Xu. Lego-MT: Learning detachable models for massively multilingual machine translation. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 11518–11533, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.731. URL https://aclanthology.org/2023.findings-acl.731.
  • Zhao et al. [2024] Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. WildChat: 1M ChatGPT interaction logs in the wild, 2024.
  • Üstün et al. [2024] Ahmet Üstün, Viraat Aryabumi, Zheng-Xin Yong, Wei-Yin Ko, Daniel D’souza, Gbemileke Onilude, Neel Bhandari, Shivalika Singh, Hui-Lee Ooi, Amr Kayid, Freddie Vargus, Phil Blunsom, Shayne Longpre, Niklas Muennighoff, Marzieh Fadaee, Julia Kreutzer, and Sara Hooker. Aya model: An instruction finetuned open-access multilingual language model, 2024.