February 20, 2024 · 7 min read · The Lingo team

Building a corpus from scarcity

Training data for these languages doesn't sit in a database — we had to go and find it: scanned books, pamphlets, blogs, purchased booklets, and scripture as the aligned backbone, all unified under one alphabet.

data · corpus · methodology

Machine translation is hungry for parallel text — the same content in two languages, sentence for sentence. For English–French you can find billions of such pairs. For most Cameroonian languages you can find approximately none, and what little exists is scattered across formats, spellings, and physical shelves. So building the corpus was, more than anything, an act of collection.

Going and finding the data

There is no dataset to download. We gathered written material wherever it lived: school textbooks and language primers, storybooks, religious pamphlets and church bulletins, language-learning booklets, the occasional blog or Facebook post, government literacy materials. Some of it we scanned or photographed page by page; some we transcribed by hand; some we bought as physical books because that was the only way to get it. Every source arrived in its own encoding, layout, and spelling conventions.

The aligned backbone: scripture

Most of that material is monolingual. That is useful, but a translation model learns mainly from text that is paired with another language. For parallel text, one source stands above all others. For centuries, Bible-translation organisations have translated the same long, structured text into thousands of small languages, verse by verse, with a built-in alignment (chapter and verse numbers) mapping each fragment to its counterpart in every other language.

So scripture became the aligned backbone of the corpus. It is frequently the largest and cleanest parallel text that exists for a low-resource language, and its verse numbering gives alignment at roughly sentence level for free. We compiled and aligned it across dozens of Cameroonian languages and open-sourced it as cameroon_bibles. But it is the backbone, not the body: the rest of the gathered material fills out vocabulary, style, and coverage the verses can't reach.
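To make the verse-number alignment concrete, here is a minimal sketch in Python. It is an illustration, not the format or code behind cameroon_bibles: the `(book, chapter, verse)` key, the book code "GEN", the function name `align_by_verse`, and the placeholder strings are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A verse reference: (book, chapter, verse). The key shape is an assumption
# for this sketch, not the layout of any particular dataset.
VerseRef = Tuple[str, int, int]


def align_by_verse(
    source: Dict[VerseRef, str],
    target: Dict[VerseRef, str],
) -> List[Tuple[VerseRef, str, str]]:
    """Pair verses that are present and non-empty in both translations."""
    pairs = []
    # Only references found in both editions can form a parallel pair.
    # Sorting is alphabetical by book code, then numeric by chapter and verse.
    for ref in sorted(source.keys() & target.keys()):
        src, tgt = source[ref].strip(), target[ref].strip()
        if src and tgt:  # skip verses left blank in either edition
            pairs.append((ref, src, tgt))
    return pairs


# Placeholder strings stand in for real verse text.
english = {
    ("GEN", 1, 1): "english verse 1",
    ("GEN", 1, 2): "english verse 2",
    ("GEN", 1, 3): "english verse 3",
}
other = {
    ("GEN", 1, 1): "target verse 1",
    ("GEN", 1, 3): "target verse 3",
    ("GEN", 1, 4): "target verse 4",
}

pairs = align_by_verse(english, other)
for ref, src, tgt in pairs:
    print(ref, "|", src, "|", tgt)
```

Verses missing from either side are simply dropped, which is why incomplete translations still yield usable pairs. A real pipeline would also have to handle editions that number or merge verses differently, which this sketch ignores.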

One alphabet to unify them

The hardest problem wasn't quantity — it was inconsistency. The same language is spelled different ways by different authors, and many have no standard orthography at all. A model can't learn if "the same word" looks like three different words.

So we standardised on AGLC, the Alphabet Général des Langues Camerounaises (General Alphabet of Cameroonian Languages), a phonemic system designed precisely to write Cameroon's languages consistently. It's an elegant, unifying idea. It is also, unfortunately, not used by everyone, so a large part of the work was normalising messy, inconsistent sources toward AGLC and building our tokenizer around it. One alphabet, one tokenizer, many sources made comparable.
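One small, mechanical piece of that normalisation can be sketched in Python: making sure that text which looks identical on the page is also identical as a sequence of code points. This is only a sketch under assumptions. The `LOOKALIKES` table and the `normalize` function are hypothetical, the single entry shown is illustrative, and real normalisation toward AGLC involves language-specific spelling decisions that no character table captures.

```python
import unicodedata

# Hypothetical look-alike table, for illustration only: characters that
# render like the intended letter but are a different code point. A real
# table would be built per source and per language.
LOOKALIKES = {
    "\u03b5": "\u025b",  # Greek small epsilon -> Latin small open e
}
_TABLE = str.maketrans(LOOKALIKES)


def normalize(text: str) -> str:
    """Replace look-alike characters, then compose to Unicode NFC."""
    text = text.translate(_TABLE)
    # NFC merges a base letter and a combining mark into one code point
    # where a precomposed form exists, so both typings compare equal.
    return unicodedata.normalize("NFC", text)


# The same accented letter typed two ways: precomposed vs. base + combining acute.
typed_precomposed = "\u00e9"
typed_combining = "e\u0301"

print(typed_precomposed == typed_combining)                        # False
print(normalize(typed_precomposed) == normalize(typed_combining))  # True
```

Running a step like this before tokenisation means the tokenizer sees one form per letter instead of several, which is the point of unifying sources under one alphabet.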

A tailwind from policy

We had help from an unexpected direction: the government's policy of teaching national languages in schools. That policy is quietly generating new written content — primers, workbooks, exam materials — and, just as importantly, a generation more used to seeing these languages written. More written content tomorrow means better models tomorrow.

Being honest about the limits

Even with all of this, the corpus is small and skews formal: scripture, schoolbooks, and printed matter, not how people actually greet, bargain, or tell jokes. Its vocabulary is thin on the everyday and the modern. This matters downstream — including, much later, our decision to compress the models: when a model's ceiling is set by a narrow, formal domain, the dominant source of error is the data, not the last bits of numerical precision. We'll return to that.

The text corpus was never the destination. It was the on-ramp — enough to train first models, and to make the real goal reachable: open voice data, contributed by speakers themselves.