7 min read · The Lingo team

Why we're moving to compressed (int8) models

We rebuilt our entire serving stack on int8 models: ~3.8× smaller and ~6× faster on CPU, for a quality cost that, on narrow-domain, low-resource models, sits below the noise. Here's the data, the examples, and the reasoning.

engineering · quantization · data

Keeping dozens of translation models online for free has one enemy: size. Each of our fp32 models is ~285 MB; the full set is ~31 GB. That's slow to move, slow to load, and expensive to keep resident in memory. So we measured a fix, and it worked better than we expected. All inference now runs on int8-quantized models.

What "quantization" means here

A neural model is a pile of numbers (weights). Ours are stored as 32-bit floats (fp32). int8 quantization stores them as 8-bit integers instead, roughly a quarter of the size, plus a small set of scale factors so the integers still approximate the original values. We use CTranslate2, an inference engine purpose-built for Transformer translation models, which supports int8 on both x86 and ARM CPUs (see its quantization docs).
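To make the idea concrete, here is a minimal sketch of symmetric int8 quantization on a single row of weights. It illustrates the general technique only: the function names and the sample values are made up for this example, and CTranslate2's internal scheme may differ in its details.

```python
def quantize_row(weights):
    """Map one row of float weights to integers in [-127, 127] plus one scale."""
    max_abs = max(abs(w) for w in weights)
    # One scale per row: the largest magnitude maps to +/-127.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_row(quantized, scale):
    """Recover approximate float weights from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
row = [0.42, -1.27, 0.003, 0.9]
q, scale = quantize_row(row)
restored = dequantize_row(q, scale)

# Rounding to the nearest integer bounds the per-weight error by scale / 2.
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(q, scale, max_error)
```

Each weight then costs one byte instead of four, which is where the "roughly a quarter of the size" comes from. The per-row scale adds only a few bytes of overhead.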

The numbers, on our own model

We converted francais-ewondo and benchmarked it against the fp32 original on the same machine:

| Metric | fp32 (before) | int8 (now) | Change |
|---|---|---|---|
| File size | 285 MB | 75 MB | 3.8× smaller |
| Latency / sentence | 532 ms | 86 ms | 6.2× faster |

Across the whole fleet that's ~31 GB → ~9 GB and a several-fold speedup — the difference between "needs a paid GPU box" and "runs on a free-tier CPU node."
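For anyone checking the arithmetic, the ratios follow directly from the measurements above. The fleet figure in this sketch is a rough extrapolation that assumes every model compresses at the same ratio as francais-ewondo, which is an assumption rather than something we measured per model.

```python
# Measurements for francais-ewondo, taken from the benchmark above.
fp32_mb, int8_mb = 285, 75
fp32_ms, int8_ms = 532, 86

size_ratio = fp32_mb / int8_mb
speedup = fp32_ms / int8_ms

# Rough fleet estimate, assuming the same compression ratio for every model.
fleet_fp32_gb = 31
fleet_int8_estimate_gb = fleet_fp32_gb / size_ratio

print(f"{size_ratio:.1f}x smaller, {speedup:.1f}x faster")
print(f"fleet estimate: ~{fleet_int8_estimate_gb:.1f} GB")
```

The extrapolation lands in the same ballpark as the ~9 GB quoted above. A small gap is unsurprising, since not every file in a model directory necessarily shrinks by the same ratio.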

But what about quality?

This is the honest part. The literature reports int8 costing under ~1 BLEU point (for morphologically rich languages, chrF is the better metric). But "<1 BLEU" is measured against human references, not against the fp32 output. When you compare int8 directly to fp32, beam search can pick a different but equally valid translation, which looks like a big change while being no loss at all. Here's what that actually looks like:

FR: Bonjour, comment vas-tu ?
fp32: Mina, eyë onë wa?
int8: Mina, eyë onë wa? (identical)

FR: Dieu est amour.
fp32: Zamba anë ediṅ.
int8: Zamba anë ediṅ. (identical)

FR: Je vais au marché demain matin.
fp32: Mayi kë a nda-mëdzo a nda-mëdzo. (note the repetition)
int8: Mayi kë a kidi a nda-mëdzo. (different, and arguably cleaner)

The int8 output is fluent, and where it differs it is not obviously worse — sometimes better.
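A minimal sketch of how this kind of agreement can be measured, using the three sentence pairs above. The `char_f_score` function is a simplified, chrF-style character n-gram score written for this illustration. It is not the reference chrF implementation, and its numbers will not match a standard tool exactly.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_f_score(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style F-score in [0, 1], averaged over n-gram orders."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)


# (fp32 output, int8 output) pairs from the examples above.
pairs = [
    ("Mina, eyë onë wa?", "Mina, eyë onë wa?"),
    ("Zamba anë ediṅ.", "Zamba anë ediṅ."),
    ("Mayi kë a nda-mëdzo a nda-mëdzo.", "Mayi kë a kidi a nda-mëdzo."),
]

exact = sum(fp32 == int8 for fp32, int8 in pairs)
scores = [char_f_score(int8, fp32) for fp32, int8 in pairs]

print(f"{exact}/{len(pairs)} outputs identical")
for (fp32, int8), score in zip(pairs, scores):
    print(f"{score:.2f}  {int8}")
```

A score below 1 here only says the two outputs differ, not which one is better. To measure quality, the same function would be applied to each system's output against a human reference translation.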

Why it's nearly free for us specifically

Our models are trained on a small, formal-leaning corpus — scripture as the aligned backbone, plus the books, pamphlets and printed material we could gather (we [wrote about that](/blog/building-a-corpus-from-scarcity)). That means their quality is already bounded by a narrow domain, not by numerical precision. The dominant source of error is what the model learned, and int8's rounding noise sits far below that floor.

So the trade most projects weigh carefully is, for us, lopsided in our favour: we give up a fraction of a BLEU point that we can't perceive, and gain models that are 3.8× smaller and ~6× faster. A narrow-domain, low-resource model is exactly the kind of model you should quantize.

What we keep

  • The fp32 originals are archived — permanently, on our own hardware. int8 is a serving format, never a replacement.
  • Our first open models stay public on [Hugging Face](https://huggingface.co/flagship-ai) exactly as they were. Their original upload dates are part of the record: this work has been underway for years, not assembled overnight.
  • The compressed models join them as new, clearly-labelled int8 releases — so anyone can run the whole set on a laptop.

That's the whole engineering philosophy in one decision: be honest about your limits, and let them make your hard choices easy.

References