6 min read · The Lingo team

Fitting 56 languages on a free server: int8 and the economics of always-on

How we keep dozens of translation models online for almost nothing: CPU nodes that yield to GPUs, idle-eviction, and int8 quantization that's nearly free precisely because our models are trained on a narrow, formal corpus.

engineering, quantization, infrastructure

An archive only counts if it's reachable. Ours has to stay online indefinitely on a budget that rounds to zero. Here's how the translator actually runs.

A fleet that costs nothing to idle

Translations are served by a small fleet of workers that pull jobs from a shared queue. Nodes advertise their liveness and capability with a short-lived heartbeat:

  • A free-tier CPU node (an ARM box on an always-free cloud tier) holds the floor — it's always on, slow but free.
  • When a more powerful GPU or priority machine comes online, it announces itself and the CPU nodes yield: the fast machine takes every job until it leaves.
  • The website reads those heartbeats directly, so an open browser tab knows in real time whether it's talking to a fast node or a thrifty one — and shows a banner when responses will be slow.
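
The yield rule above can be sketched in a few lines. This is a minimal illustration, not our production code: the `Heartbeat` fields, the `priority` numbers, the `"cpu"`/`"gpu"` tier labels, and the 15-second TTL are all assumptions made for the example.

```python
from dataclasses import dataclass

HEARTBEAT_TTL = 15.0  # seconds a heartbeat stays valid (hypothetical value)


@dataclass
class Heartbeat:
    node_id: str
    tier: str       # "cpu" or "gpu" (hypothetical labels)
    priority: int   # higher number outranks lower
    sent_at: float  # timestamp in seconds


def live_nodes(beats, now, ttl=HEARTBEAT_TTL):
    """Keep only heartbeats that have not expired."""
    return [b for b in beats if now - b.sent_at <= ttl]


def should_serve(me, beats, now, ttl=HEARTBEAT_TTL):
    """A node pulls jobs only if no live node outranks it."""
    live = live_nodes(beats, now, ttl)
    top = max((b.priority for b in live), default=me.priority)
    return me.priority >= top


def banner(beats, now, ttl=HEARTBEAT_TTL):
    """What a browser tab could show, derived from the same heartbeats."""
    live = live_nodes(beats, now, ttl)
    if not live:
        return "offline"
    best = max(live, key=lambda b: b.priority)
    return "fast" if best.tier == "gpu" else "slow"
```

Because heartbeats expire on their own, a GPU machine that disappears without saying goodbye stops outranking the CPU node once its last heartbeat ages out, and the CPU node resumes serving with no extra coordination.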

Models are loaded on demand and evicted after idle time, so a 56-language fleet never has to hold 56 models in RAM at once.
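
As a rough sketch of load-on-demand with idle eviction: the class name, the 10-minute default, and the `loader` callback below are hypothetical, and a real worker would load a translation model where this example passes in a stand-in function.

```python
import time


class IdleEvictingCache:
    """Load models lazily and drop any that have sat unused too long."""

    def __init__(self, loader, idle_seconds=600.0, clock=time.monotonic):
        self._loader = loader      # callable: key -> loaded model (assumed)
        self._idle = idle_seconds  # hypothetical idle threshold
        self._clock = clock        # injectable so the logic can be tested
        self._entries = {}         # key -> (model, last_used)

    def get(self, key):
        """Return the model for key, loading it if needed."""
        now = self._clock()
        self.evict_idle(now)
        if key in self._entries:
            model, _ = self._entries[key]
        else:
            model = self._loader(key)
        self._entries[key] = (model, now)  # refresh last-used time
        return model

    def evict_idle(self, now=None):
        """Drop every model idle for longer than the threshold."""
        now = self._clock() if now is None else now
        stale = [k for k, (_, used) in self._entries.items()
                 if now - used > self._idle]
        for k in stale:
            del self._entries[k]
        return stale

    def loaded(self):
        """Keys of the models currently held in memory."""
        return sorted(self._entries)
```

Resident memory then tracks the set of languages people are actually using right now, not the size of the catalogue.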

Making the models smaller

The biggest lever is the models themselves. Each fp32 model is ~285 MB and the full set is ~31 GB, which is slow to move and bulky to store. We're moving serving to CTranslate2 with int8 quantization, which:

  • Shrinks each model to ~80–100 MB (the whole set from ~31 GB to ~9 GB).
  • Runs 2–4× faster on CPU, including on ARM.
  • Costs less than about 1 BLEU point of quality, which is within the noise of how MT quality is measured in the first place.

Why int8 is nearly free for us specifically

Here's the part that's particular to this project. Quantization trades a little numerical precision for size and speed. Normally you weigh that trade carefully. But our models are trained on a small, formal corpus — scripture, schoolbooks, and the printed material we could gather — so their quality is already bounded by a narrow domain, not by the last bits of precision.

The dominant error in our output comes from what the model learned, not from fp32 vs int8. So the quantization loss sits far below the error floor that's already there. We give up a fraction of a BLEU point we can't perceive and gain models that are roughly 3.4× smaller (~31 GB down to ~9 GB) and 2–4× faster, and that run comfortably on free hardware.
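
To make the precision-for-size trade concrete, here is a toy sketch of symmetric per-tensor int8 quantization in plain Python. It illustrates the general idea only; it is not how CTranslate2 implements quantization internally, and the random "weights" are stand-ins, not values from our models.

```python
import random
import struct


def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~= scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [scale * v for v in q]


random.seed(0)
weights = [random.gauss(0.0, 0.05) for _ in range(10_000)]  # stand-in tensor
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Storage: 4 bytes per fp32 weight versus 1 byte per int8 code.
fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))
int8_bytes = len(struct.pack(f"{len(q)}b", *q))

# Rounding error per weight is bounded by half a quantization step.
max_err = max(abs(a - b) for a, b in zip(weights, restored))

print(f"size ratio: {fp32_bytes / int8_bytes:.1f}x")
print(f"max round-trip error: {max_err:.6f} (step = {scale:.6f})")
```

The raw weights shrink by exactly 4×, and each one moves by at most half a quantization step. That 4× is a ceiling for the weights alone; the on-disk figures quoted above come in below it.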

The fp32 originals stay archived — int8 is a serving format, never a replacement. But for keeping 56 languages alive and reachable for the price of a coffee, a narrow-domain model is exactly the kind of model you should quantize.

That's the whole philosophy in one engineering decision: be honest about your limits, and let them make your hard choices easy.