The translator's multilingual stack
Ten picks for the localization engineer, translator, or i18n PM with a real pipeline: extract → glossary → translate (LLM + NMT) → terminology QA → reinject. Self-hosted Weblate/Tolgee, LibreTranslate as a fallback, Vale + LanguageTool for QA, PDFMathTranslate and KrillinAI for the stubborn formats.
What's in this pack
This is the stack for the working localization engineer, translator, or i18n PM — the person who has to ship a po/xliff/json file that doesn't break the build, doesn't lose the placeholder {username}, and doesn't translate "trial" as "court case" because the LLM forgot it was a SaaS app.
It is not a single magic translator. The job has five stages — extract, build a glossary, translate, QA, reinject — and each stage has its own tool with its own failure mode. The pack picks one default for each stage, plus one fallback for the two stages where defaults bite (the translation engine itself, and the formats that don't fit a TMS).
Everything below is self-hostable. That matters: much of the translation memory in a real company is legally sensitive (contracts, release notes before embargo, customer support tickets). Sending it to a third-party MT vendor without review is how you end up on a compliance call.
Install in this order (extract → glossary → translate → QA → reinject)
- Weblate — the TMS. Start here, because every other tool plugs into it. Weblate watches your git repo, extracts strings from gettext/xliff/json/properties, hands them to translators (or to an MT engine), and pushes the result back as a commit. Self-hosted on Docker in an afternoon.
- Tolgee — the developer-friendly alternative. Pick Tolgee instead of Weblate if your translators are non-technical and need an in-context editor (the Tolgee SDK lets them alt-click a string on the running app). Most teams pick one or the other, not both — Weblate for git-native engineering teams, Tolgee for product-led teams.
- LibreTranslate — the NMT engine. Self-hosted Argos models, no API keys, no rate limits. This is your fallback when the LLM is too slow, too expensive, or refuses to translate something. Wire it in as Weblate's automatic-suggestion backend.
- Fairseq — when you need to actually train or fine-tune an NMT model on your in-house corpus. Most teams won't go here. The ones that do (regulated industries, low-resource languages, post-edit-heavy workflows) cannot avoid it. Knowing it exists is half the battle.
- Claude / GPT-4 class LLM (via your IDE or API) — the context-aware translator. Use it for the strings that have to read like a human wrote them: marketing copy, error messages users see, onboarding. Always pass the glossary and surrounding context in the prompt. Always.
- NLTK — the Python NLP toolkit. You'll use it for the unglamorous middle-stage work: tokenizing strings, sentence-splitting, extracting candidate terms for the glossary, computing BLEU/chrF on your translation outputs. Not a translator. The duct tape.
- LanguageTool — grammar + style QA across 25+ languages. Run it on the translated output, not the source. Catches the silent class of bugs where the translation is grammatically wrong in ways a non-speaker reviewer would never notice (German case, French agreement, Spanish ser/estar).
- Vale — prose linter with custom rule packs. This is your terminology enforcer: forbid "login" when style guide says "sign in", forbid translating "Pull Request" at all, flag forbidden tone in marketing locales. Pairs with LanguageTool: Vale catches policy violations, LanguageTool catches grammar.
- PDFMathTranslate — translate PDFs while preserving layout, math, and figures. The thing every other PDF translator gets wrong. Critical for technical docs, academic papers, regulatory submissions that have to round-trip through PDF.
- Whisper + KrillinAI — when the source is spoken. Whisper extracts the transcript with timestamps; KrillinAI handles the full video-to-100-language pipeline including dubbing if you need it. Use these only when the source is video — they're the escape hatch for the format the TMS cannot touch.
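The "always pass the glossary and surrounding context" rule for the LLM step can be sketched as a small prompt builder. This is a minimal illustration, not any vendor's API: the `Segment` fields, the `build_prompt` name, the prompt wording, and the sample French glossary entries are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    key: str      # message id in the po/xliff/json file
    source: str   # source-language text
    context: str  # hypothetical free-text note: where the string appears


def build_prompt(segment: Segment, glossary: dict[str, str], target_lang: str) -> str:
    """Assemble a translation prompt that carries glossary and context."""
    # Only send glossary entries whose source term occurs in this string,
    # so the prompt stays short on large glossaries.
    relevant = {
        term: target
        for term, target in glossary.items()
        if term.lower() in segment.source.lower()
    }
    glossary_lines = "\n".join(f"- {t} => {g}" for t, g in sorted(relevant.items()))
    return (
        f"Translate the UI string below into {target_lang}.\n"
        "Keep placeholders such as {username} exactly as written.\n"
        f"Context: {segment.context}\n"
        f"Glossary (mandatory):\n{glossary_lines or '- (none)'}\n"
        f"String ({segment.key}): {segment.source}"
    )


# Illustrative data only; the glossary targets are made up for the example.
glossary = {"trial": "essai gratuit", "workspace": "espace de travail"}
segment = Segment(
    key="billing.trial_ends",
    source="Your trial ends in {days} days.",
    context="Banner on the billing page of a SaaS app",
)
prompt = build_prompt(segment, glossary, "French")
```

The resulting `prompt` string is what you would hand to whichever LLM client you use. Filtering the glossary per string is a design choice: it trades a little preprocessing for shorter prompts.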
How they fit together (translation pipeline)
source content (po / xliff / json / md / pdf / video)
                   │
                   ▼
┌──────── Weblate (or Tolgee) ─────────┐
│ extract strings + segment            │
│ ──────────────────────────────────── │
│ for PDF   → PDFMathTranslate         │
│ for video → Whisper + KrillinAI      │
└──────────────────┬───────────────────┘
                   ▼
          glossary (TMX/CSV)
      maintained by terminologist
                   │
                   ▼
┌───────────── translate ──────────────┐
│ LLM (context-aware copy)             │
│ LibreTranslate (bulk + fallback)     │
│ Fairseq (if fine-tuning)             │
└──────────────────┬───────────────────┘
                   ▼
┌────────────── QA gate ───────────────┐
│ Vale (terms)                         │
│ AND LanguageTool (grammar)           │
│ AND NLTK (BLEU/chrF score)           │
└──────────────────┬───────────────────┘
                   ▼
reinject via Weblate commit → git → build
The gate that matters is the QA gate: nothing reaches the reinject step until Vale + LanguageTool both pass on the translated string and the glossary report shows zero forbidden-term hits. Without that gate, LLM context loss eats you alive at scale.
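As a rough sketch of that gate logic: the function below blocks a string when either a forbidden glossary term appears or any upstream linter finding is present. It does not call Vale or LanguageTool; `linter_findings` is assumed to be a list of messages you have already collected from them, and the plain-regex term check is only a stand-in for a real Vale rule pack. All names here are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    problems: list[str] = field(default_factory=list)


def forbidden_term_hits(text: str, forbidden: dict[str, str]) -> list[str]:
    """Return one message per forbidden term found as a whole word."""
    hits = []
    for term, preferred in forbidden.items():
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            hits.append(f"forbidden term '{term}' (use '{preferred}')")
    return hits


def qa_gate(translated: str, forbidden: dict[str, str],
            linter_findings: list[str]) -> GateResult:
    """Pass only when there are zero term hits and zero linter findings."""
    problems = forbidden_term_hits(translated, forbidden) + list(linter_findings)
    return GateResult(passed=not problems, problems=problems)


# Illustrative style-guide entry: forbid "login", prefer "sign in".
style_guide = {"login": "sign in"}
blocked = qa_gate("Please login to continue", style_guide, linter_findings=[])
allowed = qa_gate("Please sign in to continue", style_guide, linter_findings=[])
```

In CI, a failing `GateResult` would translate into a non-zero exit code so the reinject commit never happens.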
Tradeoffs you'll hit
- LLM vs Google Translate vs DeepL vs LibreTranslate: LLMs win on context (they know {user_name} is a placeholder, not a word to translate). DeepL wins on fluency for EU languages. Google has the widest language coverage. LibreTranslate wins on cost + privacy because it runs in your VPC. Production stacks route by string: marketing copy → LLM, UI strings → DeepL or NMT, internal docs → LibreTranslate.
- Glossary strictness: too strict and the translation reads like a robot translated it. Too loose and "sign in" / "log in" / "login" all coexist in one product. The middle path: hard-enforce on brand terms and legal terms, soft-suggest on style choices, let the reviewer override with a comment.
- Weblate vs Tolgee vs Lokalise/Crowdin: the hosted SaaS options (Lokalise, Crowdin, Phrase) ship faster but lock you in and price per string. Weblate is the open-source default if your translators live in git. Tolgee is the open default if they need in-context editing. Skip SaaS unless you genuinely need the integrations and don't mind the bill.
- MT pre-translate, then post-edit, vs human-only: for roughly 90% of strings, MT + post-edit is 3-5x faster and reaches comparable quality. The 10% where it fails (legal, brand voice, jokes, cultural references) you flag up front with a tag in the source file and route human-only. Pretending the entire app is in the 10% is how localization costs explode.
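The route-by-string idea, combined with the human-only tag, might look like the sketch below. The string kinds, tag names, and the choice to default unknown kinds to LibreTranslate (so that unclassified content stays in your own infrastructure) are assumptions for illustration, not something the tools prescribe.

```python
from enum import Enum
from typing import Iterable


class Engine(Enum):
    LLM = "llm"
    DEEPL_OR_NMT = "deepl-or-nmt"
    LIBRETRANSLATE = "libretranslate"
    HUMAN = "human"


# Hypothetical tags a team might attach to source strings up front.
HUMAN_ONLY_TAGS = frozenset({"legal", "brand-voice", "joke", "cultural-reference"})

# Hypothetical string kinds mapped to the engines named in the tradeoff above.
ROUTES = {
    "marketing": Engine.LLM,
    "error-message": Engine.LLM,
    "ui": Engine.DEEPL_OR_NMT,
    "internal-doc": Engine.LIBRETRANSLATE,
}


def route(kind: str, tags: Iterable[str] = ()) -> Engine:
    """Pick an engine for one string; human-only tags always win."""
    if HUMAN_ONLY_TAGS & set(tags):
        return Engine.HUMAN
    # Unknown kinds fall back to the self-hosted engine (an assumption).
    return ROUTES.get(kind, Engine.LIBRETRANSLATE)


marketing_engine = route("marketing")
legal_engine = route("ui", tags=["legal"])
```

Checking the human-only tags first means a mis-classified `kind` can never send a legal string to a machine engine.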
Common pitfalls
- Context loss: translating "Save" without knowing whether it's a button, a feature, or the past tense of "saved". The fix: include screenshot URLs, the surrounding sentence, and the UI component type in the translation memory. Tolgee does this natively; Weblate needs a screenshot plugin.
- Format breakage: losing the {count} placeholder, breaking the <a href="..."> HTML, or removing the trailing newline. Always validate the shape of the translated string against the source before commit. Weblate has placeholder checks; turn them on.
- Glossary drift: three translators each chose a different word for "workspace" over six months. Run a weekly Vale + glossary audit; treat a glossary violation as a CI failure, not a soft warning.
- Pluralization: English has 2 plural forms, Russian has 3 in gettext (4 categories in CLDR, which ICU uses), Arabic has 6, Chinese has 1. Use ICU MessageFormat from day 1; don't string-concatenate plurals.
- RTL languages: Arabic and Hebrew don't just flip text; they flip layout, parentheses, and punctuation rules. Test RTL in QA, not in production.
- Sending TMs to third-party LLMs without redaction: your translation memory contains every release note before launch and every customer support ticket. Redact PII and embargoed content before it leaves your VPC. LibreTranslate exists for exactly this reason.
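For the format-breakage pitfall, a shape check can be sketched in a few lines. This is not Weblate's implementation of its placeholder checks; it is a standalone approximation that assumes `{name}`-style and printf-style placeholders plus simple HTML tags, and it deliberately does not parse ICU plural or select messages.

```python
import re
from collections import Counter

# {name} placeholders, printf-style %(name)s / %s / %d, and HTML tags.
PLACEHOLDER_RE = re.compile(
    r"\{[A-Za-z_][A-Za-z0-9_]*\}"
    r"|%\([A-Za-z_]\w*\)[sd]"
    r"|%[sd]"
    r"|</?[A-Za-z][^>]*>"
)


def shape_problems(source: str, translated: str) -> list[str]:
    """Compare placeholders, tags, and trailing newline between two strings."""
    problems = []
    src = Counter(PLACEHOLDER_RE.findall(source))
    dst = Counter(PLACEHOLDER_RE.findall(translated))
    # Counter subtraction keeps only tokens that occur more often on the left.
    for token in sorted((src - dst).keys()):
        problems.append(f"missing in translation: {token}")
    for token in sorted((dst - src).keys()):
        problems.append(f"unexpected in translation: {token}")
    if source.endswith("\n") != translated.endswith("\n"):
        problems.append("trailing newline differs")
    return problems


ok = shape_problems("Hello {username}!\n", "Bonjour {username} !\n")
broken = shape_problems("You have {count} items", "Vous avez des articles")
```

Using `Counter` rather than `set` means a placeholder that appears twice in the source must also appear twice in the translation. An empty list from `shape_problems` is the condition to require before committing.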
Frequently asked questions
Do I really need both Weblate and Tolgee?
No — pick one. Weblate is the right default if your translators are comfortable with git and you want the whole pipeline to feel like normal engineering (PRs, commit history, CI gates). Tolgee is the right default if your translators are non-technical and you want them editing strings in-context on the running app via alt-click. Most teams that try both end up retiring the one that doesn't match how their translators actually work. The pack lists both because the choice depends on your team, not on the technology.
Why include LibreTranslate when an LLM like Claude can translate anything?
Three reasons. First, cost: LibreTranslate runs on your hardware with no per-token billing; for a bulk pre-translation pass on a 50,000-string product, the cost difference is large. Second, latency: LibreTranslate returns in milliseconds, LLMs in seconds. Third, privacy: your translation memory often contains pre-launch product details, contracts, and customer data — keeping that in your own VPC matters. The pattern that works is to LLM the strings where context wins (marketing, errors), and NMT the strings where consistency and cost win (bulk UI, internal docs).
How is Vale different from LanguageTool — aren't they both linters?
They solve different layers. LanguageTool is a grammar checker: it knows German cases, French agreement, Spanish ser/estar — the things a translator might get wrong because the target language is hard. Vale is a style and terminology linter: it enforces your style guide ("never say login, always sign in", "never translate Pull Request"). You want both. LanguageTool catches grammar drift; Vale catches policy drift. Running only one of them leaves a class of bugs unprotected.
What's the smallest possible version of this pack I can run this week?
Three picks: Weblate (Docker, one afternoon to stand up against your git repo), LibreTranslate (Docker, one container, wire it as Weblate's MT suggestion engine), and Vale (CLI, one config file with your forbidden-terms list). With those three you have extract → MT pre-translate → terminology gate → commit. Add LanguageTool the following week to catch the grammar bugs LibreTranslate doesn't. Add an LLM pass for marketing copy after that. The rest of the pack you add only when you hit a format the three-tool baseline cannot handle.
How do I keep translators productive when context is split between source code, screenshots, glossary, and the TMS?
The single biggest lever is putting the screenshot of the actual UI in the translation memory, scoped to the string. Tolgee does this with one click; Weblate needs the screenshot plugin and a small CI job that uploads a Storybook screenshot per component. Once translators can see what they're translating, throughput on UI strings goes up roughly 30-50% and you stop getting bug reports about "this button doesn't fit in German". The second lever is auto-loading the last 5 translations of similar strings as context — most TMS systems do this; turn it on.
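The second lever, surfacing earlier translations of similar strings, can be approximated with the standard library. A real TMS uses its own matching and typically also weighs recency; the sketch below ranks purely by string similarity via `difflib`, and the `memory` dictionary, its French entries, and the 0.6 cutoff are illustrative assumptions.

```python
import difflib


def similar_translations(source: str, memory: dict[str, str],
                         limit: int = 5, cutoff: float = 0.6) -> list[tuple[str, str]]:
    """Return up to `limit` (source, translation) pairs close to `source`."""
    # get_close_matches returns the best matches first, filtered by cutoff.
    matches = difflib.get_close_matches(source, list(memory), n=limit, cutoff=cutoff)
    return [(match, memory[match]) for match in matches]


# Illustrative translation memory: source string -> stored translation.
memory = {
    "Save changes": "Enregistrer les modifications",
    "Save draft": "Enregistrer le brouillon",
    "Delete account": "Supprimer le compte",
}
suggestions = similar_translations("Save change", memory)
```

The returned pairs are what you would show next to the string being translated, or append to an LLM prompt as context.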