TOKREPO · Arsenal IA

Stable

Pipeline de traduction i18n à grande échelle

Dix picks pour l'équipe app qui livre en 10+ langues et qui en a assez de payer un SaaS au string. Pipeline CI : extraire les clés avec Weblate/Tolgee → traduire avec l'OpenAI SDK ou Transformers/LibreTranslate auto-hébergés → glossaire avec Vale → QA grammatical avec LanguageTool → orthographe avec typos → réinjecter. pre-commit et markdownlint tiennent la ligne à chaque PR.

10 ressources

À propos de ce pack

What's in this pack

This is the stack for the app team shipping in 10+ languages without a dedicated localization vendor — a backend engineer, a frontend engineer, and a part-time PM who all share the on-call rotation and cannot afford a per-string SaaS bill that scales linearly with the product. The job stops being "translate strings" and starts being "keep translations in sync with main on every PR without a human in the loop for the bulk path".

It is not the same job as the one in the translator's multi-lingual stack. That pack is for the human translator and localization engineer running a real pipeline with translators in the loop. This pack is for the dev team that wants to automate that pipeline end-to-end in CI and only escalate to a human reviewer when a gate fails. Same five stages — extract, translate, QA, validate, reinject — but the picks change because the operator changes.

The difference shows up in the tool choices. We keep Weblate and Tolgee because every pipeline still needs a TMS, but we add pre-commit (the CI gate orchestrator), typos (CI-friendly spell-checker), markdownlint (so the translated .md files don't break the docs build), the openai-python SDK (the LLM caller in your translation script), and transformers (so you can fine-tune or host an NMT model on your own GPU without an API bill). The bulk-translate path becomes code, not clicks.

Install in this order (extract → AI translate → QA gate → reinject)

Weblate — the TMS that holds the source of truth. Start here because every other tool either feeds it or reads from it. Weblate watches your git repo, extracts strings from gettext/xliff/json/Android XML/properties, and pushes completed translations back as commits. Self-hosted on Docker, configured against your existing GitHub or GitLab repo, this is the foundation.
Tolgee — the developer-friendly alternative. Pick Tolgee instead of Weblate if your reviewers are PMs and designers who need to see strings in-context on the running app (alt-click). Pick Weblate if your reviewers live in PRs. Most teams pick one and stick with it for years; both are listed because the right answer depends on who reviews translations.
LibreTranslate — the self-hosted NMT engine for the bulk path. Wire this into Weblate's automatic-suggestion backend so every new string gets a machine translation before a human ever sees it. No per-token cost, no rate limit, no compliance review for sending pre-launch strings to a third-party SaaS. The first 80% of UI strings ship through LibreTranslate without further escalation.
Hugging Face Transformers — when LibreTranslate's Argos model isn't fluent enough for your target languages and you need to fine-tune. Load NLLB-200 or M2M-100, fine-tune on your existing translation memory (export it as a TMX file from Weblate), and serve from your own GPU. This is the escape hatch for low-resource languages and post-edit-heavy locales where the off-the-shelf NMT loses fluency.
openai-python (or any LLM SDK) — the context-aware translator for strings that have to read like a human wrote them. Marketing copy, error messages users see in their language, onboarding screens. Your translation script reads the source string + screenshot URL + glossary + the last 3 translations of similar strings, builds a prompt, calls the LLM, and writes the result back to Weblate. Always pass the glossary in the prompt. Always.
Vale — the terminology gate. Configure a rule pack with your forbidden terms (login → sign in), brand terms that must never be translated (Pull Request, Slack), and tone constraints per locale (formal Sie in German, informal tu in French marketing). Vale runs on every PR via pre-commit. A glossary violation fails the build. No exceptions, no soft warnings.
LanguageTool — the grammar and style gate. Run it on the translated output, not the source. Catches the silent class of bugs where the translation is grammatically wrong in ways a non-speaker reviewer would never notice — German case, French agreement, Spanish ser/estar, Russian plural forms. Self-hosted as an HTTP API in your CI cluster.
typos — the spell-check gate. Rust-fast, ships as a single binary, runs in pre-commit. Catches the recieve / recieved / seperator class of bugs that survive LLM translation because the LLM was trained on the internet, which contains those typos at scale. Configure a per-locale dictionary for product names and you're done.
markdownlint — the structure gate for translated docs. When your README.md ships in 10 locales, you cannot have one locale's translation silently break the heading hierarchy, mismatch a list indent, or close a code fence in the wrong place. markdownlint catches all three. Run it on every translated .md in CI.
pre-commit — the orchestrator that wires the four gates together. One .pre-commit-config.yaml runs typos + Vale + LanguageTool + markdownlint on every staged file before commit and again in CI. If any gate fails, the commit fails, the PR fails, nothing reinjects. This is the file that turns the stack from "a pile of tools we ran once" into "a pipeline that holds the line on every PR".

How they fit together (CI-driven pipeline)

  source content (po / xliff / json / Android XML / md)
        │
        ▼
  ┌──── Weblate (or Tolgee) ─────┐
  │   extract on git push        │
  │   ─────────────────────────  │
  │   present strings via REST   │
  └──────────────┬───────────────┘
                 ▼
     ┌──── translation script ────┐
     │  for each string:          │
     │   • lookup translation memory │
     │   • build prompt (glossary + screenshot) │
     │   • route by string type:  │
     │      marketing → OpenAI SDK│
     │      UI bulk   → LibreTranslate │
     │      hard lang → Transformers (fine-tuned) │
     │   • write back to Weblate  │
     └──────────────┬─────────────┘
                    ▼
        ┌──── pre-commit gates ────┐
        │  Vale       (glossary)   │
        │  LanguageTool (grammar)  │
        │  typos      (spelling)   │
        │  markdownlint (.md shape)│
        │   ANY fail = PR fails    │
        └──────────────┬───────────┘
                       ▼
           reinject via Weblate commit → git → build

The gate row is the load-bearing piece. Without pre-commit orchestrating the four checkers, glossary drift, grammar bugs, spelling typos, and broken markdown all leak into production on different days through different paths. With it, every PR either passes all four or doesn't merge.

Tradeoffs you'll hit

OpenAI SDK vs self-hosted Transformers vs LibreTranslate — these three are different cost/quality/privacy points. LLM via OpenAI SDK is highest quality for context-sensitive strings (marketing, errors) and costs cents per thousand strings; LibreTranslate is free and runs in your VPC but loses fluency on low-resource languages; Transformers fine-tuned on your own TM is the escape hatch when neither works. The production pattern: route by string type, not by tool. Marketing copy → LLM. Bulk UI → LibreTranslate. Locales where LibreTranslate is bad → fine-tuned NMT via Transformers.
Weblate vs Tolgee vs SaaS (Lokalise/Crowdin/Phrase) — SaaS ships faster but locks you in and prices per string. With a 50,000-string app in 12 locales, the math gets unfriendly fast. Weblate is the right default for teams whose reviewers live in PRs; Tolgee is the right default for teams whose reviewers need in-context editing. Stick with SaaS only when integrations you genuinely use justify the bill.
CI gates as soft warnings vs hard failures — soft warnings get ignored. Hard failures cause occasional drama when a translation is genuinely fine and the gate is wrong. The right answer is hard failures with a documented override path: an engineer adds a # vale-ignore: TermsCheck comment with a code review justification, the PR proceeds, the override gets audited weekly. Never run the gates as advisory.
Translation memory ownership — your TM is more sensitive than your code. It contains every release note before launch, every customer support reply, every legal disclaimer. Self-host the TMS (Weblate or Tolgee) and only send strings to a third-party LLM after PII and embargoed-content redaction. The LibreTranslate + self-hosted Transformers path exists for exactly this reason.

Common pitfalls

Placeholder breakage — the LLM helpfully translates {username} to {nombreusuario} and the app crashes on next render. Turn on Weblate's placeholder check; configure your translation script to lock placeholders before the LLM call and substitute them back after.
Forgetting to pass the glossary in the LLM prompt — the single most common bug in homegrown translation scripts. Without the glossary, the LLM picks a new word for "workspace" every call. The fix is one line in the prompt template; do not skip it.
Routing by language instead of by string type — "all French goes through LLM" sounds cheap until your French marketing copy reads like a robot. Route by string type: marketing → LLM, UI bulk → NMT, regardless of language.
Treating CI gates as advisory — the moment one engineer overrides a Vale failure without code review, the gate is dead. Either it's a hard fail or it's nothing.
Skipping markdownlint on translated docs — translated README.md files break the docs build at 2am because a Spanish translator put a * where a - was. markdownlint is the cheapest insurance in this pack; turn it on first.
No human-in-the-loop sampler — a fully-automated pipeline drifts. Sample 1% of merged translations into a manual review queue weekly. The metrics from that sample tell you which gates need tuning and which locales need a Transformers fine-tune.

INSTALLER · UNE COMMANDE

$ tokrepo install pack/i18n-translation-pipeline-scale

passez-la à votre agent — ou collez-la dans votre terminal

Ce qu'il contient

10 ressources prêtes à installer

Skill#01

Weblate — Web-Based Continuous Localization Platform

A web-based translation management system with tight version control integration. Weblate automates the localization workflow with translation memory, machine translation, and quality checks.

by AI Open Source·224 views

$ tokrepo install weblate-web-based-continuous-localization-platform-cb2ceff8

Skill#02

Tolgee — Developer-Friendly Localization Platform

An open-source localization platform that lets developers and translators manage translations through a web UI, in-context editing, and native SDK integrations for React, Vue, Angular, and more.

by AI Open Source·303 views

$ tokrepo install tolgee-developer-friendly-localization-platform-5b96a366

Skill#03

LibreTranslate — Self-Hosted Translation API with No Rate Limits

LibreTranslate is a self-hostable translation API powered by open-source Argos Translate models. No API keys, no rate limits, no data sent to third parties — a drop-in replacement for Google Translate when privacy matters.

by AI Open Source·341 views

$ tokrepo install libretranslate-self-hosted-translation-api-no-rate-limits-3109a712

Skill#04

Hugging Face Transformers — The Universal Library for Pretrained Models

transformers is the de-facto Python library for using and fine-tuning pretrained models — BERT, GPT, Llama, Whisper, ViT, and 250,000+ others. One unified API works across PyTorch, TensorFlow, and JAX.

by Hugging Face·260 views

$ tokrepo install hugging-face-transformers-universal-library-pretrained-b0920ac9

Skill#05

openai-python — Official OpenAI Python SDK

Call the OpenAI REST API from Python 3.9+ with typed request/response models and sync/async clients. Use it as a core SDK for agents and app backends.

by Agent Toolkit·206 views

$ tokrepo install openai-python-official-openai-python-sdk

Skill#06

Vale — Syntax-Aware Prose Linter for Technical Writing

Vale is a command-line tool that enforces writing style guides on your prose, supporting custom rules for documentation teams to ensure consistent terminology, tone, and formatting across Markdown, AsciiDoc, and more.

by AI Open Source·145 views

$ tokrepo install vale-syntax-aware-prose-linter-technical-writing-13b1fee7

Skill#07

LanguageTool — Self-Hosted Grammar and Style Checker for 25+ Languages

An open-source grammar, style, and spell checker that supports over 25 languages and can be self-hosted as an HTTP API server for private proofreading.

by Script Depot·284 views

$ tokrepo install languagetool-self-hosted-grammar-style-checker-25-languages-29fd01ff

Script#08

typos — Source Code Spell Checker for CI

typos catches spelling mistakes in code, docs, config, and comments with low false positives. Run it locally, in pre-commit, or as a CI gate.

by crate-ci·48 views

$ tokrepo install typos-source-code-spell-checker-for-ci

Skill#09

Markdownlint — Lint Markdown for AI Content Quality

Node.js markdown linter with 50+ rules. Ensure consistent formatting in CLAUDE.md, .cursorrules, README files, and AI-generated documentation across your project.

by Script Depot·339 views

$ tokrepo install markdownlint-lint-markdown-ai-content-quality-2f24f820

Skill#10

pre-commit — A Framework for Managing Git Hook Scripts

pre-commit manages and installs multi-language Git hooks from a YAML file. It runs linters, formatters, and checks before commits reach CI — catching issues early with zero manual setup per developer.

by Script Depot·247 views

$ tokrepo install pre-commit-framework-managing-git-hook-scripts-69a51c48

Questions fréquentes

How is this pack different from the Translator's Multi-Lingual Stack?

Different operator, different framing. The translator pack is built for a human localization engineer running a pipeline with translators in the loop — Weblate, glossary owner, post-edit workflow, format-aware tools for PDF and video. This pack is built for the dev team that wants no human in the bulk path: pre-commit orchestrates Vale, LanguageTool, typos, and markdownlint on every PR; the openai-python SDK or self-hosted Transformers do the bulk translation; humans only see strings the gates rejected. Same TMS layer (Weblate, Tolgee, LibreTranslate appear in both because they're the right answer for both jobs), different automation layer.

Why three translation engines instead of one?

Because no single engine is right across the cost-quality-privacy space. OpenAI's API gives you a context-aware LLM that knows {user_name} is a placeholder and that 'trial' in a SaaS app means free-trial, not court case — but you pay per token and send strings to a third party. LibreTranslate runs free in your VPC with no rate limit but is less fluent on low-resource languages. Hugging Face Transformers lets you fine-tune NLLB-200 or M2M-100 on your own translation memory and host it on your own GPU — best for the locales where LibreTranslate is bad and OpenAI is expensive. The production pattern is to route by string type, not by language: marketing through the LLM, bulk UI through LibreTranslate, hard locales through fine-tuned Transformers.

Do I really need both Vale and LanguageTool in the pipeline?

Yes — they catch different bug classes. LanguageTool is a grammar checker: it knows German case agreement, French gender agreement, Spanish ser/estar, Russian plural forms, the things a non-native reviewer would never spot. Vale is a style and terminology linter: it enforces your glossary (never say login, always sign in), brand terms (never translate Pull Request), and tone constraints per locale. LanguageTool catches grammar drift, Vale catches policy drift. Running only one of them leaves the other class of bugs in production. Both are cheap to run in CI.

What's the smallest viable version of this pipeline I can ship this week?

Four picks. Weblate (Docker, one afternoon against your git repo). LibreTranslate (one container, wired as Weblate's MT suggestion backend). pre-commit running typos + markdownlint (one .pre-commit-config.yaml, ten minutes). And the openai-python SDK in a 100-line translation script that reads Weblate's REST API, calls the LLM for any string tagged 'marketing' with the glossary in the prompt, and writes back. That's the v1: bulk pre-translation by NMT, marketing strings by LLM, two CI gates holding the line. Add Vale + LanguageTool the second week to catch what slipped through. Add Transformers fine-tuning only when you can prove a specific locale needs it.

How do I avoid sending sensitive translation memory to a third-party LLM?

Three layers. First, route by string sensitivity: anything tagged confidential or pre-launch in Weblate routes to LibreTranslate or your self-hosted Transformers, never to the OpenAI API. Second, run a PII redactor before any LLM call — replace user names, emails, customer IDs with placeholders, swap them back after translation. Third, sign a DPA with your LLM provider and document the data flow in your security review. The pack lists LibreTranslate and Transformers ahead of openai-python for exactly this reason: the self-hosted path is the default, the LLM is the escape hatch for the strings where context wins, not the bulk hammer.

PLUS DANS L'ARSENAL

12 packs · 80+ ressources sélectionnées

Découvrez tous les packs curatés sur la page d'accueil

Retour à tous les packs