TOKREPO · AI Arsenal
New · this week

Translator's Multilingual Stack

Ten picks for the localization engineer, translator, or i18n PM with a real pipeline: extract → glossary → translate (LLM + NMT) → terminology QA → reinject. Self-hosted Weblate/Tolgee, LibreTranslate as a fallback, Vale + LanguageTool for QA, PDFMathTranslate and KrillinAI for difficult formats.

10 resources

What's in this pack

This is the stack for the working localization engineer, translator, or i18n PM — the person who has to ship a po/xliff/json file that doesn't break the build, doesn't lose the placeholder {username}, and doesn't translate "trial" as "court case" because the LLM forgot it was a SaaS app.

It is not a single magic translator. The job has five stages — extract, build a glossary, translate, QA, reinject — and each stage has its own tool with its own failure mode. The pack picks one default for each stage, plus one fallback for the two stages where defaults bite (the translation engine itself, and the formats that don't fit a TMS).

Every tool in the pack is self-hostable; the one exception is the hosted LLM in step 5. That matters: much of the translation memory in a real company is legally sensitive (contracts, release notes before embargo, customer support tickets). Sending it to a third-party MT vendor without review is how you end up on a compliance call.

Install in this order (extract → glossary → translate → QA → reinject)

  1. Weblate — the TMS. Start here, because every other tool plugs into it. Weblate watches your git repo, extracts strings from gettext/xliff/json/properties, hands them to translators (or to an MT engine), and pushes the result back as a commit. Self-hosted on Docker in an afternoon.
  2. Tolgee — the developer-friendly alternative. Pick Tolgee instead of Weblate if your translators are non-technical and need an in-context editor (the Tolgee SDK lets them alt-click a string on the running app). Most teams pick one or the other, not both — Weblate for git-native engineering teams, Tolgee for product-led teams.
  3. LibreTranslate — the NMT engine. Self-hosted Argos models, no API keys, no rate limits. This is your fallback when the LLM is too slow, too expensive, or refuses to translate something. Wire it in as Weblate's automatic-suggestion backend.
  4. Fairseq — when you need to actually train or fine-tune an NMT model on your in-house corpus. Most teams won't go here. The ones that do (regulated industries, low-resource languages, post-edit-heavy workflows) cannot avoid it. Knowing it exists is half the battle.
  5. Claude / GPT-4 class LLM (via your IDE or API) — the context-aware translator. Use it for the strings that have to read like a human wrote them: marketing copy, error messages users see, onboarding. Always pass the glossary and surrounding context in the prompt. Always.
  6. NLTK — the Python NLP toolkit. You'll use it for the unglamorous middle-stage work: tokenizing strings, sentence-splitting, extracting candidate terms for the glossary, computing BLEU/chrF on your translation outputs. Not a translator. The duct tape.
  7. LanguageTool — grammar + style QA across 25+ languages. Run it on the translated output, not the source. Catches the silent class of bugs where the translation is grammatically wrong in ways a non-speaker reviewer would never notice (German case, French agreement, Spanish ser/estar).
  8. Vale — prose linter with custom rule packs. This is your terminology enforcer: forbid "login" when style guide says "sign in", forbid translating "Pull Request" at all, flag forbidden tone in marketing locales. Pairs with LanguageTool: Vale catches policy violations, LanguageTool catches grammar.
  9. PDFMathTranslate — translate PDFs while preserving layout, math, and figures. The thing every other PDF translator gets wrong. Critical for technical docs, academic papers, regulatory submissions that have to round-trip through PDF.
  10. Whisper + KrillinAI — when the source is spoken. Whisper extracts the transcript with timestamps; KrillinAI handles the full video-to-100-language pipeline, including dubbing if you need it. Use these only when the source is audio or video: they're the escape hatch for the format the TMS cannot touch.
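
Step 5 says to always pass the glossary and surrounding context in the prompt. A minimal sketch of what that could look like in Python follows. The `Segment` fields, the helper names, and the prompt wording are all assumptions made for illustration and are not part of any tool in the pack; the resulting string would go to whichever LLM API you use.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One translatable string plus its context (hypothetical structure)."""
    key: str
    source: str
    component: str          # e.g. "button", "error message"
    surrounding: str = ""   # neighbouring sentence or UI text, if known


def relevant_terms(source: str, glossary: dict[str, str]) -> dict[str, str]:
    """Keep only glossary entries whose source term occurs in the string."""
    lowered = source.lower()
    return {term: target for term, target in glossary.items() if term.lower() in lowered}


def build_prompt(segment: Segment, glossary: dict[str, str], target_lang: str) -> str:
    """Assemble a translation prompt that carries glossary and context."""
    lines = [
        f"Translate this UI string into {target_lang}.",
        "Keep placeholders in curly braces exactly as written.",
        f"Component type: {segment.component}",
    ]
    if segment.surrounding:
        lines.append(f"Surrounding text: {segment.surrounding}")
    terms = relevant_terms(segment.source, glossary)
    if terms:
        lines.append("Glossary (must be used):")
        lines.extend(f"- {term} -> {target}" for term, target in terms.items())
    lines.append(f"String: {segment.source}")
    return "\n".join(lines)
```

Filtering the glossary down to the terms that occur in the string keeps the prompt short. The substring match here is deliberately naive; a real pipeline would likely match on word boundaries or lemmas.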

How they fit together (translation pipeline)

  source content (po / xliff / json / md / pdf / video)
                      │
                      ▼
   ┌──────── Weblate (or Tolgee) ────────┐
   │  extract strings + segment          │
   │  ─────────────────────────────────  │
   │  for PDF   → PDFMathTranslate       │
   │  for video → Whisper + KrillinAI    │
   └──────────────────┬──────────────────┘
                      ▼
              glossary (TMX/CSV)
         maintained by terminologist
                      │
                      ▼
   ┌───────────── translate ─────────────┐
   │  LLM                LibreTranslate  │
   │  (context-aware     (bulk +         │
   │   copy)              fallback)      │
   │  Fairseq (if fine-tuning)           │
   └──────────────────┬──────────────────┘
                      ▼
   ┌────────────── QA gate ──────────────┐
   │  Vale (terms)                       │
   │    AND                              │
   │  LanguageTool (grammar)             │
   │    AND                              │
   │  NLTK (BLEU/chrF score)             │
   └──────────────────┬──────────────────┘
                      ▼
   reinject via Weblate commit → git → build

The gate that matters is the QA gate: nothing reaches the reinject step until Vale + LanguageTool both pass on the translated string and the glossary report shows zero forbidden-term hits. Without that gate, LLM context loss eats you alive at scale.
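
The gate logic can be sketched as a small Python function. This is an illustration under stated assumptions, not how Vale or LanguageTool are actually invoked: `vale_ok` and `languagetool_ok` stand in for the pass/fail results you would collect from those tools, and `forbidden_term_hits` is a hypothetical stand-in for the glossary report.

```python
import re
from dataclasses import dataclass, field


@dataclass
class GateResult:
    passed: bool
    reasons: list[str] = field(default_factory=list)


def forbidden_term_hits(text: str, forbidden: list[str]) -> list[str]:
    """Return the forbidden terms that occur as whole words, ignoring case."""
    hits = []
    for term in forbidden:
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            hits.append(term)
    return hits


def qa_gate(translated: str, forbidden: list[str],
            vale_ok: bool, languagetool_ok: bool) -> GateResult:
    """Block reinjection unless every check passes (hypothetical gate)."""
    reasons = []
    if not vale_ok:
        reasons.append("Vale reported violations")
    if not languagetool_ok:
        reasons.append("LanguageTool reported violations")
    for term in forbidden_term_hits(translated, forbidden):
        reasons.append(f"forbidden term: {term}")
    return GateResult(passed=not reasons, reasons=reasons)
```

Returning the reasons alongside the boolean lets a CI job print why a string was held back instead of only failing.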

Tradeoffs you'll hit

  • LLM vs Google Translate vs DeepL vs LibreTranslate — LLMs win on context (they know {user_name} is a placeholder, not a word to translate). DeepL wins on fluency for EU languages. Google has the widest language coverage. LibreTranslate wins on cost + privacy because it runs in your VPC. Production stacks route by string: marketing copy → LLM, UI strings → DeepL or NMT, internal docs → LibreTranslate.
  • Glossary strictness — too strict and the translation reads like a robot translated it. Too loose and "sign in" / "log in" / "login" all coexist in one product. The middle path: hard-enforce on brand terms and legal terms, soft-suggest on style choices, let the reviewer override with a comment.
  • Weblate vs Tolgee vs Lokalise/Crowdin — the hosted SaaS options (Lokalise, Crowdin, Phrase) ship faster but lock you in and price per string. Weblate is the open-source default if your translators live in git. Tolgee is the open default if they need in-context editing. Skip SaaS unless you genuinely need the integrations and don't mind the bill.
  • MT pre-translate, then post-edit, vs human-only — for roughly 90% of strings, MT + post-edit is typically 3-5x faster and reaches comparable quality. The remaining 10% where it fails (legal, brand voice, jokes, cultural references) you flag up front with a tag in the source file and route human-only. Pretending the entire app is in the 10% is how localization costs explode.
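
Routing by string, with a human-only override, can be sketched as a lookup table. The category names, the `human-only` tag, and the default engine are assumptions chosen for this example; they mirror the routing described in the bullets above, not an API of any tool in the pack.

```python
from enum import Enum


class Engine(Enum):
    LLM = "llm"
    NMT = "nmt"                        # DeepL or another NMT service
    LIBRETRANSLATE = "libretranslate"
    HUMAN = "human"


# Hypothetical category names; use whatever your source files already carry.
ROUTES = {
    "marketing": Engine.LLM,
    "ui": Engine.NMT,
    "internal-docs": Engine.LIBRETRANSLATE,
}


def route(category: str, tags: frozenset[str] = frozenset()) -> Engine:
    """Pick a translation engine for one string."""
    if "human-only" in tags:           # flagged up front in the source file
        return Engine.HUMAN
    # Assumption: unknown categories fall back to the self-hosted engine.
    return ROUTES.get(category, Engine.LIBRETRANSLATE)
```

Defaulting unknown categories to the self-hosted engine is a design choice: an unclassified string never leaves your infrastructure by accident.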

Common pitfalls

  • Context loss — translating Save without knowing whether it's a button label, a feature name, or part of a status message such as "Saved". The fix: include screenshot URLs, the surrounding sentence, and the UI component type alongside each string. Tolgee does this natively; Weblate can attach screenshots to strings as visual context, but you have to upload and link them yourself.
  • Format breakage — losing the {count} placeholder, breaking the <a href="..."> HTML, or removing the trailing newline. Always validate the shape of the translated string against the source before commit. Weblate has placeholder checks; turn them on.
  • Glossary drift — three translators each chose a different word for "workspace" over six months. Run a weekly Vale + glossary audit; treat a glossary violation as a CI failure, not a soft warning.
  • Pluralization — English has 2 plural forms, Russian has 3 (4 as CLDR plural categories, which ICU uses), Arabic has 6, Chinese has 1. Use ICU MessageFormat from day 1; don't string-concatenate plurals.
  • RTL languages — Arabic and Hebrew don't just flip text; they flip layout, parentheses, and punctuation rules. Test RTL in QA, not in production.
  • Sending TMs to third-party LLMs without redaction — your translation memory contains every release note before launch and every customer support ticket. Redact PII and embargoed content before it leaves your VPC. LibreTranslate exists for exactly this reason.
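
The format-breakage check from the list above (placeholders, HTML tags, trailing newline) can be sketched in a few lines of Python. This is a minimal illustration, assuming simple `{name}`-style placeholders; it does not understand ICU plural syntax and is not a replacement for Weblate's own checks. The function name `shape_errors` is hypothetical.

```python
import re
from collections import Counter

PLACEHOLDER = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")
HTML_TAG = re.compile(r"</?[A-Za-z][^>]*>")


def shape_errors(source: str, translated: str) -> list[str]:
    """Compare the 'shape' of a translation against its source string."""
    errors = []
    src_ph = Counter(PLACEHOLDER.findall(source))
    dst_ph = Counter(PLACEHOLDER.findall(translated))
    if src_ph != dst_ph:
        missing = sorted((src_ph - dst_ph).elements())
        extra = sorted((dst_ph - src_ph).elements())
        errors.append(f"placeholders differ: missing={missing} extra={extra}")
    # Tags must match exactly, attributes included, but may be reordered.
    if Counter(HTML_TAG.findall(source)) != Counter(HTML_TAG.findall(translated)):
        errors.append("HTML tags differ")
    if source.endswith("\n") != translated.endswith("\n"):
        errors.append("trailing newline differs")
    return errors
```

Comparing multisets rather than ordered lists lets the translator reorder placeholders, which many target languages require.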
INSTALL · ONE COMMAND
$ tokrepo install pack/translator-multilingual-stack
pass it to your agent, or paste it into your terminal
What's included

10 resources ready to install

Skill#01
Weblate — Web-Based Continuous Localization Platform

A web-based translation management system with tight version control integration. Weblate automates the localization workflow with translation memory, machine translation, and quality checks.

by AI Open Source·120 views
$ tokrepo install weblate-web-based-continuous-localization-platform-cb2ceff8
Skill#02
Tolgee — Developer-Friendly Localization Platform

An open-source localization platform that lets developers and translators manage translations through a web UI, in-context editing, and native SDK integrations for React, Vue, Angular, and more.

by AI Open Source·105 views
$ tokrepo install tolgee-developer-friendly-localization-platform-5b96a366
Skill#03
LibreTranslate — Self-Hosted Translation API with No Rate Limits

LibreTranslate is a self-hostable translation API powered by open-source Argos Translate models. No API keys, no rate limits, no data sent to third parties — a drop-in replacement for Google Translate when privacy matters.

by AI Open Source·197 views
$ tokrepo install libretranslate-self-hosted-translation-api-no-rate-limits-3109a712
Skill#04
PDFMathTranslate — Translate PDF Papers Preserving Format

Translate PDF scientific papers while preserving math formulas, charts, and layout. Supports Google, DeepL, OpenAI, Ollama. CLI, GUI, MCP, Docker, Zotero plugin.

by Script Depot·234 views
$ tokrepo install pdfmathtranslate-translate-pdf-papers-preserving-format-4c628f43
Skill#05
Fairseq — Sequence Modeling Toolkit by Meta

Facebook AI Research sequence modeling toolkit for training custom models in translation, summarization, language modeling, and other text generation tasks.

by Script Depot·119 views
$ tokrepo install fairseq-sequence-modeling-toolkit-meta-834cabb9
Skill#06
NLTK — Natural Language Processing Toolkit for Python

NLTK (Natural Language Toolkit) is the foundational Python library for computational linguistics, providing tokenizers, parsers, classifiers, and corpora used in NLP education and research since 2001.

by AI Open Source·101 views
$ tokrepo install nltk-natural-language-processing-toolkit-python-297e4ff3
Skill#07
LanguageTool — Self-Hosted Grammar and Style Checker for 25+ Languages

An open-source grammar, style, and spell checker that supports over 25 languages and can be self-hosted as an HTTP API server for private proofreading.

by Script Depot·165 views
$ tokrepo install languagetool-self-hosted-grammar-style-checker-25-languages-29fd01ff
Skill#08
Vale — Syntax-Aware Prose Linter for Technical Writing

Vale is a command-line tool that enforces writing style guides on your prose, supporting custom rules for documentation teams to ensure consistent terminology, tone, and formatting across Markdown, AsciiDoc, and more.

by AI Open Source·69 views
$ tokrepo install vale-syntax-aware-prose-linter-technical-writing-13b1fee7
Skill#09
Whisper — OpenAI Speech-to-Text

OpenAI's open-source speech recognition model. Transcribe audio/video to text with word-level timestamps in 99 languages. Essential for subtitle generation.

by OpenAI·214 views
$ tokrepo install whisper-openai-speech-text-eb0f9dd6
Skill#10
KrillinAI — AI Video Translation and Dubbing in 100 Languages

An open-source tool that uses LLMs to translate and dub video content into over 100 languages with one-click deployment, optimized for YouTube, TikTok, and other platforms.

by AI Open Source·85 views
$ tokrepo install krillinai-ai-video-translation-dubbing-100-languages-e0ea662e
Frequently asked questions

Do I really need both Weblate and Tolgee?

No — pick one. Weblate is the right default if your translators are comfortable with git and you want the whole pipeline to feel like normal engineering (PRs, commit history, CI gates). Tolgee is the right default if your translators are non-technical and you want them editing strings in-context on the running app via alt-click. Most teams that try both end up retiring the one that doesn't match how their translators actually work. The pack lists both because the choice depends on your team, not on the technology.

Why include LibreTranslate when an LLM like Claude can translate anything?

Three reasons. First, cost: LibreTranslate runs on your hardware with no per-token billing; for a bulk pre-translation pass on a 50,000-string product, the cost difference is large. Second, latency: LibreTranslate returns in milliseconds, LLMs in seconds. Third, privacy: your translation memory often contains pre-launch product details, contracts, and customer data — keeping that in your own VPC matters. The pattern that works is to LLM the strings where context wins (marketing, errors), and NMT the strings where consistency and cost win (bulk UI, internal docs).
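
For the NMT side of that split, a call to a self-hosted LibreTranslate instance can be sketched with the Python standard library alone. The sketch assumes an instance listening on `http://localhost:5000` with no API key required; adjust `base_url` to your deployment. The helper names are hypothetical.

```python
import json
import urllib.request


def build_translate_request(text: str, source: str, target: str,
                            base_url: str = "http://localhost:5000") -> urllib.request.Request:
    """Build a POST request for LibreTranslate's /translate endpoint."""
    payload = {"q": text, "source": source, "target": target, "format": "text"}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/translate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate(text: str, source: str, target: str,
              base_url: str = "http://localhost:5000", timeout: float = 10.0) -> str:
    """Send the request and return the translated text (needs a running server)."""
    request = build_translate_request(text, source, target, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["translatedText"]
```

With a server running, `translate("Save", "en", "es")` would return the Spanish string. Keeping request construction separate from the network call makes the payload easy to unit-test without a server.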

How is Vale different from LanguageTool — aren't they both linters?

They solve different layers. LanguageTool is a grammar checker: it knows German cases, French agreement, Spanish ser/estar — the things a translator might get wrong because the target language is hard. Vale is a style and terminology linter: it enforces your style guide ("never say login, always sign in", "never translate Pull Request"). You want both. LanguageTool catches grammar drift; Vale catches policy drift. Running only one of them leaves a class of bugs unprotected.

What's the smallest possible version of this pack I can run this week?

Three picks: Weblate (Docker, one afternoon to stand up against your git repo), LibreTranslate (Docker, one container, wire it as Weblate's MT suggestion engine), and Vale (CLI, one config file with your forbidden-terms list). With those three you have extract → MT pre-translate → terminology gate → commit. Add LanguageTool the following week to catch the grammar bugs LibreTranslate doesn't. Add an LLM pass for marketing copy after that. The rest of the pack you add only when you hit a format the three-tool baseline cannot handle.

How do I keep translators productive when context is split between source code, screenshots, glossary, and the TMS?

The single biggest lever is putting the screenshot of the actual UI in the translation memory, scoped to the string. Tolgee does this with one click; in Weblate you attach screenshots to strings yourself, for example with a small CI job that uploads a Storybook screenshot per component. Once translators can see what they're translating, throughput on UI strings goes up roughly 30-50% and you stop getting bug reports about "this button doesn't fit in German". The second lever is auto-loading the last 5 translations of similar strings as context. Most TMSs do this; turn it on.
