VideoCaptioner — AI Subtitle Pipeline
LLM-powered video subtitle tool: Whisper transcription + AI correction + 99-language translation + styled subtitle export. 13,800+ stars.
Installation agent prête
Cet actif peut être installé après choix du runtime, vérification du plan et exécution de la commande adaptée.
npx -y tokrepo@latest install d12d8441-f0da-4d3d-a0c2-0f258b27336f --target codexÀ exécuter après confirmation du plan en dry-run.
What it is
VideoCaptioner is an open-source desktop application that automates the video subtitle workflow. It chains Whisper-based speech recognition, LLM-powered text correction, translation into 99 languages, and styled subtitle export (SRT, ASS, VTT) into a single pipeline. The tool provides a GUI for configuring each stage.
VideoCaptioner targets content creators, video editors, and localization teams who need accurate subtitles across languages. It handles the full lifecycle from raw audio to production-ready subtitle files without manual transcription or separate translation tools.
How it saves time or tokens
Manual subtitle creation involves transcribing audio, correcting recognition errors, translating to target languages, and formatting subtitle files. VideoCaptioner automates all four stages. The LLM correction step catches Whisper's common errors (proper nouns, technical terms, homophones) without human review. Batch processing handles multiple videos sequentially.
How to use
- Download the latest release from GitHub or clone the repo and install dependencies with
pip install -r requirements.txt. - Run
python main.pyto open the GUI. Configure your Whisper model and LLM API settings. - Load a video file, select source and target languages, and start the pipeline.
Example
# Clone and set up
git clone https://github.com/WEIFENG2333/VideoCaptioner.git
cd VideoCaptioner
pip install -r requirements.txt
python main.py
# Pipeline stages:
# 1. Whisper transcribes audio to text
# 2. LLM corrects transcription errors
# 3. Translation to selected languages
# 4. Export as SRT, ASS, or VTT with styling
| Stage | Input | Output |
|---|---|---|
| Transcription | Video/audio file | Raw text with timestamps |
| Correction | Raw transcript | Cleaned transcript |
| Translation | Cleaned text | Multi-language text |
| Export | Translated text | SRT/ASS/VTT files |
Related on TokRepo
- Video AI Tools — AI-powered video production tools
- Content AI Tools — Content creation and processing tools
Common pitfalls
- Whisper accuracy depends heavily on audio quality. Background music, overlapping speakers, and low-quality microphones reduce transcription accuracy.
- LLM correction requires API access (OpenAI or compatible). Without it, you get raw Whisper output which may contain errors for domain-specific vocabulary.
- Translation quality varies by language pair. Common pairs (English-Spanish, English-Chinese) produce better results than less common language combinations.
Questions fréquentes
The large-v3 model provides the best accuracy but requires more GPU memory and processing time. The medium model offers a good balance for most content. For fast processing with acceptable quality, use the small model.
VideoCaptioner supports OpenAI API and compatible endpoints. You can configure any OpenAI-compatible API (including local models via Ollama or LM Studio) for the correction and translation stages.
Yes. VideoCaptioner accepts both video and audio files. For audio-only inputs, it skips the video processing and goes directly to transcription.
VideoCaptioner exports SRT (SubRip), ASS (Advanced SubStation Alpha with styling), and VTT (WebVTT for web video). ASS format supports custom fonts, colors, and positioning.
Yes, when a CUDA-compatible GPU is available. Whisper uses GPU acceleration for faster transcription. The tool falls back to CPU processing if no GPU is detected, but processing time increases significantly.
Sources citées (3)
- VideoCaptioner GitHub— VideoCaptioner combines Whisper transcription, LLM correction, and multi-languag…
- Whisper GitHub— OpenAI Whisper speech recognition model
- Whisper Paper— Automatic speech recognition and translation research
En lien sur TokRepo
Source et remerciements
Created by WEIFENG2333. Licensed under GPL-3.0. VideoCaptioner — ⭐ 13,800+
Fil de discussion
Actifs similaires
Data Juicer — Data Processing Pipeline for Foundation Models
Data Juicer is a data processing toolkit designed for building and curating training datasets for large language models and multimodal models. It provides over 100 composable operators for filtering, deduplication, and quality analysis of text, image, audio, and video data.
Remotion Captions & Subtitles — AI-Powered Video Subtitles
AI skill for generating and rendering captions in Remotion videos. Supports transcription, word-level timing, and styled subtitle export.
Luigi — Python Pipeline Orchestration by Spotify
Luigi is a Python framework for building complex data pipelines with dependency resolution, scheduling, and failure handling built in.
Pachyderm — Data Versioning and Pipeline Orchestration
Version your data like Git, build reproducible data pipelines triggered by commits, and track lineage from raw input to model output on Kubernetes.