Skills · Mar 29, 2026 · 1 min read

Remotion Rule: Transcribe Captions

Remotion skill rule: Transcribing audio to generate captions in Remotion. Part of the official Remotion Agent Skill for programmatic video in React.

TL;DR
Use @remotion/install-whisper-cpp to transcribe audio and generate timed captions for Remotion videos.
§01

What it is

This is a Remotion skill rule for transcribing audio files to generate timed captions in programmatic video projects. It uses the @remotion/install-whisper-cpp package, which installs Whisper.cpp so that speech-to-text runs locally without sending audio to external APIs.

The rule is part of the official Remotion Agent Skill collection. It targets developers building React-based videos with Claude Code, Cursor, or OpenAI Codex who need automatic subtitles.

§02

How it saves time or tokens

Manual caption creation is tedious and error-prone. This rule teaches AI coding assistants the exact sequence: install whisper-cpp, download a model, transcribe audio, and convert the output to Remotion's caption format. Without the rule, an assistant might suggest external transcription APIs or generate incorrect package imports. The rule ensures the assistant produces working transcription code on the first attempt.

§03

How to use

  1. Install the Remotion skills collection:
npx skills add remotion-dev/skills
  2. Add the whisper-cpp package to your Remotion project:
npx remotion add @remotion/install-whisper-cpp
  3. Create a Node.js script that downloads the model, transcribes audio, and outputs caption data.
§04

Example

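import path from 'path';
import {
  downloadWhisperModel,
  installWhisperCpp,
  transcribe,
  toCaptions,
} from '@remotion/install-whisper-cpp';

// Folder where Whisper.cpp and the model are installed.
const whisperPath = path.join(process.cwd(), 'whisper.cpp');
// Whisper.cpp release to install; use the same value for install and transcribe.
const whisperCppVersion = '1.5.5';

await installWhisperCpp({ to: whisperPath, version: whisperCppVersion });
await downloadWhisperModel({
  model: 'medium.en',
  folder: whisperPath,
});

// The input must be a 16 kHz WAV file.
const whisperCppOutput = await transcribe({
  inputPath: 'src/audio/narration.wav',
  whisperPath,
  whisperCppVersion,
  model: 'medium.en',
  tokenLevelTimestamps: true,
});

// toCaptions() returns an object; the timed entries are in its captions array.
const { captions } = toCaptions({ whisperCppOutput });
console.log(JSON.stringify(captions, null, 2));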
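
Once the caption data exists, a composition needs to know which caption is on screen at a given frame. The following is a minimal sketch of that lookup, not part of the rule itself. The Caption shape (text, startMs, endMs) is an assumption that mirrors only the timing fields, and frameToMs and activeCaption are hypothetical helper names.

```typescript
// Assumed minimal caption shape; the real objects may carry more fields.
type Caption = { text: string; startMs: number; endMs: number };

// Convert a frame index to milliseconds at the composition's frame rate.
function frameToMs(frame: number, fps: number): number {
  return (frame / fps) * 1000;
}

// Return the caption whose time range covers the frame, or null if none does.
function activeCaption(
  captions: Caption[],
  frame: number,
  fps: number,
): Caption | null {
  const ms = frameToMs(frame, fps);
  return captions.find((c) => ms >= c.startMs && ms < c.endMs) || null;
}

// Illustrative data only, not real transcription output.
const sampleCaptions: Caption[] = [
  { text: 'Hello', startMs: 0, endMs: 500 },
  { text: ' world', startMs: 500, endMs: 1200 },
];
```

Inside a component, the frame and fps would typically come from Remotion's useCurrentFrame() and useVideoConfig() hooks.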
§05

Related on TokRepo

The rule needs minimal configuration and fits into a standard development workflow. It is open source, with documentation and community support available through the official repository. The project follows semantic versioning for stable releases.

For teams evaluating the rule, the main advantage is less manual work on a repetitive task. Because transcription is automated, there is less custom code to maintain and fewer integration points to manage, which lowers maintenance cost and speeds up iteration.

§06

Common pitfalls

  • The medium.en model handles English only. For other languages, use the multilingual medium model or a large variant, and expect slower processing with the larger models.
  • Whisper.cpp compiles native code on first install, so your system needs a C++ toolchain (Xcode CLI tools on macOS, build-essential on Linux).
  • Token-level timestamps (tokenLevelTimestamps: true) are required for word-by-word caption rendering; without them you only get segment-level timing.
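
With token-level timestamps, each caption covers roughly one word, so rendering usually involves grouping words into display lines. A rough sketch of one way to do that is shown here, assuming a minimal Caption shape and that each token carries its own leading space. groupIntoLines, maxGapMs, and maxChars are hypothetical names chosen for illustration.

```typescript
// Assumed minimal caption shape; the real objects may carry more fields.
type Caption = { text: string; startMs: number; endMs: number };

// Merge word-level captions into lines. A new line starts when the pause
// before a word exceeds maxGapMs or the line would grow past maxChars.
function groupIntoLines(
  words: Caption[],
  maxGapMs: number,
  maxChars: number,
): Caption[] {
  const lines: Caption[] = [];
  for (const word of words) {
    const last = lines[lines.length - 1];
    const fits =
      last !== undefined &&
      word.startMs - last.endMs <= maxGapMs &&
      (last.text + word.text).length <= maxChars;
    if (last !== undefined && fits) {
      // Extend the current line with this word.
      last.text += word.text;
      last.endMs = word.endMs;
    } else {
      // Copy the word so the input array is never mutated.
      lines.push({ ...word });
    }
  }
  return lines;
}
```

Calling groupIntoLines(words, 500, 40) would break lines at pauses longer than half a second or at 40 characters, whichever comes first. Both thresholds are arbitrary starting points.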

Frequently Asked Questions

What audio formats does Remotion Whisper transcription support?

The transcribe function accepts WAV files, which need a 16 kHz sample rate. If your audio is in MP3 or AAC format, convert it to WAV first using ffmpeg. Remotion's own audio extraction tools can also produce WAV output from video files.
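
The conversion step can be sketched as a small helper that builds the ffmpeg arguments. This assumes ffmpeg is on your PATH and that Whisper.cpp expects 16 kHz WAV input. wavConversionArgs is a hypothetical name, and the file paths are placeholders.

```typescript
// Build ffmpeg arguments that resample any input to a 16 kHz WAV file.
//   -i   input file
//   -ar  audio sample rate in Hz
//   -y   overwrite the output file if it already exists
function wavConversionArgs(input: string, output: string): string[] {
  return ['-i', input, '-ar', '16000', output, '-y'];
}

// Placeholder paths for illustration.
const ffmpegArgs = wavConversionArgs(
  'src/audio/narration.mp3',
  'src/audio/narration.wav',
);
```

The resulting array can be passed to Node's execFileSync('ffmpeg', ffmpegArgs) from node:child_process, which avoids shell quoting problems with paths that contain spaces.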

Can I use a GPU to speed up transcription?

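Whisper.cpp supports GPU acceleration via Metal on macOS and CUDA on machines with NVIDIA GPUs. Whether the build produced by installWhisperCpp actually uses the GPU depends on your platform and toolchain, so confirm it on your machine rather than assuming. When it is active, GPU transcription is significantly faster for longer audio files.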

How accurate are the generated captions?

Accuracy depends on the model size. The medium.en model provides good accuracy for clear English speech. The large model improves accuracy for accented speech and noisy audio but takes longer to process.

Do I need an API key for transcription?

No. Whisper.cpp runs entirely locally on your machine. There are no API calls, no rate limits, and no costs beyond compute time. Your audio never leaves your system.

Can I edit captions after transcription?

Yes. The toCaptions function returns an object whose captions property is an array of timed caption objects. You can filter, merge, or modify entries programmatically before passing them to your Remotion composition for rendering.
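
As an illustration of that kind of post-processing, the sketch below drops blank entries and shifts all timings by a fixed offset, for example when the narration starts partway into the video. The Caption shape is an assumed minimal version of the real objects, and cleanCaptions is a hypothetical helper.

```typescript
// Assumed minimal caption shape; the real objects may carry more fields.
type Caption = { text: string; startMs: number; endMs: number };

// Remove captions with no visible text and shift the rest by offsetMs.
// Returns new objects and leaves the input array untouched.
function cleanCaptions(captions: Caption[], offsetMs: number): Caption[] {
  return captions
    .filter((c) => c.text.trim().length > 0)
    .map((c) => ({
      ...c,
      startMs: c.startMs + offsetMs,
      endMs: c.endMs + offsetMs,
    }));
}
```

Because the spread keeps any extra fields on each entry, the same helper should work on richer caption objects as long as they include these three properties.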


Source & Thanks

Created by Remotion. Licensed under MIT. remotion-dev/skills — Rule: transcribe-captions

Part of the Remotion AI Skill collection on TokRepo.
