Skills · Mar 29, 2026 · 2 min read

Remotion Captions & Subtitles — AI-Powered Video Subtitles

AI skill for generating and rendering captions in Remotion videos. Supports transcription, word-level timing, and styled subtitle export.

TL;DR
Generate and render word-level captions in Remotion videos with AI transcription.
§01

What it is

Remotion Captions is an AI skill for generating and rendering subtitles in Remotion, the React-based video framework. It covers transcription with word-level timing, styled subtitle rendering, and export. You feed it audio, and it produces timed captions that render as animated text overlays in your video.

This skill targets developers building programmatic videos with Remotion who need automated captioning. It is part of the official Remotion Agent Skill set, designed to work within AI-assisted video production workflows.

§02

Why it saves time or tokens

Manually timing subtitles frame-by-frame is one of the most tedious tasks in video production. This skill automates the entire pipeline: transcribe audio, align words to timestamps, and render styled captions. For AI agents generating videos programmatically, this reduces a multi-hour manual process to a single function call.

§03

How to use

  1. Set up a Remotion project with the @remotion/captions package installed
  2. Provide the audio source for transcription (supports Whisper and other transcription backends)
  3. Render the timed captions as subtitles inside your Remotion composition
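Step 3 usually involves mapping millisecond timestamps onto frames. The following is a minimal sketch of that conversion, not part of the skill itself; the `TimedCaption` and `FrameRange` types and the `toFrameRanges` helper are hypothetical names chosen for illustration.

```typescript
// Hypothetical shape mirroring the fields used in this article's example.
type TimedCaption = { text: string; startMs: number; endMs: number };

// Hypothetical result shape: a start frame plus a length in frames.
type FrameRange = { text: string; from: number; durationInFrames: number };

// Convert millisecond timings to frame ranges at a given frame rate.
function toFrameRanges(captions: TimedCaption[], fps: number): FrameRange[] {
  return captions.map((c) => {
    const from = Math.round((c.startMs / 1000) * fps);
    const end = Math.round((c.endMs / 1000) * fps);
    // Keep at least one frame so very short captions still appear.
    return { text: c.text, from, durationInFrames: Math.max(1, end - from) };
  });
}

const ranges = toFrameRanges(
  [
    { text: 'Hello world', startMs: 0, endMs: 1500 },
    { text: 'This is automated', startMs: 1600, endMs: 3000 },
  ],
  30,
);
```

At 30 fps the first caption covers frames 0 to 44 and the second starts at frame 48. Values like these could be passed to whatever timing mechanism the composition uses.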
§04

Example

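import type { Caption } from '@remotion/captions';
import { AbsoluteFill, useCurrentFrame, useVideoConfig } from 'remotion';

// Sample data in the Caption shape; real data would come from a transcription step.
// Check the field list against the @remotion/captions version you have installed.
const captions: Caption[] = [
  { text: 'Hello world', startMs: 0, endMs: 1500, timestampMs: null, confidence: null },
  { text: 'This is automated', startMs: 1600, endMs: 3000, timestampMs: null, confidence: null },
];

export const MyVideo = () => {
  const frame = useCurrentFrame();
  const { fps } = useVideoConfig();
  // Convert the current frame to milliseconds and pick the caption on screen.
  const timeMs = (frame / fps) * 1000;
  const active = captions.find((c) => timeMs >= c.startMs && timeMs < c.endMs);

  return (
    <AbsoluteFill style={{ justifyContent: 'flex-end', alignItems: 'center' }}>
      {active ? (
        <div
          style={{
            fontSize: 48,
            color: 'white',
            textShadow: '2px 2px 4px black',
          }}
        >
          {active.text}
        </div>
      ) : null}
    </AbsoluteFill>
  );
};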
Feature            Description
Word-level timing  Each word has start/end timestamps
Custom styling     CSS-based font, color, shadow
Animation          Fade, pop, highlight effects
Multi-language     Works with any transcription language
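To make word-level timing and the highlight effect concrete, here is a small sketch that finds the word being spoken at a given time. The `Word` type and both helper functions are hypothetical; asterisks stand in for the styling a real React component would apply to the active word.

```typescript
// Hypothetical word shape: one entry per spoken word.
type Word = { text: string; startMs: number; endMs: number };

// Return the index of the word being spoken at timeMs, or -1 if none.
function activeWordIndex(words: Word[], timeMs: number): number {
  return words.findIndex((w) => timeMs >= w.startMs && timeMs < w.endMs);
}

// Mark the active word with asterisks; a real component would style a <span>.
function renderLine(words: Word[], timeMs: number): string {
  const active = activeWordIndex(words, timeMs);
  return words.map((w, i) => (i === active ? `*${w.text}*` : w.text)).join(' ');
}

const line: Word[] = [
  { text: 'Hello', startMs: 0, endMs: 700 },
  { text: 'world', startMs: 700, endMs: 1500 },
];

console.log(renderLine(line, 800)); // prints "Hello *world*"
```

Treating each interval as start-inclusive and end-exclusive means that two adjacent words are never both highlighted on the same frame.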
§05

Common pitfalls

  • Transcription accuracy depends on audio quality; noisy backgrounds or multiple speakers reduce word-level timing precision
  • Custom fonts must be loaded before rendering; use the Remotion font loading system or captions will render in the default system font
  • Long videos with many caption segments can slow down the Remotion preview; use composition splitting for videos over 5 minutes
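One way to read "composition splitting" is to give each segment only the captions that fall inside its time window. The sketch below assumes that interpretation; `TimedCaption` and `captionsForSegment` are hypothetical names, and timestamps are re-based so each segment starts at 0 ms.

```typescript
// Hypothetical shape mirroring the fields used in this article's example.
type TimedCaption = { text: string; startMs: number; endMs: number };

// Keep captions overlapping [segStartMs, segEndMs) and shift them to segment-local time.
function captionsForSegment(
  captions: TimedCaption[],
  segStartMs: number,
  segEndMs: number,
): TimedCaption[] {
  return captions
    .filter((c) => c.endMs > segStartMs && c.startMs < segEndMs)
    .map((c) => ({
      text: c.text,
      // Clip to the window, then subtract the segment start.
      startMs: Math.max(c.startMs, segStartMs) - segStartMs,
      endMs: Math.min(c.endMs, segEndMs) - segStartMs,
    }));
}

const segment = captionsForSegment(
  [
    { text: 'Hello world', startMs: 0, endMs: 1500 },
    { text: 'This is automated', startMs: 1600, endMs: 3000 },
    { text: 'Third line', startMs: 4000, endMs: 5000 },
  ],
  2000,
  4500,
);
```

Captions that straddle a segment boundary are clipped rather than dropped, so they appear in both neighbouring segments.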

Frequently Asked Questions

What transcription services does Remotion Captions support?

Remotion Captions works with transcription output from Whisper, Deepgram, and other services that produce word-level timestamps. The caption system consumes a JSON array of timed words, so any transcription backend that outputs start and end times per word is compatible.
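As an illustration of that compatibility, a thin adapter can map a backend's word list onto millisecond-based captions. This sketch assumes a hypothetical backend that reports `word`, `start`, and `end` in seconds; `BackendWord`, `TimedCaption`, and `toCaptions` are illustrative names, and real field names vary by service.

```typescript
// Hypothetical backend output: one entry per word, times in seconds.
type BackendWord = { word: string; start: number; end: number };

// Target shape mirroring the fields used in this article's example.
type TimedCaption = { text: string; startMs: number; endMs: number };

// Convert second-based word timings to millisecond-based captions.
function toCaptions(words: BackendWord[]): TimedCaption[] {
  return words.map((w) => ({
    text: w.word,
    startMs: Math.round(w.start * 1000),
    endMs: Math.round(w.end * 1000),
  }));
}

const converted = toCaptions([
  { word: 'Hello', start: 0.12, end: 0.5 },
  { word: 'world', start: 0.5, end: 1.25 },
]);
```

Rounding to whole milliseconds avoids floating-point noise such as 120.00000000000001 leaking into the caption data.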

Can I customize the subtitle appearance?

Yes. Captions are rendered as React components with full CSS styling support. You control font size, color, background, shadow, position, and animation. Since Remotion uses React, you can build custom caption components with any visual effect that CSS and React support.
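A fade effect, for example, reduces to computing an opacity from the current time. The following is a sketch under the assumption of a simple linear fade; `fadeOpacity` and its `fadeMs` parameter are hypothetical, and the result would be assigned to the CSS `opacity` of a custom caption component.

```typescript
// Linear fade-in and fade-out for a caption visible during [startMs, endMs).
// fadeMs must be greater than 0.
function fadeOpacity(
  timeMs: number,
  startMs: number,
  endMs: number,
  fadeMs = 200,
): number {
  // Outside its interval the caption is fully hidden.
  if (timeMs < startMs || timeMs >= endMs) return 0;
  const fadeIn = (timeMs - startMs) / fadeMs;
  const fadeOut = (endMs - timeMs) / fadeMs;
  // The smaller ramp wins; clamp the result to the 0..1 range.
  return Math.max(0, Math.min(1, fadeIn, fadeOut));
}

console.log(fadeOpacity(100, 0, 1500)); // prints 0.5
```

Because the function depends only on time, it gives the same result for a given frame on every render, which suits Remotion's frame-by-frame rendering model.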

Does Remotion Captions work with AI video generation agents?

Yes. This skill is designed for the Remotion Agent Skill ecosystem. An AI agent can generate a video script, synthesize speech, transcribe it for timing, and render captions automatically. The entire pipeline can run without human intervention.

How accurate is the word-level timing?

Accuracy depends on the transcription backend. On clean audio, Whisper typically places word boundaries within about 50-100 ms of their true position. Background noise, music, or overlapping speech reduces accuracy. Post-processing alignment tools can improve timing for difficult audio.
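Short of a full alignment tool, a simple cleanup pass can remove the most visible timing glitches. The sketch below is not forced alignment; it only sorts words, enforces a minimum duration, and trims overlaps. `Word`, `cleanTimings`, and the 50 ms default are assumptions made for illustration.

```typescript
// Hypothetical word shape: one entry per spoken word.
type Word = { text: string; startMs: number; endMs: number };

// Sort by start time, stretch very short words, and trim overlaps with the next word.
function cleanTimings(words: Word[], minDurationMs = 50): Word[] {
  const sorted = [...words].sort((a, b) => a.startMs - b.startMs);
  return sorted.map((w, i) => {
    const next = sorted[i + 1];
    let endMs = Math.max(w.endMs, w.startMs + minDurationMs);
    // Trimming to the next word's start takes priority over the minimum duration.
    if (next && endMs > next.startMs) endMs = next.startMs;
    return { ...w, endMs };
  });
}

const cleaned = cleanTimings([
  { text: 'Hello', startMs: 0, endMs: 600 },
  { text: 'world', startMs: 500, endMs: 510 },
]);
```

Here "Hello" is trimmed to end at 500 ms where "world" begins, and "world" is stretched from 10 ms to the 50 ms minimum.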

Can I add captions in multiple languages?

Yes. You can render multiple caption tracks by providing different transcription outputs for each language. Remotion supports rendering multiple subtitle components simultaneously, so you can show bilingual captions or let users switch between languages in the final video.


Source & Thanks

Created by Remotion. Licensed under MIT. remotion-dev/skills — Subtitles rule
