TOKREPO · ARSENAL

Stable

AI Video Generation Pack

Ten picks for the creator and dev generating video from text or image: open-source models (CogVideo, Open-Sora, AnimateDiff, Diffusers), the commercial-API bridge for Sora / Veo / Runway / Pika, camera and motion control via ControlNet and Motion Canvas, Real-ESRGAN upscale, and the assembly editor that stitches it all together.

10 assets

About this pack

What this pack is for

AI video generation is no longer a single model. It is a pipeline: pick a model, write a prompt, condition on a keyframe, control camera or motion, upscale, then cut. This pack collects ten picks, opinionated rather than exhaustive, that together cover every stage, so you can move from a blank prompt to a watchable clip without stitching together docs from ten different repos.

The stack is split deliberately: open-source models for local control and zero per-second cost, plus a commercial-API bridge for the moments when you genuinely need Sora-class quality and are willing to pay for it. Most working pipelines end up with both.

Install in this order

1. Choose a model

CogVideo (#2458) — text-to-video and image-to-video. Mature open-source baseline from THUDM. Runs on a single high-VRAM GPU. Start here if you want a reproducible local pipeline.
Open-Sora (#109) — the open-source effort to replicate Sora-style results. More capable than CogVideo on long shots and motion coherence, heavier on hardware.
Diffusers (#111) — the Hugging Face library that loads CogVideoX, Stable Video Diffusion, HunyuanVideo, Wan, Mochi, and most new open models soon after they ship. If you want a single Python interface instead of one repo per model, install Diffusers first and treat the rest as weights.
Together AI Video Generation Skill (#777) — the commercial-API bridge. Use this when you need Sora / Veo / Runway / Pika output quality and don't want to manage GPU infra. Pay per second, ship the same day.
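As a rough illustration of the single-interface approach, the sketch below loads CogVideoX through Diffusers. It assumes `torch`, `diffusers`, and `accelerate` (which CPU offload relies on) are installed and that the `THUDM/CogVideoX-5b` weights can be downloaded from the Hugging Face Hub. The `generate_clip` helper, the `DEFAULTS` values, and the output path are illustrative choices, not settings taken from this pack.

```python
# Hypothetical defaults for illustration; check the model card before relying on them.
DEFAULTS = {
    "model_id": "THUDM/CogVideoX-5b",
    "num_frames": 49,
    "num_inference_steps": 50,
    "guidance_scale": 6.0,
    "fps": 8,
}


def generate_clip(prompt: str, out_path: str = "clip.mp4", **overrides) -> str:
    """Generate one text-to-video clip and return the path it was written to."""
    # Heavy imports stay inside the function so this module imports without a GPU stack.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    cfg = {**DEFAULTS, **overrides}
    pipe = CogVideoXPipeline.from_pretrained(cfg["model_id"], torch_dtype=torch.bfloat16)
    # Offload idle sub-models to CPU and tile the VAE decode to lower peak VRAM.
    pipe.enable_model_cpu_offload()
    pipe.vae.enable_tiling()

    result = pipe(
        prompt=prompt,
        num_frames=cfg["num_frames"],
        num_inference_steps=cfg["num_inference_steps"],
        guidance_scale=cfg["guidance_scale"],
    )
    # frames[0] holds the frames of the first (and only) generated video.
    export_to_video(result.frames[0], out_path, fps=cfg["fps"])
    return out_path
```

A call such as `generate_clip("a red fox trotting through snow, slow dolly forward")` would download the weights on first use. Swapping `model_id` for another CogVideoX checkpoint is the "treat the rest as weights" idea in practice; other model families use their own pipeline classes.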

2. Write the prompt

Same rules as image generation, plus motion verbs: "dolly forward", "orbit left", "static lock-off". Models trained on cinematic data respond to film vocabulary.
Keep the subject in the first 12 tokens. Most text-to-video models still front-load attention.
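One way to keep these two rules honest is to assemble prompts from named parts. The sketch below is a hypothetical helper, not part of any tool in this pack. It places the subject first and the camera verb after the action, and uses whitespace-separated words as a crude stand-in for tokens, since real tokenizers split text differently.

```python
def build_prompt(subject: str, action: str = "", camera: str = "", style: str = "") -> str:
    """Join prompt parts with the subject first; empty parts are skipped."""
    parts = [subject, action, camera, style]
    return ", ".join(p.strip() for p in parts if p and p.strip())


def subject_in_front(prompt: str, subject: str, budget: int = 12) -> bool:
    """Rough check: does the subject appear within the first `budget` words?"""
    # Words approximate tokens here; punctuation is stripped before comparing.
    head = [w.strip(",.;:") for w in prompt.lower().split()][:budget]
    subj = subject.lower().split()
    return any(head[i:i + len(subj)] == subj for i in range(len(head) - len(subj) + 1))


prompt = build_prompt(
    subject="a red fox",
    action="trotting through snow",
    camera="slow dolly forward",
    style="35mm film",
)
```

Here `prompt` leads with the subject and carries the motion verb as a separate clause, which makes it easy to vary the camera move while holding everything else fixed.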

3. Condition on a keyframe

ControlNet (#4664) — feed a pose / depth / canny image to lock down composition. Use it when you have a specific framing in mind and don't want the model to reinvent the shot.
For image-to-video runs, the input image is the keyframe — no ControlNet needed.

4. Add motion

AnimateDiff (#2463) — plug-and-play motion module for Stable Diffusion–family models. Animates an existing image-gen pipeline without retraining. Great for stylized or anime content.
Motion Canvas (#4618) — when the motion you want is deterministic (UI demos, data viz, programmatic camera moves), don't fight a diffusion model — write the motion in code.

5. Upscale

Real-ESRGAN (#2495) — practical 4× super-resolution that handles video. Most generation models output 512×512 or 720p; Real-ESRGAN is how you get those clips to 4K-class resolution. Run it after generation, as the last processing step before the final edit and encode.
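The resolution arithmetic is worth checking before committing to a scale factor. The sketch below is plain arithmetic with hypothetical helper names; it assumes UHD 4K means 3840×2160 and says nothing about Real-ESRGAN's own options.

```python
import math


def upscaled_size(width: int, height: int, factor: int = 4) -> tuple[int, int]:
    """Output dimensions after an integer upscale."""
    return width * factor, height * factor


def factor_for_target(width: int, height: int,
                      target_w: int = 3840, target_h: int = 2160) -> int:
    """Smallest whole-number factor whose output covers the target frame."""
    return max(math.ceil(target_w / width), math.ceil(target_h / height))


square = upscaled_size(512, 512)         # (2048, 2048)
hd = upscaled_size(1280, 720)            # (5120, 2880)
needed_hd = factor_for_target(1280, 720)  # 3
```

A 4× pass on a 512×512 clip lands at 2048×2048, short of UHD 4K, while 4× on 720p overshoots to 5120×2880 and would be scaled down at encode time. A 3× factor is already enough to cover 3840×2160 from 720p.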

6. Assemble

OpenCut (#4027) — open-source AI video editor. Trim, splice, color-match generated clips. Avoids the export round-trip to a closed NLE.
Generative Media Skills (#3602) — the muapi + npx skill installer that unifies a dozen commercial generation APIs behind one CLI. Useful when an agent needs to call "generate a 5-second clip" without picking a vendor every time.

How they fit together

Prompt ─► CogVideo / Open-Sora / Diffusers / Together API
               │
               ▼
        Raw 720p clips
               │
   ControlNet ─┤ (optional: lock framing)
   AnimateDiff ┤ (optional: add motion to image)
 Motion Canvas ┤ (optional: deterministic moves)
               ▼
        Real-ESRGAN  ─►  4K upscaled clips
               │
               ▼
            OpenCut
               │
               ▼
         Final cut (mp4)

The split that matters: diffusion models hallucinate motion; code-based tools dictate it. Use diffusion (CogVideo, Open-Sora, AnimateDiff) when you want surprises and atmosphere. Use Motion Canvas when the camera path is non-negotiable and the audience will notice drift.

Tradeoffs you'll hit

Local vs API — local generation is free per second but costs you GPU time and tuning. API generation is fast and high-quality but priced per second and locked behind quotas. Run local for iteration, API for the hero shots.
CogVideo vs Open-Sora — CogVideo is easier to set up and needs less VRAM. Open-Sora produces longer, more coherent shots when it works. Start with CogVideo; graduate when the gap matters.
AnimateDiff vs native video models — AnimateDiff bolts motion onto SD checkpoints (huge ecosystem of styles, mediocre coherence). Native video models train end-to-end on video (cleaner motion, fewer styles). Pick by content: stylized → AnimateDiff, realistic → CogVideo / Open-Sora.
Real-ESRGAN vs paid upscalers — Real-ESRGAN is free and good enough for most web delivery. Topaz Video AI and similar paid tools are sharper on faces but cost real money. Ship Real-ESRGAN first; upgrade only if reviewers complain.
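When the GPU is rented rather than owned, the local-vs-API tradeoff can be put into rough numbers. The sketch below is a back-of-envelope comparison with hypothetical function names; every price and timing in the example call is a placeholder, not a quote from any provider.

```python
def local_cost_per_output_second(gpu_usd_per_hour: float,
                                 compute_seconds_per_output_second: float) -> float:
    """Rental cost of producing one second of finished video locally."""
    return gpu_usd_per_hour / 3600 * compute_seconds_per_output_second


def cheaper_option(gpu_usd_per_hour: float,
                   compute_seconds_per_output_second: float,
                   api_usd_per_output_second: float) -> str:
    """Return "local" or "api", whichever costs less per output second."""
    local = local_cost_per_output_second(gpu_usd_per_hour,
                                         compute_seconds_per_output_second)
    return "local" if local < api_usd_per_output_second else "api"


# Placeholder inputs: a 1.80 USD/hour GPU that needs 60 s of compute per output second.
example_local = local_cost_per_output_second(1.80, 60)
example_choice = cheaper_option(1.80, 60, api_usd_per_output_second=0.10)
```

The comparison ignores tuning time and failed generations, both of which push local costs up, so treat the result as a floor rather than an estimate.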

Common pitfalls

VRAM math — CogVideoX-5B needs ~24 GB for 720p generation. Open-Sora can demand 40 GB+. Read the model card before renting a GPU.
Prompt drift across frames — long shots from any diffusion model drift in identity and lighting after ~3 seconds. Generate in 3-second chunks and stitch in OpenCut rather than fighting the model for a 10-second take.
Audio is separate — none of these tools generate matching audio. Plan a separate TTS / SFX pass; the assembly happens in OpenCut.
Commercial-API terms — every commercial generation provider has different rules on commercial reuse, training opt-out, and watermarking. Read the TOS before publishing client work.
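The chunking advice for prompt drift can be planned up front. The sketch below splits a long take into consecutive chunks and reports how many frames each needs. `plan_chunks` is a hypothetical helper, and the default of 8 frames per second is an assumption to replace with your model's actual frame rate.

```python
def plan_chunks(total_seconds: float, chunk_seconds: float = 3.0,
                fps: int = 8) -> list[tuple[float, float, int]]:
    """Split a take into consecutive (start, end, frame_count) chunks."""
    if total_seconds <= 0 or chunk_seconds <= 0:
        raise ValueError("durations must be positive")
    chunks = []
    start = 0.0
    while start < total_seconds:
        # The final chunk is shortened so the plan never overruns the take.
        end = min(start + chunk_seconds, total_seconds)
        chunks.append((start, end, round((end - start) * fps)))
        start = end
    return chunks


plan = plan_chunks(10)
# [(0.0, 3.0, 24), (3.0, 6.0, 24), (6.0, 9.0, 24), (9.0, 10.0, 8)]
```

Each tuple maps to one generation run. Reusing the last frame of a chunk as the keyframe for the next image-to-video run is one way to keep identity and lighting closer across the cuts.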

INSTALL · ONE COMMAND

$ tokrepo install pack/ai-video-generation-pack

hand it to your agent — or paste it in your terminal

What's inside

10 assets in this pack

Skill#01

CogVideo — Text and Image to Video Generation

An open-source video generation framework from Zhipu AI supporting text-to-video and image-to-video with CogVideoX models. Generates high-quality clips up to 6 seconds.

by Script Depot·354 views

$ tokrepo install cogvideo-text-image-video-generation-7e2317bb

Skill#02

Open-Sora — Open-Source Text-to-Video Generation

Open-source alternative to Sora, developed by HPC-AI Tech. Generates videos from text prompts with an 11B-parameter model. Apache 2.0 licensed. 28,800+ stars.

by Script Depot·370 views

$ tokrepo install open-sora-open-source-text-video-generation-ff30d766

Skill#03

Together AI Video Generation Skill for Claude Code

Skill that teaches Claude Code how to use Together AI's video generation API. Covers text-to-video, image-to-video, and keyframe control for AI-powered video creation workflows.

by Together AI·244 views

$ tokrepo install together-ai-video-generation-skill-claude-code-d848ded0

Skill#04

Diffusers — Universal Video & Image Generation Hub

Hugging Face's diffusion model library. Run CogVideoX, AnimateDiff, Stable Video Diffusion, and 50+ video/image models with a unified API. 33,200+ stars.

by Script Depot·372 views

$ tokrepo install diffusers-universal-video-image-generation-hub-4ef1950f

Skill#05

AnimateDiff — Plug-and-Play Animation for Diffusion Models

A plug-and-play motion module that turns community text-to-image Stable Diffusion models into animation generators without additional training. ICLR 2024 Spotlight paper.

by AI Open Source·220 views

$ tokrepo install animatediff-plug-play-animation-diffusion-models-04d7fee0

Skill#06

Real-ESRGAN — Practical Image and Video Super-Resolution

General-purpose image and video restoration tool that trains on pure synthetic data to handle real-world degradations including blur, noise, JPEG compression, and resize artifacts.

by AI Open Source·164 views

$ tokrepo install real-esrgan-practical-image-video-super-resolution-73d0fc65

Skill#07

ControlNet — Add Spatial Control to Diffusion Models

ControlNet lets you add precise spatial conditioning such as edge maps, depth, and pose to Stable Diffusion, giving fine-grained control over AI image generation.

by AI Open Source·132 views

$ tokrepo install controlnet-add-spatial-control-diffusion-models-74fc6ef5

Skill#08

Motion Canvas — Create Animated Videos with Code

A TypeScript library and editor for creating publication-quality animated videos programmatically, combining the precision of code with a visual preview workflow.

by AI Open Source·159 views

$ tokrepo install motion-canvas-create-animated-videos-code-1a626bf6

Skill#09

OpenCut — Open-Source AI Video Editor

An open-source alternative to CapCut for video editing with AI-assisted features, timeline editing, and professional export options.

by Script Depot·274 views

$ tokrepo install opencut-open-source-ai-video-editor-f40e235a

Skill#10

Generative Media Skills — muapi + npx skills add

Generative Media Skills is a multi-modal skill library: run image/video recipes via muapi-cli, installable into Claude Code/Cursor with `npx skills add`.

by Skill Factory·231 views

$ tokrepo install generative-media-skills-muapi-npx-skills-add

FAQ

Frequently asked questions

Which model should I start with — CogVideo or Open-Sora?

Start with CogVideo unless you already have a 40+ GB GPU and a reason to push for longer shots. CogVideo runs on a single 24 GB card, the documentation is more complete, and the failure modes are well understood. Move to Open-Sora when CogVideo's clip-length ceiling is the bottleneck — not before.

Do I actually need both an open-source model and a commercial API skill?

Most working pipelines end up with both. Local models give you free iteration, deterministic seeds, and no per-second cost — great for testing 50 variants of a prompt. Commercial APIs (via the Together AI skill or Generative Media Skills) give you Sora / Veo / Runway-class output for the final hero shot. The split is iteration vs delivery.

How do I control the camera, not just the subject?

Two paths. For diffusion models, write camera verbs into the prompt ("slow dolly forward", "static lock-off", "orbit 90 degrees") — models trained on cinematic captions respond to film vocabulary. When you need exact framing or trajectory, switch to Motion Canvas and program the move in code, then composite the diffusion output into the framed shot.

Why ControlNet for video — isn't it for images?

ControlNet conditions a diffusion step on a structural signal — pose, depth, edges. When that step happens to be the first frame of a video generation, the entire clip inherits that composition. It is the cleanest way to keep generated video on-model when you have a specific framing in mind, especially for product or character shots where you cannot afford the model to reinvent the layout.

Can a single GPU machine actually run this whole pipeline?

Yes, if you sequence the stages instead of running them in parallel. Generate with CogVideo (24 GB), unload, then run Real-ESRGAN (~6 GB), then OpenCut on CPU. The bottleneck is the generation step; everything downstream is comparatively cheap. If you only have a 16 GB card, drop to CogVideo's smaller variant or call the commercial API for generation and keep upscale + edit local.
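The reason sequencing works is that peak memory becomes the largest single stage rather than the sum of all stages. The sketch below makes that explicit using the approximate figures quoted in this answer; the stage list and function names are illustrative.

```python
# Approximate VRAM needs in GB, taken from the figures quoted above; OpenCut runs on CPU.
STAGES = [
    ("CogVideo generation", 24),
    ("Real-ESRGAN upscale", 6),
    ("OpenCut edit", 0),
]


def peak_vram(stages: list[tuple[str, int]], sequential: bool = True) -> int:
    """Peak VRAM: the largest stage if run one at a time, the sum if run together."""
    needs = [gb for _, gb in stages]
    return max(needs) if sequential else sum(needs)


def fits(stages: list[tuple[str, int]], card_gb: int, sequential: bool = True) -> bool:
    """True when the pipeline's peak VRAM fits on a card of the given size."""
    return peak_vram(stages, sequential) <= card_gb


sequential_peak = peak_vram(STAGES)                  # 24
parallel_peak = peak_vram(STAGES, sequential=False)  # 30
```

On a 24 GB card the sequential plan fits and the parallel one does not. On a 16 GB card neither fits, which is why the fallback is a smaller model or the commercial API for the generation stage.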
