AI Video Generation Pack
Ten picks for the creator and dev generating video from text or image: open-source models (CogVideo, Open-Sora, AnimateDiff, Diffusers), the commercial-API bridge for Sora / Veo / Runway / Pika, camera and motion control via ControlNet and Motion Canvas, Real-ESRGAN upscale, and the assembly editor that stitches it all together.
What this pack is for
AI video generation is no longer a single-model job. It is a pipeline: pick a model, write a prompt, condition on a keyframe, control camera or motion, upscale, then cut. This pack assembles the ten picks that cover each stage (opinionated, not exhaustive) so you can move from a blank prompt to a watchable clip without stitching together docs from ten different repos.
The stack is split deliberately: open-source models for local control and zero per-second cost, plus a commercial-API bridge for the moments when you genuinely need Sora-class quality and are willing to pay for it. Most working pipelines end up with both.
Install in this order
1. Choose a model
- CogVideo (#2458) — text-to-video and image-to-video. Mature open-source baseline from THUDM. Runs on a single high-VRAM GPU. Start here if you want a reproducible local pipeline.
- Open-Sora (#109) — the open-source effort to replicate Sora-style results. More capable than CogVideo on long shots and motion coherence, heavier on hardware.
- Diffusers (#111) — the Hugging Face library that loads CogVideo, Stable Video Diffusion, HunyuanVideo, Wan, Mochi, and most new open models soon after they ship. If you want a single Python interface instead of one repo per model, install Diffusers first and treat the rest as weights.
- Together AI Video Generation Skill (#777) — the commercial-API bridge. Use this when you need Sora / Veo / Runway / Pika output quality and don't want to manage GPU infra. Pay per second, ship the same day.
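To make the "single Python interface" point concrete, here is a minimal sketch of a text-to-video run through Diffusers. The checkpoint ID `THUDM/CogVideoX-2b`, the frame count, the step count, and the output path are assumptions picked for illustration, not settings from this pack; check the model card for what your checkpoint expects. The heavy imports live inside the function so the file still imports on a machine without a GPU.

```python
def clip_seconds(num_frames, fps):
    """Length of the exported clip in seconds."""
    return num_frames / fps


def generate_clip(prompt, out_path="clip.mp4",
                  model_id="THUDM/CogVideoX-2b",  # assumed checkpoint ID
                  num_frames=49, fps=8):
    """Generate one clip with Diffusers and write it to out_path."""
    # Imported lazily: these need torch, diffusers, and downloaded weights.
    import torch
    from diffusers import CogVideoXPipeline
    from diffusers.utils import export_to_video

    pipe = CogVideoXPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe.enable_model_cpu_offload()  # trade speed for lower peak VRAM
    pipe.vae.enable_tiling()         # decode frames in tiles to save memory

    frames = pipe(
        prompt=prompt,
        num_frames=num_frames,
        num_inference_steps=50,  # assumed value, tune per model card
        guidance_scale=6.0,      # assumed value, tune per model card
    ).frames[0]
    export_to_video(frames, out_path, fps=fps)
    return out_path
```

Calling `generate_clip("a red fox running through snow, slow dolly forward")` would write `clip.mp4`. Swapping models is then mostly a matter of changing the pipeline class and `model_id`, which is the reason to install Diffusers first.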
2. Write the prompt
- Same rules as image generation, plus motion verbs: "dolly forward", "orbit left", "static lock-off". Models trained on cinematic data respond to film vocabulary.
- Keep the subject in the first 12 tokens or so. As a rule of thumb, most text-to-video models still weight the start of the prompt most heavily.
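Both rules are easy to enforce in code. The sketch below builds a prompt with the subject first and the camera move after it, then checks that the subject lands inside the first 12 words. The helper names are hypothetical, and whitespace-separated words are only a rough stand-in for model tokens.

```python
def build_prompt(subject, action="", camera="", style=""):
    """Join the prompt parts with the subject first, skipping empty parts."""
    parts = [subject, action, camera, style]
    return ", ".join(p.strip() for p in parts if p and p.strip())


def subject_within_budget(prompt, subject, budget=12):
    """True if the subject phrase ends within the first `budget` words.

    Words are a rough proxy for tokens; a real tokenizer will count differently.
    """
    words = [w.strip(",.") for w in prompt.lower().split()]
    target = subject.lower().split()
    n = len(target)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            return i + n <= budget
    return False  # subject phrase not found at all


prompt = build_prompt(
    subject="a red fox",
    action="running through snow",
    camera="slow dolly forward",
    style="overcast light",
)
```

Here `prompt` comes out as `a red fox, running through snow, slow dolly forward, overcast light`, and the budget check passes because the subject occupies the first three words.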
3. Condition on a keyframe
- ControlNet (#4664) — feed a pose / depth / canny image to lock down composition. Use it when you have a specific framing in mind and don't want the model to reinvent the shot.
- For image-to-video runs, the input image is the keyframe — no ControlNet needed.
4. Add motion
- AnimateDiff (#2463) — plug-and-play motion module for Stable Diffusion–family models. Animates an existing image-gen pipeline without retraining. Great for stylized or anime content.
- Motion Canvas (#4618) — when the motion you want is deterministic (UI demos, data viz, programmatic camera moves), don't fight a diffusion model — write the motion in code.
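"Write the motion in code" means the camera path is a pure function of the frame index, so every render is identical. The sketch below shows the idea in Python with a smoothstep ease. It is not the Motion Canvas API (which is TypeScript); the function names and the `(x, y, zoom)` layout are assumptions for illustration.

```python
def ease_in_out(t):
    """Smoothstep easing: 0 at t=0, 1 at t=1, zero slope at both ends."""
    return t * t * (3.0 - 2.0 * t)


def camera_path(start, end, num_frames):
    """Return one (x, y, zoom) tuple per frame, eased from start to end."""
    if num_frames < 2:
        raise ValueError("need at least two frames to define a move")
    path = []
    for i in range(num_frames):
        t = ease_in_out(i / (num_frames - 1))
        path.append(tuple(s + (e - s) * t for s, e in zip(start, end)))
    return path


# A five-frame push-in: pan 100 px right while zooming from 1x to 2x.
path = camera_path(start=(0.0, 0.0, 1.0), end=(100.0, 0.0, 2.0), num_frames=5)
```

The same inputs always give the same path, which is exactly what a diffusion model cannot promise.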
5. Upscale
- Real-ESRGAN (#2495) — practical 4× super-resolution that handles video. Most generation models output 512×512 or 720p; Real-ESRGAN is how you ship 4K. Run it as the last step before encode.
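It is worth doing the scale arithmetic before you queue an upscale. The sketch below assumes "4K" means UHD 3840×2160 and "720p" means 1280×720; the helper names are hypothetical.

```python
UHD_4K = (3840, 2160)  # assumption: "4K" here means UHD


def required_scale(src, target=UHD_4K):
    """Smallest uniform scale at which src (w, h) covers target (w, h)."""
    return max(target[0] / src[0], target[1] / src[1])


def upscaled_size(src, factor):
    """Output size after a uniform upscale by `factor`."""
    return (src[0] * factor, src[1] * factor)


hd = (1280, 720)
square = (512, 512)

scale_720p = required_scale(hd)        # 3.0: a full 4x overshoots UHD
size_720p_4x = upscaled_size(hd, 4)    # (5120, 2880)
scale_512 = required_scale(square)     # 7.5: one 4x pass falls short of UHD width
size_512_4x = upscaled_size(square, 4) # (2048, 2048)
```

So a 720p clip needs 3× to reach UHD (or 4× followed by a downscale), while a 512×512 clip lands at 2048×2048 after one 4× pass and also needs a crop or pad to reach a 16:9 frame.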
6. Assemble
- OpenCut (#4027) — open-source AI video editor. Trim, splice, color-match generated clips. Avoids the export round-trip to a closed NLE.
- Generative Media Skills (#3602) — the muapi + npx skill installer that unifies a dozen commercial generation APIs behind one CLI. Useful when an agent needs to call "generate a 5-second clip" without picking a vendor every time.
How they fit together
Prompt ─► CogVideo / Open-Sora / Diffusers / Together API
                      │
                      ▼
                Raw 720p clips
                      │
  ControlNet ─────────┤ (optional: lock framing)
  AnimateDiff ────────┤ (optional: add motion to image)
  Motion Canvas ──────┤ (optional: deterministic moves)
                      ▼
       Real-ESRGAN ─► 4K upscaled clips
                      │
                      ▼
                   OpenCut
                      │
                      ▼
               Final cut (mp4)
The split that matters: diffusion models hallucinate motion, code-based tools dictate motion. Use diffusion (CogVideo, Open-Sora, AnimateDiff) when you want surprises and atmosphere. Use Motion Canvas when the camera path is non-negotiable and the audience will notice drift.
Tradeoffs you'll hit
- Local vs API — local generation is free per second but costs you GPU time and tuning. API generation is fast and high-quality but priced per second and locked behind quotas. Run local for iteration, API for the hero shots.
- CogVideo vs Open-Sora — CogVideo is more stable to set up and runs on lower VRAM. Open-Sora produces longer, more coherent shots when it works. Start with CogVideo; graduate when the gap matters.
- AnimateDiff vs native video models — AnimateDiff bolts motion onto SD checkpoints (huge ecosystem of styles, mediocre coherence). Native video models train end-to-end on video (cleaner motion, fewer styles). Pick by content: stylized → AnimateDiff, realistic → CogVideo / Open-Sora.
- Real-ESRGAN vs paid upscalers — Real-ESRGAN is free and good enough for most web delivery. Topaz Video AI and similar paid tools are sharper on faces but cost real money. Ship Real-ESRGAN first; upgrade only if reviewers complain.
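The local-vs-API tradeoff is mostly arithmetic. The sketch below compares the two for an iteration batch and a hero shot. Every price and timing in it is a made-up placeholder, not a quote from any provider; plug in your own numbers.

```python
def local_cost(clips, minutes_per_clip, gpu_dollars_per_hour):
    """Rented-GPU cost: you pay for wall-clock time, not output seconds."""
    return clips * minutes_per_clip / 60.0 * gpu_dollars_per_hour


def api_cost(clips, seconds_per_clip, dollars_per_second):
    """Commercial-API cost: you pay per second of generated video."""
    return clips * seconds_per_clip * dollars_per_second


# Hypothetical numbers for illustration only.
VARIANTS = 50             # prompt variants tested during iteration
MINUTES_PER_CLIP = 5      # assumed local generation time per clip
GPU_RATE = 2.00           # assumed $/hour for a rented GPU
API_RATE = 0.10           # assumed $/second of generated video
CLIP_SECONDS = 5

iterate_locally = local_cost(VARIANTS, MINUTES_PER_CLIP, GPU_RATE)
iterate_on_api = api_cost(VARIANTS, CLIP_SECONDS, API_RATE)
hero_shot_on_api = api_cost(1, CLIP_SECONDS, API_RATE)
```

With these placeholder rates, 50 local variants cost about a third of the same batch on the API, while a single hero shot on the API is cheap. That is the shape behind "local for iteration, API for the hero shots"; the exact crossover depends entirely on your rates.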
Common pitfalls
- VRAM math — budget roughly 24 GB for CogVideo-5B at full resolution and 40 GB or more for Open-Sora. Precision, CPU offloading, resolution, and clip length all move these numbers, so read the model card before renting a GPU.
- Prompt drift across frames — long shots from any diffusion model drift in identity and lighting after ~3 seconds. Generate in 3-second chunks and stitch in OpenCut rather than fighting the model for a 10-second take.
- Audio is separate — the open-source models in this pack output silent video, and you should not count on a commercial API returning audio that matches your cut. Plan a separate TTS / SFX pass; the assembly happens in OpenCut.
- Commercial-API terms — every commercial generation provider has different rules on commercial reuse, training opt-out, and watermarking. Read the TOS before publishing client work.
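The chunking advice for prompt drift is simple to plan up front. The sketch below splits a shot into windows of at most 3 seconds so each one can be generated separately and stitched afterwards. The function name and the default chunk length are illustrative; the 3-second figure is the rule of thumb from the pitfall above.

```python
def plan_chunks(total_seconds, chunk_seconds=3.0):
    """Split a shot into (start, end) windows no longer than chunk_seconds."""
    total = float(total_seconds)
    step = float(chunk_seconds)
    if total <= 0 or step <= 0:
        raise ValueError("durations must be positive")
    chunks = []
    start = 0.0
    while start < total:
        end = min(start + step, total)  # last chunk may be shorter
        chunks.append((start, end))
        start = end
    return chunks


ten_second_take = plan_chunks(10)
```

A 10-second take becomes three 3-second chunks plus a 1-second tail. Reusing the last frame of one chunk as the keyframe for the next is one way to keep identity consistent across the cuts.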
10 assets in this pack
Frequently asked questions
Which model should I start with — CogVideo or Open-Sora?
Start with CogVideo unless you already have a 40+ GB GPU and a reason to push for longer shots. CogVideo runs on a single 24 GB card, the documentation is more complete, and the failure modes are well understood. Move to Open-Sora when CogVideo's clip-length ceiling is the bottleneck — not before.
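That decision can be written down as a small rule. The sketch below encodes the thresholds this page uses (24 GB for CogVideo, 40 GB for Open-Sora, 16 GB for the smaller CogVideo variant). The function name is hypothetical, and routing anything under 16 GB to the commercial API is an assumption rather than something stated above.

```python
def pick_generator(vram_gb, need_long_shots=False):
    """Suggest a generation backend from available VRAM and shot length."""
    if vram_gb >= 40 and need_long_shots:
        return "Open-Sora"
    if vram_gb >= 24:
        return "CogVideo"
    if vram_gb >= 16:
        return "CogVideo (smaller variant) or commercial API"
    # Assumption: below 16 GB, skip local generation entirely.
    return "commercial API"
```

Note that a 40 GB card still gets CogVideo unless you explicitly ask for long shots, which matches the "not before" advice.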
Do I actually need both an open-source model and a commercial API skill?
Most working pipelines end up with both. Local models give you free iteration, deterministic seeds, and no per-second cost — great for testing 50 variants of a prompt. Commercial APIs (via the Together AI skill or Generative Media Skills) give you Sora / Veo / Runway-class output for the final hero shot. The split is iteration vs delivery.
How do I control the camera, not just the subject?
Two paths. For diffusion models, write camera verbs into the prompt ("slow dolly forward", "static lock-off", "orbit 90 degrees") — models trained on cinematic captions respond to film vocabulary. When you need exact framing or trajectory, switch to Motion Canvas and program the move in code, then composite the diffusion output into the framed shot.
Why ControlNet for video — isn't it for images?
ControlNet conditions a diffusion step on a structural signal — pose, depth, edges. When that step happens to be the first frame of a video generation, the entire clip inherits that composition. It is the cleanest way to keep generated video on-model when you have a specific framing in mind, especially for product or character shots where you cannot afford the model to reinvent the layout.
Can a single GPU machine actually run this whole pipeline?
Yes, if you sequence the stages instead of running them in parallel. Generate with CogVideo (24 GB), unload, then run Real-ESRGAN (~6 GB), then OpenCut on CPU. The bottleneck is the generation step; everything downstream is comparatively cheap. If you only have a 16 GB card, drop to CogVideo's smaller variant or call the commercial API for generation and keep upscale + edit local.
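The reason sequencing works is that peak VRAM becomes the maximum of the stages rather than their sum. A small sketch using the approximate figures from the answer above (the helper names are hypothetical):

```python
# (stage name, approximate VRAM in GB) using the figures quoted above.
STAGES = [
    ("generate (CogVideo)", 24),
    ("upscale (Real-ESRGAN)", 6),
    ("edit (OpenCut, CPU)", 0),
]


def peak_vram_sequential(stages):
    """Run one stage at a time and unload between stages: peak is the max."""
    return max(gb for _, gb in stages)


def peak_vram_parallel(stages):
    """Keep every stage resident at once: peak is the sum."""
    return sum(gb for _, gb in stages)


def fits(stages, card_gb, sequential=True):
    """True if the pipeline fits on a card with card_gb of VRAM."""
    peak = peak_vram_sequential(stages) if sequential else peak_vram_parallel(stages)
    return peak <= card_gb
```

On a 24 GB card the sequential plan fits (peak 24 GB) and the parallel plan does not (peak 30 GB).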