Template: YouTube Montage (v1.1.0)
A cinematic storytelling video that weaves short clips sourced from YouTube into a narrative held together by single-pass cloned voiceover and on-screen captions. Used for evolution stories, retrospectives, "state of X" pieces, founder origin stories, and anything where the visual material exists publicly and the editorial value is in the curation plus narration.
Canonical sample output: generations/20260411-080631/output.mp4 — "The Evolution of AI," 108.7s, 16 scenes, 12 YouTube clips, approved review. Study this before making your own.
When to use
- "Evolution of [topic]" videos
- "History of [topic]" videos
- "What just happened with [topic]" recaps
- "State of [industry]" essays
- News montages tied to a thesis
- Cultural moment explainers
- Founder / company origin stories built from archival footage
- Any video where the team isn't shooting original footage and the b-roll comes from public sources
When NOT to use
- Product demos (use `zavis-template-product-spotlight` once it exists)
- Brand films with custom shoots (go custom, not template)
- Talking-head research reports (use `zavis-template-research-talkinghead` once it exists)
- 5-10 second social teasers (the montage template needs at least 30s to breathe)
The brief this template needs
```yaml
title: string          # Public-facing title
topic: string          # 1-2 sentences describing the video
intent: string         # Why the video exists — what it should LAND in the viewer's head
duration: number       # Target length 30-180s (will be auto-rebudgeted to fit narration)
aspectRatio: "16:9" | "9:16" | "1:1" | "4:5"
emphasis?: string[]    # 3-5 key beats the script must hit
avoid?: string[]       # Framings, tones, phrasings to stay away from
references?: string[]  # Optional reference videos or articles
```
If the user's prompt is missing title, topic, intent, or aspectRatio, ASK — don't guess. Everything else can be inferred from a strong intent.
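As a minimal sketch of the "ask, don't guess" rule — the interface and function names here are illustrative (the real pipeline validates scripts with zod via `validate_script`), but the core-field set matches the brief spec above:

```typescript
// Hypothetical brief shape mirroring the spec above; field names are
// assumptions for illustration, not the pipeline's actual types.
interface MontageBrief {
  title?: string;
  topic?: string;
  intent?: string;
  duration?: number;
  aspectRatio?: "16:9" | "9:16" | "1:1" | "4:5";
  emphasis?: string[];
  avoid?: string[];
  references?: string[];
}

// Return the core fields the agent must ask the user for rather than infer.
function missingCoreFields(brief: MontageBrief): string[] {
  const core: (keyof MontageBrief)[] = ["title", "topic", "intent", "aspectRatio"];
  return core.filter((k) => brief[k] === undefined || brief[k] === "");
}
```

A brief containing only a title would return `["topic", "intent", "aspectRatio"]`, which maps directly to the clarifying questions the agent should ask.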
The two clip modes (critical distinction)
Every scene with a YouTube clip declares one of two modes.
Mode A — Source-audio (modes: ["source-audio"])
The viewer hears the original audio from the clip. Used when the audio IS the content — a news anchor saying a keyword, a person delivering a memorable line, a crowd reaction. Usually short (1-3s). Narration automatically opts out during these scenes. The trim is centered on the keyword found in the YouTube transcript.
Mode B — Muted b-roll (modes: ["muted-broll"])
The clip's audio is muted (both the muted prop AND volume=0 — belt and suspenders). Narration plays over it. Used when the visual is the value and the audio would clash. Most body scenes are this mode.
The script must declare the mode for each clip. A scene with `mode: "source-audio"` and a long narration is a contradiction — the preflight check will reject it.
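The contradiction check can be sketched as follows. This is a minimal illustration, not the actual preflight implementation; the word threshold is an assumption chosen to reflect that source-audio scenes are usually 1-3s:

```typescript
type ClipMode = "source-audio" | "muted-broll";

interface SceneCheck {
  id: string;
  mode: ClipMode;
  narration: string; // empty string when the scene has no VO
}

// Illustrative threshold: a source-audio scene with more than a few words
// of narration is contradictory, since the viewer cannot hear both tracks.
const MAX_SOURCE_AUDIO_NARRATION_WORDS = 3;

// Return the ids of scenes the preflight check would reject.
function findModeContradictions(scenes: SceneCheck[]): string[] {
  return scenes
    .filter(
      (s) =>
        s.mode === "source-audio" &&
        s.narration.trim().split(/\s+/).filter(Boolean).length >
          MAX_SOURCE_AUDIO_NARRATION_WORDS
    )
    .map((s) => s.id);
}
```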
The two-tier query strategy (quality lever #1)
This is the most important decision in the entire template.
Tier 1 — Named entities
The narration mentions a specific recognizable person, product, event, or company. The viewer expects to see that specific thing. Showing generic b-roll here is almost criminal.
Rule: query with the entity name itself, plus a disambiguator.
Examples from the canonical sample:
- "Sam Altman" → `Sam Altman OpenAI interview 2021` → OpenAI's own Scholars Demo Day footage
- "Geoffrey Hinton" → `Geoffrey Hinton deep learning interview university Toronto` → UofT's own Hinton video (86% relevance, trusted-source boost)
- "DeepMind AlphaGo" → `DeepMind AlphaGo Lee Sedol match 2016 press conference` → actual Arirang News coverage of the match
- "ChatGPT launch" → `ChatGPT launch November 2022 news coverage` → BBC News
- "GPT-4" → `GPT-4 demo OpenAI launch March 2023` → launch coverage
Tier 2 — Concept / atmosphere
The narration describes a feeling, metaphor, or abstract moment. The viewer has no specific expectation — they need a visual that MATCHES THE EMOTION.
Rule: query with a visual shot description — subject + action + composition + lighting.
Examples:
- "The AI winter" → `snow winter empty street cinematic slow` (atmospheric match)
- "Dartmouth 1956" → `1950s computer scientists laboratory archival footage black and white`
- "Modern data centers" → `modern data center servers blue lights aerial`
How to decide
Ask: "Is there a specific thing the viewer will feel cheated if they don't see?"
- Yes → Tier 1
- No → Tier 2
- When in doubt, err toward Tier 1 for named entities post-2015, and toward Tier 2 for historical or abstract moments.
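The two tiers imply two different query shapes, which can be sketched as a small builder. The type and field names are hypothetical, for illustration only:

```typescript
type QueryTier = "named-entity" | "concept";

// Hypothetical input shape: Tier 1 carries an entity plus a disambiguator,
// Tier 2 carries a visual shot description (subject + action + composition + lighting).
interface ClipQueryInput {
  tier: QueryTier;
  entity?: string;          // Tier 1, e.g. "Geoffrey Hinton"
  disambiguator?: string;   // Tier 1, e.g. "deep learning interview university Toronto"
  shotDescription?: string; // Tier 2, e.g. "snow winter empty street cinematic slow"
}

// Build the YouTube search query for a scene according to its tier.
function buildClipQuery(q: ClipQueryInput): string {
  if (q.tier === "named-entity") {
    return [q.entity, q.disambiguator].filter(Boolean).join(" ");
  }
  return q.shotDescription ?? "";
}
```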
See skills/zavis-clip-curator/SKILL.md for the full discipline.
Narration discipline (quality lever #2)
Single-pass by default
Use the zavis narrator profile. The pipeline generates the entire video's narration in ONE ElevenLabs /with-timestamps call, then cuts it into per-scene clips using the returned character alignment. One API call, continuous flow, no between-scene tonal drift.
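The cut step can be sketched as follows. The alignment field names here are illustrative, not the exact ElevenLabs response shape; the point is mapping a scene's character range in the concatenated script to a time window in the single take:

```typescript
// Illustrative shape only — the real /with-timestamps response differs.
interface CharAlignment {
  chars: string[];
  charStartTimesSec: number[];
  charEndTimesSec: number[];
}

// Given where a scene's text sits in the concatenated narration script,
// derive the window to cut its per-scene VO clip from the single take.
function sceneCutTimes(
  alignment: CharAlignment,
  charStart: number, // inclusive index into the concatenated script
  charEnd: number    // exclusive
): { startSec: number; endSec: number } {
  return {
    startSec: alignment.charStartTimesSec[charStart],
    endSec: alignment.charEndTimesSec[charEnd - 1],
  };
}
```

Each window then becomes one ffmpeg trim against the single narration file, so every per-scene clip inherits the same continuous delivery.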
Voice settings (locked for v1.1.0)
```js
stability: 0.78,        // high — consistent scripted energy
similarity_boost: 0.9,  // high — locked voice identity
style: 0.15,            // low — no filler pauses or breath sounds
use_speaker_boost: true,
tempo: 0.85             // ffmpeg atempo post-process, pitch-preserving
```
Lowering stability or raising style makes the clone insert filler pauses. DO NOT TUNE without reading the v4 failure log.
Writing rules
- Full sentences, not period splits. "First came. GPT one. Then three." → WRONG. "A small lab called OpenAI started scaling it up. GPT one. GPT two. GPT three." → RIGHT. The trailing list is intentional list-rhythm.
- No ellipses. The pipeline already strips them; never add them.
- Parallel narrative rule: nothing the narrator says should also be on screen as text. Narrator adds context/consequence/contrast — the visual + caption carries the literal content.
- First sentence rule: under 8 words, concrete noun, curiosity loop. Example from canonical: "Seventy years ago a handful of people decided that machines could think."
- Read it aloud. If you stumble, the narrator will too.
See skills/zavis-narration-director/SKILL.md for the full discipline.
The 6-stage pipeline (what happens when the orchestrator runs)
User brief
│
▼
┌────────────────────────────────┐
│ 1. Source YouTube clips │ ~3 min
│ yt-dlp search → blacklist │
│ → relevance ≥20% → rank │
│ → download → OCR check │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 2. Single-pass narration │ ~25 sec
│ concat → ElevenLabs │
│ /with-timestamps → atempo │
│ → ffmpeg cut → alignment │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 3. Auto-rebudget scenes │ <1 sec
│ measure VO durations, │
│ extend overflowing slots, │
│ recompute startSec │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 4. Preflight audio review │ <1 sec
│ NON-BYPASSABLE check: │
│ no VO overlap, no overflow │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 5. Render with Remotion │ ~4-5 min
│ inputProps includes │
│ alignmentSlices for sync │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 6. Deterministic review │ ~8 sec
│ black frames, audio clip, │
│ fps, resolution, duration │
└────────────────────────────────┘
│
▼
Approved output OR iterate

See packages/templates/youtube-montage/manifest.json for the machine-readable pipeline spec.
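Stage 3 (auto-rebudget) is simple enough to sketch directly. The slot shape and padding constant are assumptions for illustration; the logic — extend any slot its narration overflows, then recompute contiguous `startSec` values — matches the stage description above:

```typescript
// Hypothetical slot shape; field names are illustrative.
interface SceneSlot {
  id: string;
  startSec: number;
  durationSec: number;
  voDurationSec: number; // measured length of this scene's narration clip
}

const VO_PADDING_SEC = 0.25; // assumed breathing room after each VO clip

// Extend overflowing slots and rebuild a contiguous timeline.
function rebudget(slots: SceneSlot[]): SceneSlot[] {
  let cursor = 0;
  return slots.map((s) => {
    const durationSec = Math.max(s.durationSec, s.voDurationSec + VO_PADDING_SEC);
    const out = { ...s, startSec: cursor, durationSec };
    cursor += durationSec;
    return out;
  });
}
```

Note that extending one slot pushes every later slot's `startSec` forward, which is why the video lands near, not exactly on, the target duration.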
The skills you should load (in order)
When asked to make a YouTube montage:
- `zavis-master` (global) — Zavis brand context
- `zavis-video-os` — system architecture + core principles
- `zavis-taste-dna` — visual aesthetic, X& reference, typography
- `zavis-template-youtube-montage` (this skill) — authoritative template reference
- `zavis-story-writer` — beat sheet, three-act arc, tension curve
- `zavis-hook-engineer` — first 3 seconds
- `zavis-engagement-engineer` — retention budget, CTA integration
- `zavis-narration-director` — single-pass narration rules
- `zavis-clip-curator` — two-tier query strategy (THIS IS THE #1 QUALITY LEVER)
- `zavis-video-reviewer` — post-render review discipline
If the user's brief is vague on the core inputs, load `zavis-template-router` (if it exists) and ask clarifying questions.
The agent flow (what Claude does when invoked via the MCP)
1. Gather brief from user (title, topic, intent, duration, aspectRatio)
2. Ask clarifying questions only if core fields are missing
3. MCP: list_templates → confirm youtube-montage
4. MCP: get_template("youtube-montage") → read manifest
5. MCP: get_template_playbook("youtube-montage") → load THE playbook into context
6. MCP: get_skill(...) x 5-6 → load specialist skills
7. Author beat sheet in your own reasoning (story-writer discipline)
8. Author hook LAST (hook-engineer discipline)
9. Author narration as full sentences (narration-director discipline)
10. Author Tier 1/Tier 2 queries per scene (clip-curator discipline)
11. Assemble the full ScriptDocument JSON in your reasoning
12. MCP: validate_script(id, script) → dry-run zod + contiguity
13. If errors: fix and re-validate (loop 12-13)
14. If valid: (optional) show summary to user for approval
15. MCP: run_video_pipeline(id, script) → get runId back
16. MCP: get_run_status(runId) on a loop → report progress
17. If review flags issues: read get_run_status output, fix the offending
scenes in the script, call validate_script + run_video_pipeline again
18. MCP: get_generation(runId) → return video URL to user

Critical: the MCP server does NOT write scripts. You (the calling Claude) write the ScriptDocument in your own reasoning after loading the playbook + skills. This is the correct MCP pattern — tools are data/actions, the client is the agent.
What the composition does at render time
The Remotion composition at packages/templates/youtube-montage/src/Composition.tsx:

- Reads `script`, `clipPaths`, `voiceoverPaths`, and `alignmentSlices` from `inputProps`.
- For each scene, dispatches based on scene type:
  - `hook` (color-fill) → black AbsoluteFill (cold open)
  - `hook` (title-card) / `title-card` / `cta` → `<HeroCard>` (big text on dark background)
  - `outro` → `<EndCard>` (Zavis logo + integrated tagline + fade out)
  - `body` → `<ClipScene>` (full-bleed video, dark-graded, vignette, grain) PLUS a small `<LowerThird>` corner legend if textOverlay is present. Body is CLIP-PRIMARY.
- Layers `<Captions>` across the whole video — consumes `alignmentSlices` for exact per-phrase timing.
- Layers a music track (if specified) as a quiet ambient bed at volume 0.14.
- Layers all VO clips, each wrapped in a `<Sequence>` with `durationInFrames` matching its scene window — this guarantees a long VO file cannot bleed past its scene boundary.

Audio routing is strict:

- Background videos use the `muted` prop AND `volume={0}` (belt + suspenders).
- Source-audio scenes opt out of narration playback for their window.
- Music plays under the whole video at fixed low volume.
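The scene-boundary guarantee is just frame math. A minimal sketch, assuming a 30fps composition (the real value comes from the composition config) and omitting the Remotion components themselves:

```typescript
const FPS = 30; // assumed for illustration; read the real value from the composition

// Convert a scene's second window into the frame props of the <Sequence>
// wrapping its VO clip. Because the Sequence's durationInFrames equals the
// scene window, a long VO file is clipped at the scene boundary by construction.
function voSequenceProps(startSec: number, durationSec: number) {
  return {
    from: Math.round(startSec * FPS),
    durationInFrames: Math.round(durationSec * FPS),
  };
}
```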
End card rule
The EndCard component suppresses any tagline that normalizes to "zavis" (case-insensitive, period-stripped). This prevents rendering a duplicate wordmark alongside the logo image.
- The logo asset (`logo/zavis-logo-light.svg`) IS the "zavis." wordmark with the green dot.
- The tagline is reserved for an INTEGRATED message that extends the video's argument — not a brand shout.
- A subtagline in small uppercase letter-spaced text can add a short context line.
NEVER pass tagline: "ZAVIS" to EndCard. It gets suppressed, and you end up with an orphaned logo with no accompanying line.
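The suppression rule described above can be sketched in a few lines (the function name is illustrative, not the actual EndCard internals):

```typescript
// Decide whether EndCard should suppress a tagline: anything that
// normalizes to "zavis" (case-insensitive, period-stripped) is dropped
// so the wordmark never appears twice on the end card.
function shouldSuppressTagline(tagline: string): boolean {
  return tagline.trim().replace(/\./g, "").toLowerCase() === "zavis";
}
```

So `"ZAVIS."` is suppressed while an integrated line like `"Video is a language."` renders normally.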
Template-specific rubric (on top of the universal one)
The reviewer must check:
- [ ] Every source-audio clip's keyword is actually heard at the trim point
- [ ] Every muted b-roll clip's visual is on-topic (Tier 1 entities recognizable, Tier 2 concepts atmospherically aligned)
- [ ] No clip is longer than the scene it's placed in
- [ ] Title cards aren't held longer than 4 seconds
- [ ] The first 3 seconds contain at least one strong hook element (cut OR text OR dramatic sound)
- [ ] The narrator never starts speaking during a source-audio clip
- [ ] Captions land within 100ms of the spoken word (alignment-driven — easy to verify by scrubbing)
- [ ] End card shows exactly ONE Zavis wordmark
- [ ] No visible watermarks, channel bugs, or "click to download" banners
- [ ] The video lands in 80-110% of target duration (auto-rebudget handles this)
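The duration check in the rubric above is deterministic and easy to express (a sketch, not the reviewer's actual code):

```typescript
// Pass if the rendered duration lands within 80-110% of the target.
function durationInBand(actualSec: number, targetSec: number): boolean {
  return actualSec >= 0.8 * targetSec && actualSec <= 1.1 * targetSec;
}
```

For a 110s target, the canonical sample's 108.7s sits comfortably inside the 88-121s band.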
See skills/zavis-video-reviewer/SKILL.md for the full post-render review decision tree.
Versioning
v1.1.0 (current, canonical sample: 20260411-080631)
Changes from v1.0.0:
- Tiered query strategy — named entities vs concept/atmosphere
- Source quality filter (`source-quality.ts`) — blacklist + trusted-source boost + lexical relevance gate
- Watermark OCR check (`watermark-check.ts`) — reject clips with persistent static text
- Caption sync from alignment data — no more drift
- Naturalized narration — no ellipses, full sentences, tuned voice settings
- End-card discipline — single Zavis wordmark, integrated tagline
- Duration flexibility — soft targets, auto-rebudget extends to fit narration
v1.0.0
- Initial release: horizontal aspect ratio, montage, narration, captions
Future work
- Animated title cards with brand-specific transitions
- Smarter music selection with mood-matching
- Stem separation to keep BG ambience from source clips while replacing dialogue
- Multi-language narration
- Vision-based clip relevance scoring (GPT-4V pass after OCR for final sanity check)
- Pexels-first variant for atmosphere-heavy scripts (NOT for this template — YouTube is the point here)