Template: YouTube Montage (v1.1.0)
A cinematic storytelling video that weaves short clips sourced from YouTube into a narrative held together by single-pass cloned voiceover and on-screen captions. Used for evolution stories, retrospectives, "state of X" pieces, founder origin stories, and anything where the visual material exists publicly and the editorial value is in the curation plus narration.
Canonical sample output: generations/20260411-080631/output.mp4 — "The Evolution of AI," 108.7s, 16 scenes, 12 YouTube clips, approved review. Study this before making your own.
When to use
- "Evolution of [topic]" videos
- "History of [topic]" videos
- "What just happened with [topic]" recaps
- "State of [industry]" essays
- News montages tied to a thesis
- Cultural moment explainers
- Founder / company origin stories built from archival footage
- Any video where the team isn't shooting original footage and the b-roll comes from public sources
When NOT to use
- Product demos (use `zavis-template-product-spotlight` once it exists)
- Brand films with custom shoots (go custom, not template)
- Talking-head research reports (use `zavis-template-research-talkinghead` once it exists)
- 5-10 second social teasers (the montage template needs at least 30s to breathe)
The brief this template needs
```yaml
title: string          # Public-facing title
topic: string          # 1-2 sentences describing the video
intent: string         # Why the video exists — what it should LAND in the viewer's head
duration: number       # Target length 30-180s (will be auto-rebudgeted to fit narration)
aspectRatio: "16:9" | "9:16" | "1:1" | "4:5"
emphasis?: string[]    # 3-5 key beats the script must hit
avoid?: string[]       # Framings, tones, phrasings to stay away from
references?: string[]  # Optional reference videos or articles
```
If the user's prompt is missing title, topic, intent, or aspectRatio, ASK — don't guess. Everything else can be inferred from a strong intent.
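As a minimal sketch of the "ask, don't guess" rule — the interface and function names here are illustrative (the real pipeline validates scripts with zod via `validate_script`), but the core-field set matches the brief spec above:

```typescript
// Hypothetical brief shape mirroring the spec above; field names are
// assumptions for illustration, not the pipeline's actual types.
interface MontageBrief {
  title?: string;
  topic?: string;
  intent?: string;
  duration?: number;
  aspectRatio?: "16:9" | "9:16" | "1:1" | "4:5";
  emphasis?: string[];
  avoid?: string[];
  references?: string[];
}

// Return the core fields the agent must ask the user for rather than infer.
function missingCoreFields(brief: MontageBrief): string[] {
  const core: (keyof MontageBrief)[] = ["title", "topic", "intent", "aspectRatio"];
  return core.filter((k) => brief[k] === undefined || brief[k] === "");
}
```

A brief containing only a title would return `["topic", "intent", "aspectRatio"]`, which maps directly to the clarifying questions the agent should ask.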
The two clip modes (critical distinction)
Every scene with a YouTube clip declares one of two modes.
Mode A — Source-audio (modes: ["source-audio"])
The viewer hears the original audio from the clip. Used when the audio IS the content — a news anchor saying a keyword, a person delivering a memorable line, a crowd reaction. Usually short (1-3s). Narration automatically opts out during these scenes. The trim is centered on the keyword found in the YouTube transcript.
Mode B — Muted b-roll (modes: ["muted-broll"])
The clip's audio is muted (both the muted prop AND volume=0 — belt and suspenders). Narration plays over it. Used when the visual is the value and the audio would clash. Most body scenes are this mode.
The script must declare the mode for each clip. A scene with `mode: "source-audio"` and a long narration is a contradiction — the preflight check will reject it.
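The contradiction check can be sketched as follows. This is a minimal illustration, not the actual preflight implementation; the word threshold is an assumption chosen to reflect that source-audio scenes are usually 1-3s:

```typescript
type ClipMode = "source-audio" | "muted-broll";

interface SceneCheck {
  id: string;
  mode: ClipMode;
  narration: string; // empty string when the scene has no VO
}

// Illustrative threshold: a source-audio scene with more than a few words
// of narration is contradictory, since the viewer cannot hear both tracks.
const MAX_SOURCE_AUDIO_NARRATION_WORDS = 3;

// Return the ids of scenes the preflight check would reject.
function findModeContradictions(scenes: SceneCheck[]): string[] {
  return scenes
    .filter(
      (s) =>
        s.mode === "source-audio" &&
        s.narration.trim().split(/\s+/).filter(Boolean).length >
          MAX_SOURCE_AUDIO_NARRATION_WORDS
    )
    .map((s) => s.id);
}
```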
The two-tier query strategy (quality lever #1)
This is the most important decision in the entire template.
Tier 1 — Named entities
The narration mentions a specific recognizable person, product, event, or company. The viewer expects to see that specific thing. Showing generic b-roll here is almost criminal.
Rule: query with the entity name itself, plus a disambiguator.
Examples from the canonical sample:
- "Sam Altman" → `Sam Altman OpenAI interview 2021` → OpenAI's own Scholars Demo Day footage
- "Geoffrey Hinton" → `Geoffrey Hinton deep learning interview university Toronto` → UofT's own Hinton video (86% relevance, trusted-source boost)
- "DeepMind AlphaGo" → `DeepMind AlphaGo Lee Sedol match 2016 press conference` → actual Arirang News coverage of the match
- "ChatGPT launch" → `ChatGPT launch November 2022 news coverage` → BBC News
- "GPT-4" → `GPT-4 demo OpenAI launch March 2023` → launch coverage
Tier 2 — Concept / atmosphere
The narration describes a feeling, metaphor, or abstract moment. The viewer has no specific expectation — they need a visual that MATCHES THE EMOTION.
Rule: query with a visual shot description — subject + action + composition + lighting.
Examples:
- "The AI winter" → `snow winter empty street cinematic slow` (atmospheric match)
- "Dartmouth 1956" → `1950s computer scientists laboratory archival footage black and white`
- "Modern data centers" → `modern data center servers blue lights aerial`
How to decide
Ask: "Is there a specific thing the viewer will feel cheated if they don't see?"
- Yes → Tier 1
- No → Tier 2
- When in doubt, err toward Tier 1 for named entities post-2015, and toward Tier 2 for historical or abstract moments.
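The two tiers imply two different query shapes, which can be sketched as a small builder. The type and field names are hypothetical, for illustration only:

```typescript
type QueryTier = "named-entity" | "concept";

// Hypothetical input shape: Tier 1 carries an entity plus a disambiguator,
// Tier 2 carries a visual shot description (subject + action + composition + lighting).
interface ClipQueryInput {
  tier: QueryTier;
  entity?: string;          // Tier 1, e.g. "Geoffrey Hinton"
  disambiguator?: string;   // Tier 1, e.g. "deep learning interview university Toronto"
  shotDescription?: string; // Tier 2, e.g. "snow winter empty street cinematic slow"
}

// Build the YouTube search query for a scene according to its tier.
function buildClipQuery(q: ClipQueryInput): string {
  if (q.tier === "named-entity") {
    return [q.entity, q.disambiguator].filter(Boolean).join(" ");
  }
  return q.shotDescription ?? "";
}
```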
See skills/zavis-clip-curator/SKILL.md for the full discipline.
Narration discipline (quality lever #2)
Single-pass by default
Use the zavis narrator profile. The pipeline generates the entire video's narration in ONE ElevenLabs /with-timestamps call, then cuts it into per-scene clips using the returned character alignment. One API call, continuous flow, no between-scene tonal drift.
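The cut step can be sketched as follows. The alignment field names here are illustrative, not the exact ElevenLabs response shape; the point is mapping a scene's character range in the concatenated script to a time window in the single take:

```typescript
// Illustrative shape only — the real /with-timestamps response differs.
interface CharAlignment {
  chars: string[];
  charStartTimesSec: number[];
  charEndTimesSec: number[];
}

// Given where a scene's text sits in the concatenated narration script,
// derive the window to cut its per-scene VO clip from the single take.
function sceneCutTimes(
  alignment: CharAlignment,
  charStart: number, // inclusive index into the concatenated script
  charEnd: number    // exclusive
): { startSec: number; endSec: number } {
  return {
    startSec: alignment.charStartTimesSec[charStart],
    endSec: alignment.charEndTimesSec[charEnd - 1],
  };
}
```

Each window then becomes one ffmpeg trim against the single narration file, so every per-scene clip inherits the same continuous delivery.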
Voice settings (locked for v1.1.0)
```js
stability: 0.78,        // high — consistent scripted energy
similarity_boost: 0.9,  // high — locked voice identity
style: 0.15,            // low — no filler pauses or breath sounds
use_speaker_boost: true,
tempo: 0.85             // ffmpeg atempo post-process, pitch-preserving
```
Lowering stability or raising style makes the clone insert filler pauses. DO NOT TUNE without reading the v4 failure log.
Writing rules
- Full sentences, not period splits. "First came. GPT one. Then three." → WRONG. "A small lab called OpenAI started scaling it up. GPT one. GPT two. GPT three." → RIGHT. The trailing list is intentional list-rhythm.
- No ellipses. The pipeline already strips them; never add them.
- Parallel narrative rule: nothing the narrator says should also be on screen as text. Narrator adds context/consequence/contrast — the visual + caption carries the literal content.
- First sentence rule: under 8 words, concrete noun, curiosity loop. Example from canonical: "Seventy years ago a handful of people decided that machines could think."
- Read it aloud. If you stumble, the narrator will too.
See skills/zavis-narration-director/SKILL.md for the full discipline.
The 6-stage pipeline (what happens when the orchestrator runs)
User brief
│
▼
┌────────────────────────────────┐
│ 1. Source YouTube clips │ ~3 min
│ yt-dlp search → blacklist │
│ → relevance ≥20% → rank │
│ → download → OCR check │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 2. Single-pass narration │ ~25 sec
│ concat → ElevenLabs │
│ /with-timestamps → atempo │
│ → ffmpeg cut → alignment │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 3. Auto-rebudget scenes │ <1 sec
│ measure VO durations, │
│ extend overflowing slots, │
│ recompute startSec │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 4. Preflight audio review │ <1 sec
│ NON-BYPASSABLE check: │
│ no VO overlap, no overflow │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 5. Render with Remotion │ ~4-5 min
│ inputProps includes │
│ alignmentSlices for sync │
└────────────────────────────────┘
│
▼
┌────────────────────────────────┐
│ 6. Deterministic review │ ~8 sec
│ black frames, audio clip, │
│ fps, resolution, duration │
└────────────────────────────────┘
│
▼
Approved output OR iterate

See packages/templates/youtube-montage/manifest.json for the machine-readable pipeline spec.
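Stage 3 (auto-rebudget) is simple enough to sketch directly. The slot shape and padding constant are assumptions for illustration; the logic — extend any slot its narration overflows, then recompute contiguous `startSec` values — matches the stage description above:

```typescript
// Hypothetical slot shape; field names are illustrative.
interface SceneSlot {
  id: string;
  startSec: number;
  durationSec: number;
  voDurationSec: number; // measured length of this scene's narration clip
}

const VO_PADDING_SEC = 0.25; // assumed breathing room after each VO clip

// Extend overflowing slots and rebuild a contiguous timeline.
function rebudget(slots: SceneSlot[]): SceneSlot[] {
  let cursor = 0;
  return slots.map((s) => {
    const durationSec = Math.max(s.durationSec, s.voDurationSec + VO_PADDING_SEC);
    const out = { ...s, startSec: cursor, durationSec };
    cursor += durationSec;
    return out;
  });
}
```

Note that extending one slot pushes every later slot's `startSec` forward, which is why the video lands near, not exactly on, the target duration.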
The skills you should load (in order)
When asked to make a YouTube montage:
- `zavis-master` (global) — Zavis brand context
- `zavis-video-os` — system architecture + core principles
- `zavis-taste-dna` — visual aesthetic, X& reference, typography
- `zavis-template-youtube-montage` (this skill) — authoritative template reference
- `zavis-story-writer` — beat sheet, three-act arc, tension curve
- `zavis-hook-engineer` — first 3 seconds
- `zavis-engagement-engineer` — retention budget, CTA integration
- `zavis-narration-director` — single-pass narration rules
- `zavis-clip-curator` — two-tier query strategy (THIS IS THE #1 QUALITY LEVER)
- `zavis-video-reviewer` — post-render review discipline
If the user's brief is vague on the core inputs, load `zavis-template-router` (if it exists) and ask clarifying questions.
The agent flow (what Claude does when invoked via the MCP)
1. Gather brief from user (title, topic, intent, duration, aspectRatio)
2. Ask clarifying questions only if core fields are missing
3. MCP: list_templates → confirm youtube-montage
4. MCP: get_template("youtube-montage") → read manifest
5. MCP: get_template_playbook("youtube-montage") → load THE playbook into context
6. MCP: get_skill(...) x 5-6 → load specialist skills
7. Author beat sheet in your own reasoning (story-writer discipline)
8. Author hook LAST (hook-engineer discipline)
9. Author narration as full sentences (narration-director discipline)
10. Author Tier 1/Tier 2 queries per scene (clip-curator discipline)
11. Assemble the full ScriptDocument JSON in your reasoning
12. MCP: validate_script(id, script) → dry-run zod + contiguity
13. If errors: fix and re-validate (loop 12-13)
14. If valid: (optional) show summary to user for approval
15. MCP: run_video_pipeline(id, script) → get runId back
16. MCP: get_run_status(runId) on a loop → report progress
17. If review flags issues: read get_run_status output, fix the offending
scenes in the script, call validate_script + run_video_pipeline again
18. MCP: get_generation(runId) → return video URL to user

Critical: the MCP server does NOT write scripts. You (the calling Claude) write the ScriptDocument in your own reasoning after loading the playbook + skills. This is the correct MCP pattern — tools are data/actions, the client is the agent.
What the composition does at render time
The Remotion composition at packages/templates/youtube-montage/src/Composition.tsx:

- Reads `script`, `clipPaths`, `voiceoverPaths`, and `alignmentSlices` from `inputProps`.
- For each scene, dispatches based on scene type:
  - `hook` (color-fill) → black AbsoluteFill (cold open)
  - `hook` (title-card) / `title-card` / `cta` → `<HeroCard>` (big text on dark background)
  - `outro` → `<EndCard>` (Zavis logo + integrated tagline + fade out)
  - `body` → `<ClipScene>` (full-bleed video, dark-graded, vignette, grain) PLUS a small `<LowerThird>` corner legend if textOverlay is present. Body is CLIP-PRIMARY.
- Layers `<Captions>` across the whole video — consumes `alignmentSlices` for exact per-phrase timing.
- Layers a music track (if specified) as a quiet ambient bed at volume 0.14.
- Layers all VO clips, each wrapped in a `<Sequence>` with `durationInFrames` matching its scene window — this guarantees a long VO file cannot bleed past its scene boundary.

Audio routing is strict:

- Background videos use the `muted` prop AND `volume={0}` (belt + suspenders).
- Source-audio scenes opt out of narration playback for their window.
- Music plays under the whole video at fixed low volume.
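The scene-boundary guarantee is just frame math. A minimal sketch, assuming a 30fps composition (the real value comes from the composition config) and omitting the Remotion components themselves:

```typescript
const FPS = 30; // assumed for illustration; read the real value from the composition

// Convert a scene's second window into the frame props of the <Sequence>
// wrapping its VO clip. Because the Sequence's durationInFrames equals the
// scene window, a long VO file is clipped at the scene boundary by construction.
function voSequenceProps(startSec: number, durationSec: number) {
  return {
    from: Math.round(startSec * FPS),
    durationInFrames: Math.round(durationSec * FPS),
  };
}
```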
End card rule
The EndCard component suppresses any tagline that normalizes to "zavis" (case-insensitive, period-stripped). This prevents rendering a duplicate wordmark alongside the logo image.
- The logo asset (`logo/zavis-logo-light.svg`) IS the "zavis." wordmark with the green dot.
- The tagline is reserved for an INTEGRATED message that extends the video's argument — not a brand shout.
- A subtagline in small uppercase letter-spaced text can add a short context line.
NEVER pass tagline: "ZAVIS" to EndCard. It gets suppressed, and you end up with an orphaned logo with no accompanying line.
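The suppression rule described above can be sketched in a few lines (the function name is illustrative, not the actual EndCard internals):

```typescript
// Decide whether EndCard should suppress a tagline: anything that
// normalizes to "zavis" (case-insensitive, period-stripped) is dropped
// so the wordmark never appears twice on the end card.
function shouldSuppressTagline(tagline: string): boolean {
  return tagline.trim().replace(/\./g, "").toLowerCase() === "zavis";
}
```

So `"ZAVIS."` is suppressed while an integrated line like `"Video is a language."` renders normally.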
Template-specific rubric (on top of the universal one)
The reviewer must check:
- [ ] Every source-audio clip's keyword is actually heard at the trim point
- [ ] Every muted b-roll clip's visual is on-topic (Tier 1 entities recognizable, Tier 2 concepts atmospherically aligned)
- [ ] No clip is longer than the scene it's placed in
- [ ] Title cards aren't held longer than 4 seconds
- [ ] The first 3 seconds contain at least one strong hook element (cut OR text OR dramatic sound)
- [ ] The narrator never starts speaking during a source-audio clip
- [ ] Captions land within 100ms of the spoken word (alignment-driven — easy to verify by scrubbing)
- [ ] End card shows exactly ONE Zavis wordmark
- [ ] No visible watermarks, channel bugs, or "click to download" banners
- [ ] The video lands in 80-110% of target duration (auto-rebudget handles this)
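The duration check in the rubric above is deterministic and easy to express (a sketch, not the reviewer's actual code):

```typescript
// Pass if the rendered duration lands within 80-110% of the target.
function durationInBand(actualSec: number, targetSec: number): boolean {
  return actualSec >= 0.8 * targetSec && actualSec <= 1.1 * targetSec;
}
```

For a 110s target, the canonical sample's 108.7s sits comfortably inside the 88-121s band.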
See skills/zavis-video-reviewer/SKILL.md for the full post-render review decision tree.
Versioning
v1.1.0 (current, canonical sample: 20260411-080631)
Changes from v1.0.0:
- Tiered query strategy — named entities vs concept/atmosphere
- Source quality filter (`source-quality.ts`) — blacklist + trusted-source boost + lexical relevance gate
- Watermark OCR check (`watermark-check.ts`) — reject clips with persistent static text
- Caption sync from alignment data — no more drift
- Naturalized narration — no ellipses, full sentences, tuned voice settings
- End-card discipline — single Zavis wordmark, integrated tagline
- Duration flexibility — soft targets, auto-rebudget extends to fit narration
v1.0.0
- Initial release: horizontal aspect ratio, montage, narration, captions
Future work
- Animated title cards with brand-specific transitions
- Smarter music selection with mood-matching
- Stem separation to keep BG ambience from source clips while replacing dialogue
- Multi-language narration
- Vision-based clip relevance scoring (GPT-4V pass after OCR for final sanity check)
- Pexels-first variant for atmosphere-heavy scripts (NOT for this template — YouTube is the point here)