Zavis Video OS

YouTube Montage

v1.1.0

Real YouTube footage, single-pass cloned narration, frame-synced captions.

A cinematic storytelling video built from sourced YouTube clips, intercut with title cards and narration. Best for evolution stories, retrospectives, 'state of X' explainers, founder stories, and any editorial piece where the visual layer comes from public sources.

Read the Playbook →
16:9 · 9:16 · 1:1 · 4:5 · 30–180s (default 95s) @ 30fps · narrator: Zavis (cloned)
Canonical sample · The Evolution of AI
Run 20260411-080631 · approved · 0 critical issues
duration 108.7s · scenes 16 · clips 12 · vo clips 14 · 11labs chars 1,321 · render 4m 41s

Good for

  • Evolution / history of an industry or technology
  • State-of-the-field recap videos
  • Retrospective montages tied to a thesis
  • Founder / company origin stories built from archival footage
  • News-moment explainers with a clear editorial angle

Not this template

  • Product demos (use product-spotlight)
  • Original-shoot brand films
  • Talking-head research reports (use research-talkinghead)
  • Under-30-second social teasers

Inputs (what the brief needs)

  • title (string, required) — Public-facing title of the video (used in manifest + cards).
  • topic (string, required) — 1–2 sentences: what the video is about. This is what the story-writer starts from.
  • intent (string, required) — Why the video exists: what feeling or argument it's supposed to land in the viewer's head.
  • duration (number, optional, default 95) — Target duration in seconds (30–180). Will be auto-rebudgeted to fit narration.
  • aspectRatio (enum, optional, default 16:9) — Which aspect ratio to render in.
  • emphasis (string[], optional) — 3–5 key beats or talking points to make sure the script hits.
  • avoid (string[], optional) — Framings, tones, or phrasings to stay away from (e.g. 'doomer framing', 'buzzwords').
  • references (string[], optional) — Reference videos or articles the story-writer can pull from.
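The fields above can be expressed as a typed brief. A minimal sketch, assuming the field names map directly onto a TypeScript interface and that validation simply enforces the required fields and the 30–180s duration range (the real brief schema may differ):

```typescript
// Hypothetical shape of a youtube-montage brief, mirroring the inputs table.
interface MontageBrief {
  title: string;          // public-facing title (manifest + cards)
  topic: string;          // 1-2 sentences the story-writer starts from
  intent: string;         // the argument the video should land
  durationSec?: number;   // target 30-180, default 95 (auto-rebudgeted later)
  aspectRatio?: "16:9" | "9:16" | "1:1" | "4:5"; // default "16:9"
  emphasis?: string[];    // 3-5 beats the script must hit
  avoid?: string[];       // framings/tones to stay away from
  references?: string[];  // optional source videos/articles
}

// Illustrative validation of the table's rules; returns a list of problems.
function validateBrief(b: MontageBrief): string[] {
  const errors: string[] = [];
  if (!b.title.trim()) errors.push("title is required");
  if (!b.topic.trim()) errors.push("topic is required");
  if (!b.intent.trim()) errors.push("intent is required");
  const d = b.durationSec ?? 95;
  if (d < 30 || d > 180) errors.push("duration must be 30-180s");
  return errors;
}
```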

Pipeline (what happens when you run it)

  1. Source YouTube clips (~3m 0s)

    For each body scene: yt-dlp search 12 candidates → source-quality filter (blacklist + relevance ≥20%) → rank by quality + relevance + trust → try top results in order → download → watermark check → use clip or fall through.

    Tools: yt-dlp · source-quality-filter · watermark-ocr
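The filter-then-rank part of this step can be sketched as a pure function. An illustrative TypeScript version, assuming candidate scores are normalized to 0–1 and weighted equally (the actual weighting in source-quality.ts may differ):

```typescript
// Hypothetical candidate metadata for one scene's search results.
interface Candidate { id: string; quality: number; relevance: number; trust: number; }

// Drop candidates below the 20% lexical-relevance threshold, then order by a
// combined quality + relevance + trust score, best first. The pipeline then
// tries results in this order until one survives download + watermark check.
function rankCandidates(cands: Candidate[]): Candidate[] {
  return cands
    .filter(c => c.relevance >= 0.2)
    .sort((a, b) =>
      (b.quality + b.relevance + b.trust) -
      (a.quality + a.relevance + a.trust));
}
```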
  2. Generate single-pass narration (~25s)

    Concatenate all scene narration (plain period+space separators, no ellipses) → ONE ElevenLabs /with-timestamps call → decode base64 audio → ffmpeg atempo=0.85 → cut into per-scene MP3s using character alignment → persist alignment-slices.json for caption sync.

    Tools: elevenlabs-with-timestamps · ffmpeg-atempo · ffmpeg-cut
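The per-scene cut points fall out of the character alignment: /with-timestamps returns a start time per character of the combined text, so scene boundaries can be computed from the scene texts' lengths. A minimal sketch, assuming each scene's text ends with its period and scenes were joined with a single space (the exact joining convention is an assumption):

```typescript
// Compute [start, end] seconds for each scene inside the combined narration.
// charStartTimes[i] = start time of character i in the combined audio.
function sliceScenes(
  sceneTexts: string[],
  charStartTimes: number[],
  totalDurSec: number,
): { startSec: number; endSec: number }[] {
  const slices: { startSec: number; endSec: number }[] = [];
  let cursor = 0; // character index into the combined text
  for (const text of sceneTexts) {
    const startSec = charStartTimes[cursor];
    cursor += text.length;
    const endSec = cursor < charStartTimes.length ? charStartTimes[cursor] : totalDurSec;
    slices.push({ startSec, endSec });
    cursor += 1; // skip the joining space before the next scene
  }
  return slices;
}
```

Worth noting: atempo=0.85 slows playback, so the raw alignment times stretch by 1/0.85 and must be scaled the same way before cutting the tempo-adjusted file.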
  3. Auto-rebudget scenes to actual narration (~1s)

    Measure each voiceover file's actual duration, extend any scene whose narration overflows its pre-budgeted slot (+0.5s breathing room), and recompute the contiguous startSec cursor. Total video duration grows from the target (this is intentional).
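The rebudget pass above is a single walk over the scenes. An illustrative sketch, with field names assumed rather than taken from the actual script schema:

```typescript
// One scene's pre-budgeted slot and its measured narration length.
interface SceneBudget { budgetSec: number; voDurationSec: number; }

// Extend any scene whose narration overflows its slot (+0.5s breathing room),
// then recompute the contiguous startSec cursor. Scenes are only ever
// extended, never shortened, so the total can only grow past the target.
function rebudget(scenes: SceneBudget[]): { startSec: number; durationSec: number }[] {
  let cursor = 0;
  return scenes.map(s => {
    const needed = s.voDurationSec + 0.5;              // breathing room
    const durationSec = Math.max(s.budgetSec, needed); // extend only
    const out = { startSec: cursor, durationSec };
    cursor += durationSec;
    return out;
  });
}
```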

  4. Preflight audio review (~1s)

    Build the audio timeline from script + VO files (no rendering). Refuse to proceed if two narrations overlap, a narration exceeds its scene, or source-audio clips overlap. NON-BYPASSABLE.

    Tools: preflight-review
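The first two refusal conditions can be sketched as a sort-and-scan over the narration timeline. A minimal illustration, not the actual preflight.ts (span shape and messages are assumptions; the source-audio overlap check would follow the same pattern):

```typescript
// One narration span on the timeline, plus the end of the scene it lives in.
interface VoSpan { startSec: number; endSec: number; sceneEndSec: number; }

// Returns a list of problems; a non-empty list means refuse to render.
function preflightAudio(spans: VoSpan[]): string[] {
  const issues: string[] = [];
  const sorted = [...spans].sort((a, b) => a.startSec - b.startSec);
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].endSec > sorted[i].sceneEndSec)
      issues.push(`narration ${i} exceeds its scene`);
    if (i > 0 && sorted[i].startSec < sorted[i - 1].endSec)
      issues.push(`narrations ${i - 1} and ${i} overlap`);
  }
  return issues;
}
```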
  5. Render with Remotion (~4m 40s)

    Invoke `npx remotion render` with inputProps = {script, clipPaths, voiceoverPaths, alignmentSlices}. The composition dispatches per-scene renderers, wraps every Audio in a duration-bounded Sequence, and drives captions from alignment-slices.

    Tools: remotion-render
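One way the inputProps could reach the CLI is via the `--props` flag. An illustrative sketch — the composition id "YoutubeMontage" and the flag-based delivery are assumptions; the real pipeline may pass props through a file instead:

```typescript
// Serialize the render invocation as argv for `npx <args...>`.
function buildRenderArgs(
  props: { script: object; clipPaths: string[]; voiceoverPaths: string[]; alignmentSlices: object },
  outPath: string,
): string[] {
  return [
    "remotion", "render",
    "YoutubeMontage",                   // hypothetical composition id
    outPath,
    `--props=${JSON.stringify(props)}`, // inputProps reach the composition as JSON
  ];
}
// e.g. spawnSync("npx", buildRenderArgs(props, "out/montage.mp4"), { stdio: "inherit" })
```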
  6. Deterministic post-render review (~8s)

    Extract sample frames, compute audio metrics, run all deterministic rubric checks (black frames, audio clipping, fps, resolution, duration), and flag critical issues.

    Tools: deterministic-review
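As a flavor of what one such deterministic check looks like, here is a sketch of a black-frame detector over sampled frames. The luma threshold of 16 is an assumption, not the value used in deterministic-checks.ts:

```typescript
// Given the mean luma (0-255) of each sampled frame, return the indices of
// frames dark enough to count as "black". A non-empty result would be flagged
// as a critical issue by the review step.
function blackFrameIndices(frameMeanLuma: number[], threshold = 16): number[] {
  return frameMeanLuma
    .map((luma, i) => (luma < threshold ? i : -1))
    .filter(i => i >= 0);
}
```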

Voice profile (locked across this template)

Zavis (cloned)
profile id: zavis
Voice ID: Eju2qVkYu4KE2cJnwGzA
Model: eleven_multilingual_v2

Cloned voice from the Zavis reference reel. Generated in a single ElevenLabs /with-timestamps call for the entire script, then ffmpeg atempo=0.85 post-processed for cinematic pacing. All Zavis YouTube Montage videos use this voice across the board.

Voice settings

stability: 0.78
similarity_boost: 0.9
style: 0.15
use_speaker_boost: true
tempo: 0.85

Endpoint

POST https://api.elevenlabs.io/v1/text-to-speech/Eju2qVkYu4KE2cJnwGzA/with-timestamps
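Putting the endpoint and the locked settings together, the single-pass call could look like the sketch below. Body field names follow the public ElevenLabs text-to-speech API; the API key is a caller-supplied placeholder, and error handling is omitted:

```typescript
const VOICE_ID = "Eju2qVkYu4KE2cJnwGzA";

// Build one /with-timestamps request for the entire combined narration.
function buildTtsRequest(text: string, apiKey: string) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/with-timestamps`,
    method: "POST" as const,
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({
      text, // full concatenated script, period+space separated
      model_id: "eleven_multilingual_v2",
      voice_settings: {
        stability: 0.78,
        similarity_boost: 0.9,
        style: 0.15,
        use_speaker_boost: true,
      },
    }),
  };
}
```

The 0.85 tempo is not part of the request: it is applied afterwards via ffmpeg atempo.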

How this voice was cloned

Method: ElevenLabs Instant Voice Cloning (IVC) — POST /v1/voices/add

Reference: Instagram Reel

Extraction steps
  1. Downloaded the Reel video via yt-dlp
  2. Extracted the audio track with ffmpeg at 44.1kHz mono
  3. Uploaded the audio as a sample to ElevenLabs voice cloning
  4. Received voice_id Eju2qVkYu4KE2cJnwGzA
Tuning notes
  • The raw clone speaks ~15% too fast — we post-process every generation through ffmpeg atempo=0.85 (pitch-preserving) to land it at cinematic pacing.
  • Voice settings were tuned over v3 → v5: stability 0.55 → 0.78 (v4's conversational drift was causing filler pauses), style 0.40 → 0.15 (lower = fewer breath/um artifacts), similarity_boost 0.85 → 0.90 (stronger identity lock).
  • Do NOT lower stability or raise style without reading the v4 failure notes in the Playbook.
  • The combined narration text is sent with plain period+space scene separators — NEVER ellipses, which caused the v4 narration stutter bug.
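The separator rule above is small enough to pin down in code. A sketch of a joiner that enforces it — hypothetical helper, assuming scene texts may arrive with or without trailing punctuation:

```typescript
// Join per-scene narration with a plain period + space. Trailing periods and
// ellipses are stripped first so the combined text can never contain "...",
// which triggered the v4 narration stutter bug.
function joinNarration(scenes: string[]): string {
  return scenes
    .map(s => s.trim().replace(/\.*$/, ""))
    .join(". ") + ".";
}
```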

Voice samples

Full canonical narration (combined, post-atempo)
89.7s
The entire 'Evolution of AI' narration generated in one ElevenLabs call and tempo-adjusted. This is exactly what you hear when you play the canonical sample video.
Hook alone — 'Machines could think'
4.0s
The 5-second hook beat, cut from the single-pass audio via alignment timestamps.
Closing beat — 'Infrastructure'
8.1s
The reflective landing beat before the CTA.

Skills loaded (in order)


Tools it uses

  • yt-dlp (python3 -m yt_dlp): Search, transcript, download, trim YouTube clips.
  • source-quality-filter (packages/pipeline/src/youtube/source-quality.ts): Blacklist stock-footage aggregators, boost trusted sources, reject off-topic results via lexical relevance.
  • watermark-ocr (packages/pipeline/src/youtube/watermark-check.ts): Post-download OCR check for persistent on-screen text (tesseract; graceful skip if missing).
  • elevenlabs-with-timestamps (POST /v1/text-to-speech/{voice_id}/with-timestamps): Single-pass narration generation with character-level alignment data.
  • ffmpeg-atempo (/opt/homebrew/bin/ffmpeg): Pitch-preserving tempo adjustment (0.85x) applied to narration post-generation.
  • ffmpeg-cut (/opt/homebrew/bin/ffmpeg): Slice combined narration into per-scene MP3s at exact alignment timestamps.
  • remotion-render (npx remotion render): Final composition render (H.264 MP4).
  • preflight-review (packages/pipeline/src/review/preflight.ts): NON-BYPASSABLE audio timeline check (no VO overlaps, no scene overflow).
  • deterministic-review (packages/pipeline/src/review/deterministic-checks.ts): Post-render checks: black frames, audio clipping, fps, resolution, duration match.

Review rubric (template-specific)

  • Every source-audio clip's keyword is actually heard at the trim point
  • Every muted b-roll clip's visual is on-topic for that beat (Tier 1 entities are recognizable, Tier 2 concepts are atmospherically aligned)
  • No clip is longer than the scene it's placed in
  • Title cards are not held longer than 4 seconds
  • The narrator never starts speaking during a source-audio clip
  • Captions land within 100ms of the spoken word (alignment-driven)
  • End card shows exactly one Zavis wordmark
  • No visible watermarks, channel bugs, or 'click to download' banners

Past generations

Run 20260411-080631 · The Evolution of AI · 4/11/2026 · 108.7s · 16 scenes · approved