Zavis Video OS

YouTube Montage

v1.1.0

Real YouTube footage, single-pass cloned narration, frame-synced captions.

A cinematic storytelling video built from sourced YouTube clips, intercut with title cards and narration. Best for evolution stories, retrospectives, 'state of X' explainers, founder stories, and any editorial piece where the visual layer comes from public sources.

Read the Playbook →
16:9 · 9:16 · 1:1 · 4:5 · 30–180s (default 95s) @ 30fps · narrator: Zavis (cloned)
Canonical sample · The Evolution of AI
Run 20260411-080631 · approved · 0 critical issues
duration 108.7s · scenes 16 · clips 12 · vo clips 14 · 11labs chars 1,321 · render 4m 41s

Good for

  • Evolution / history of an industry or technology
  • State-of-the-field recap videos
  • Retrospective montages tied to a thesis
  • Founder / company origin stories built from archival footage
  • News-moment explainers with a clear editorial angle

Not this template

  • Product demos (use product-spotlight)
  • Original-shoot brand films
  • Talking-head research reports (use research-talkinghead)
  • Under-30-second social teasers

Inputs (what the brief needs)

  • title (string, required) — Public-facing title of the video (used in manifest + cards).
  • topic (string, required) — 1–2 sentences: what the video is about. This is what the story-writer starts from.
  • intent (string, required) — Why the video exists: what feeling or argument it's supposed to land in the viewer's head.
  • duration (number, optional, default 95) — Target duration in seconds (30–180). Will be auto-rebudgeted to fit narration.
  • aspectRatio (enum, optional, default 16:9) — Which aspect ratio to render in.
  • emphasis (string[], optional) — 3–5 key beats or talking points to make sure the script hits.
  • avoid (string[], optional) — Framings, tones, or phrasings to stay away from (e.g. 'doomer framing', 'buzzwords').
  • references (string[], optional) — Reference videos or articles the story-writer can pull from.
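The fields above can be expressed as a typed brief. A minimal sketch, assuming the field names map directly onto a TypeScript interface and that validation simply enforces the required fields and the 30–180s duration range (the real brief schema may differ):

```typescript
// Hypothetical shape of a youtube-montage brief, mirroring the inputs table.
interface MontageBrief {
  title: string;          // public-facing title (manifest + cards)
  topic: string;          // 1-2 sentences the story-writer starts from
  intent: string;         // the argument the video should land
  durationSec?: number;   // target 30-180, default 95 (auto-rebudgeted later)
  aspectRatio?: "16:9" | "9:16" | "1:1" | "4:5"; // default "16:9"
  emphasis?: string[];    // 3-5 beats the script must hit
  avoid?: string[];       // framings/tones to stay away from
  references?: string[];  // optional source videos/articles
}

// Illustrative validation of the table's rules; returns a list of problems.
function validateBrief(b: MontageBrief): string[] {
  const errors: string[] = [];
  if (!b.title.trim()) errors.push("title is required");
  if (!b.topic.trim()) errors.push("topic is required");
  if (!b.intent.trim()) errors.push("intent is required");
  const d = b.durationSec ?? 95;
  if (d < 30 || d > 180) errors.push("duration must be 30-180s");
  return errors;
}
```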

Pipeline (what happens when you run it)

  1. Source YouTube clips (~3m 0s)

    For each body scene: yt-dlp search 12 candidates → source-quality filter (blacklist + relevance ≥20%) → rank by quality + relevance + trust → try top results in order → download → watermark check → use clip or fall through.

    Tools: yt-dlp · source-quality-filter · watermark-ocr
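The filter-then-rank part of this step can be sketched as a pure function. An illustrative TypeScript version, assuming candidate scores are normalized to 0–1 and weighted equally (the actual weighting in source-quality.ts may differ):

```typescript
// Hypothetical candidate metadata for one scene's search results.
interface Candidate { id: string; quality: number; relevance: number; trust: number; }

// Drop candidates below the 20% lexical-relevance threshold, then order by a
// combined quality + relevance + trust score, best first. The pipeline then
// tries results in this order until one survives download + watermark check.
function rankCandidates(cands: Candidate[]): Candidate[] {
  return cands
    .filter(c => c.relevance >= 0.2)
    .sort((a, b) =>
      (b.quality + b.relevance + b.trust) -
      (a.quality + a.relevance + a.trust));
}
```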
  2. Generate single-pass narration (~25s)

    Concatenate all scene narration (plain period+space separators, no ellipses) → ONE ElevenLabs /with-timestamps call → decode base64 audio → ffmpeg atempo=0.85 → cut into per-scene MP3s using character alignment → persist alignment-slices.json for caption sync.

    Tools: elevenlabs-with-timestamps · ffmpeg-atempo · ffmpeg-cut
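The per-scene cut points fall out of the character alignment: /with-timestamps returns a start time per character of the combined text, so scene boundaries can be computed from the scene texts' lengths. A minimal sketch, assuming each scene's text ends with its period and scenes were joined with a single space (the exact joining convention is an assumption):

```typescript
// Compute [start, end] seconds for each scene inside the combined narration.
// charStartTimes[i] = start time of character i in the combined audio.
function sliceScenes(
  sceneTexts: string[],
  charStartTimes: number[],
  totalDurSec: number,
): { startSec: number; endSec: number }[] {
  const slices: { startSec: number; endSec: number }[] = [];
  let cursor = 0; // character index into the combined text
  for (const text of sceneTexts) {
    const startSec = charStartTimes[cursor];
    cursor += text.length;
    const endSec = cursor < charStartTimes.length ? charStartTimes[cursor] : totalDurSec;
    slices.push({ startSec, endSec });
    cursor += 1; // skip the joining space before the next scene
  }
  return slices;
}
```

Worth noting: atempo=0.85 slows playback, so the raw alignment times stretch by 1/0.85 and must be scaled the same way before cutting the tempo-adjusted file.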
  3. Auto-rebudget scenes to actual narration (~1s)

    Measure each voiceover file's actual duration, extend any scene whose narration overflows its pre-budgeted slot (+0.5s breathing room), and recompute the contiguous startSec cursor. Total video duration grows from the target (this is intentional).
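The rebudget pass above is a single walk over the scenes. An illustrative sketch, with field names assumed rather than taken from the actual script schema:

```typescript
// One scene's pre-budgeted slot and its measured narration length.
interface SceneBudget { budgetSec: number; voDurationSec: number; }

// Extend any scene whose narration overflows its slot (+0.5s breathing room),
// then recompute the contiguous startSec cursor. Scenes are only ever
// extended, never shortened, so the total can only grow past the target.
function rebudget(scenes: SceneBudget[]): { startSec: number; durationSec: number }[] {
  let cursor = 0;
  return scenes.map(s => {
    const needed = s.voDurationSec + 0.5;              // breathing room
    const durationSec = Math.max(s.budgetSec, needed); // extend only
    const out = { startSec: cursor, durationSec };
    cursor += durationSec;
    return out;
  });
}
```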

  4. Preflight audio review (~1s)

    Build the audio timeline from script + VO files (no rendering). Refuse to proceed if two narrations overlap, a narration exceeds its scene, or source-audio clips overlap. NON-BYPASSABLE.

    Tools: preflight-review
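The first two refusal conditions can be sketched as a sort-and-scan over the narration timeline. A minimal illustration, not the actual preflight.ts (span shape and messages are assumptions; the source-audio overlap check would follow the same pattern):

```typescript
// One narration span on the timeline, plus the end of the scene it lives in.
interface VoSpan { startSec: number; endSec: number; sceneEndSec: number; }

// Returns a list of problems; a non-empty list means refuse to render.
function preflightAudio(spans: VoSpan[]): string[] {
  const issues: string[] = [];
  const sorted = [...spans].sort((a, b) => a.startSec - b.startSec);
  for (let i = 0; i < sorted.length; i++) {
    if (sorted[i].endSec > sorted[i].sceneEndSec)
      issues.push(`narration ${i} exceeds its scene`);
    if (i > 0 && sorted[i].startSec < sorted[i - 1].endSec)
      issues.push(`narrations ${i - 1} and ${i} overlap`);
  }
  return issues;
}
```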
  5. Render with Remotion (~4m 40s)

    Invoke `npx remotion render` with inputProps = {script, clipPaths, voiceoverPaths, alignmentSlices}. The composition dispatches per-scene renderers, wraps every Audio in a duration-bounded Sequence, and drives captions from alignment-slices.

    Tools: remotion-render
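One way the inputProps could reach the CLI is via the `--props` flag. An illustrative sketch — the composition id "YoutubeMontage" and the flag-based delivery are assumptions; the real pipeline may pass props through a file instead:

```typescript
// Serialize the render invocation as argv for `npx <args...>`.
function buildRenderArgs(
  props: { script: object; clipPaths: string[]; voiceoverPaths: string[]; alignmentSlices: object },
  outPath: string,
): string[] {
  return [
    "remotion", "render",
    "YoutubeMontage",                   // hypothetical composition id
    outPath,
    `--props=${JSON.stringify(props)}`, // inputProps reach the composition as JSON
  ];
}
// e.g. spawnSync("npx", buildRenderArgs(props, "out/montage.mp4"), { stdio: "inherit" })
```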
  6. Deterministic post-render review (~8s)

    Extract sample frames, compute audio metrics, run all deterministic rubric checks (black frames, audio clipping, fps, resolution, duration), and flag critical issues.

    Tools: deterministic-review
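As a flavor of what one such deterministic check looks like, here is a sketch of a black-frame detector over sampled frames. The luma threshold of 16 is an assumption, not the value used in deterministic-checks.ts:

```typescript
// Given the mean luma (0-255) of each sampled frame, return the indices of
// frames dark enough to count as "black". A non-empty result would be flagged
// as a critical issue by the review step.
function blackFrameIndices(frameMeanLuma: number[], threshold = 16): number[] {
  return frameMeanLuma
    .map((luma, i) => (luma < threshold ? i : -1))
    .filter(i => i >= 0);
}
```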

Voice profile (locked across this template)

Zavis (cloned)
profile id: zavis
Voice ID: Eju2qVkYu4KE2cJnwGzA
Model: eleven_multilingual_v2

Cloned voice from the Zavis reference reel. Generated in a single ElevenLabs /with-timestamps call for the entire script, then ffmpeg atempo=0.85 post-processed for cinematic pacing. All Zavis YouTube Montage videos use this voice across the board.

Voice settings

stability: 0.78
similarity_boost: 0.9
style: 0.15
use_speaker_boost: true
tempo: 0.85

Endpoint

POST https://api.elevenlabs.io/v1/text-to-speech/Eju2qVkYu4KE2cJnwGzA/with-timestamps
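Putting the endpoint and the locked settings together, the single-pass call could look like the sketch below. Body field names follow the public ElevenLabs text-to-speech API; the API key is a caller-supplied placeholder, and error handling is omitted:

```typescript
const VOICE_ID = "Eju2qVkYu4KE2cJnwGzA";

// Build one /with-timestamps request for the entire combined narration.
function buildTtsRequest(text: string, apiKey: string) {
  return {
    url: `https://api.elevenlabs.io/v1/text-to-speech/${VOICE_ID}/with-timestamps`,
    method: "POST" as const,
    headers: { "xi-api-key": apiKey, "Content-Type": "application/json" },
    body: JSON.stringify({
      text, // full concatenated script, period+space separated
      model_id: "eleven_multilingual_v2",
      voice_settings: {
        stability: 0.78,
        similarity_boost: 0.9,
        style: 0.15,
        use_speaker_boost: true,
      },
    }),
  };
}
```

The 0.85 tempo is not part of the request: it is applied afterwards via ffmpeg atempo.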

How this voice was cloned

Method: ElevenLabs Instant Voice Cloning (IVC) — POST /v1/voices/add

Reference: Instagram Reel

Extraction steps
  1. Downloaded the Reel video via yt-dlp
  2. Extracted the audio track with ffmpeg at 44.1kHz mono
  3. Uploaded the audio as a sample to ElevenLabs voice cloning
  4. Received voice_id Eju2qVkYu4KE2cJnwGzA
Tuning notes
  • The raw clone speaks ~15% too fast — we post-process every generation through ffmpeg atempo=0.85 (pitch-preserving) to land it at cinematic pacing.
  • Voice settings were tuned over v3 → v5: stability 0.55 → 0.78 (v4's conversational drift was causing filler pauses), style 0.40 → 0.15 (lower = fewer breath/um artifacts), similarity_boost 0.85 → 0.90 (stronger identity lock).
  • Do NOT lower stability or raise style without reading the v4 failure notes in the Playbook.
  • The combined narration text is sent with plain period+space scene separators — NEVER ellipses, which caused the v4 narration stutter bug.
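The separator rule above is small enough to pin down in code. A sketch of a joiner that enforces it — hypothetical helper, assuming scene texts may arrive with or without trailing punctuation:

```typescript
// Join per-scene narration with a plain period + space. Trailing periods and
// ellipses are stripped first so the combined text can never contain "...",
// which triggered the v4 narration stutter bug.
function joinNarration(scenes: string[]): string {
  return scenes
    .map(s => s.trim().replace(/\.*$/, ""))
    .join(". ") + ".";
}
```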

Voice samples

Full canonical narration (combined, post-atempo)
89.7s
The entire 'Evolution of AI' narration generated in one ElevenLabs call and tempo-adjusted. This is exactly what you hear when you play the canonical sample video.
Hook alone — 'Machines could think'
4.0s
The 5-second hook beat, cut from the single-pass audio via alignment timestamps.
Closing beat — 'Infrastructure'
8.1s
The reflective landing beat before the CTA.

Skills loaded (in order)


Tools it uses

  • yt-dlp (python3 -m yt_dlp): Search, transcript, download, trim YouTube clips.
  • source-quality-filter (packages/pipeline/src/youtube/source-quality.ts): Blacklist stock-footage aggregators, boost trusted sources, reject off-topic results via lexical relevance.
  • watermark-ocr (packages/pipeline/src/youtube/watermark-check.ts): Post-download OCR check for persistent on-screen text (tesseract; graceful skip if missing).
  • elevenlabs-with-timestamps (POST /v1/text-to-speech/{voice_id}/with-timestamps): Single-pass narration generation with character-level alignment data.
  • ffmpeg-atempo (/opt/homebrew/bin/ffmpeg): Pitch-preserving tempo adjustment (0.85x) applied to narration post-generation.
  • ffmpeg-cut (/opt/homebrew/bin/ffmpeg): Slice combined narration into per-scene MP3s at exact alignment timestamps.
  • remotion-render (npx remotion render): Final composition render (H.264 MP4).
  • preflight-review (packages/pipeline/src/review/preflight.ts): NON-BYPASSABLE audio timeline check (no VO overlaps, no scene overflow).
  • deterministic-review (packages/pipeline/src/review/deterministic-checks.ts): Post-render checks: black frames, audio clipping, fps, resolution, duration match.

Review rubric (template-specific)

  • Every source-audio clip's keyword is actually heard at the trim point
  • Every muted b-roll clip's visual is on-topic for that beat (Tier 1 entities are recognizable, Tier 2 concepts are atmospherically aligned)
  • No clip is longer than the scene it's placed in
  • Title cards are not held longer than 4 seconds
  • The narrator never starts speaking during a source-audio clip
  • Captions land within 100ms of the spoken word (alignment-driven)
  • End card shows exactly one Zavis wordmark
  • No visible watermarks, channel bugs, or 'click to download' banners

Past generations

Run 20260411-080631 · The Evolution of AI · 4/11/2026 · 108.7s · 16 scenes · approved