Zavis Video OS

The YouTube Montage Playbook

Everything that actually goes into making a video that feels right.
The first-principles thinking, the prompt craft, the Zavis integration rule,
the sourcing approach, the failure modes, and a full case study.


If you are a team member (human or agent) trying to reproduce the quality
of the canonical sample (generations/20260411-080631), read this end-to-end
before writing a single line of script.

How to use this template via the MCP (the normal path)

The Zavis Video OS MCP is a pure tool layer. It does NOT write scripts for you. When you're inside Claude Code and the user asks for a video, you (the calling Claude) write the ScriptDocument in your own reasoning, after loading this playbook and the relevant skills as context.

The 10-step flow you execute end-to-end:

1.  list_templates                     → confirm youtube-montage exists
2.  get_template("youtube-montage")    → read the manifest (voice profile,
                                          duration bounds, aspect ratios, rubric)
3.  get_template_playbook("youtube-montage")
                                        → load THIS file into working context
4.  get_skill("zavis-prompt-craft")    → load the prompt discipline
    get_skill("zavis-story-writer")    → load beat-sheet/tension-curve rules
    get_skill("zavis-hook-engineer")   → load first-3-seconds rules
    get_skill("zavis-narration-director")  → no ellipses, full sentences, pace
    get_skill("zavis-clip-curator")    → Tier 1 (named) vs Tier 2 (concept)
    get_skill("zavis-engagement-engineer") → retention budget, CTA
5.  [You reason hard and write a full ScriptDocument JSON. Use Parts 1-7 of
     this playbook as the authoritative reference. Extract the argument first,
     then the tension curve, then the beats, then the narration, then the
     Tier 1/Tier 2 queries.]
6.  validate_script(templateId, script) → returns {valid, errors?, summary?}
7.  [If invalid, fix the errors listed and re-validate. If valid, optionally
     show the summary to the user for approval.]
8.  run_video_pipeline(templateId, script) → returns {runId, frontendUrl}
9.  get_run_status(runId)               → poll every 30s, report progress
10. get_generation(runId)               → deliver the final video URL

Key: the validate_script tool tells you EVERYTHING about whether your script is valid — zod path errors, contiguity issues, energyArc length mismatches, pacing hints. Always call it before run_video_pipeline. It's a dry-run that costs nothing.

run_video_pipeline is the commit point. After you call it, the orchestrator is spawned detached, starts pulling YouTube clips, generates narration, renders, and reviews. If something goes wrong downstream (e.g., a clip can't be sourced), the review will flag it and the run status becomes iterating. You can then re-write the relevant scenes and run again.
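
If you want the shape of steps 6 through 10 as code, here is a minimal sketch. The tool names are the real ones from the flow above; callTool and reviseScript are hypothetical helpers, and the run-status field names are assumptions (the text above only names the "iterating" state).

    // Hypothetical helpers: callTool wraps your MCP client, reviseScript is your
    // own fix-the-errors reasoning. Status state names below are assumptions.
    declare function callTool(name: string, args: Record<string, unknown>): Promise<any>;
    declare function reviseScript(script: object, errors: unknown): Promise<object>;

    async function produceVideo(templateId: string, script: object) {
      // Steps 6-7: free dry-run validation, fix and retry until valid.
      let check = await callTool("validate_script", { templateId, script });
      while (!check.valid) {
        script = await reviseScript(script, check.errors);
        check = await callTool("validate_script", { templateId, script });
      }

      // Step 8: the commit point. The orchestrator runs detached from here on.
      const { runId, frontendUrl } = await callTool("run_video_pipeline", { templateId, script });

      // Step 9: poll every 30 seconds and report progress.
      let status = await callTool("get_run_status", { runId });
      while (status.state !== "complete" && status.state !== "failed") {
        await new Promise((resolve) => setTimeout(resolve, 30_000));
        status = await callTool("get_run_status", { runId });
      }

      // Step 10: deliver the final video URL.
      return { frontendUrl, generation: await callTool("get_generation", { runId }) };
    }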


Part 1 — First principles

1. A video is a two-channel signal, not a slideshow

The viewer has two bandwidth channels open simultaneously:

  • The eyes see a visual (the clip, the title card, the caption, the overlay).
  • The ears hear a signal (the narration, the music, sometimes source audio from the clip).

The worst thing you can do is use both channels to carry the same content. If the narrator says "ChatGPT launched in November 2022" while the caption says "CHATGPT LAUNCHED NOVEMBER 2022," one of those channels is wasted. The viewer's brain multiplexes two channels when they are COMPLEMENTARY, and it feels like magic. When they're redundant, it feels flat.

The rule: narration = what the viewer CAN'T see. Visual + caption = what is obvious from the shot. Narration adds context, consequence, contrast, feeling. The visual carries the literal content.

Concrete example from the Evolution of AI canonical video:

  • Scene: AlphaGo match clip (visual is obviously a Go board match)
  • Caption: "2016 — ALPHAGO"
  • Narration: "Four years later DeepMind's AlphaGo beat the world champion at a game they said computers would never master."
  • Notice: the narration never says the word "Go" — it doesn't have to, the visual carries it. The narration adds the part the visual can't: why it mattered ("never master"), the scale ("world champion"), the time vector ("four years later").

2. The viewer is deciding every 1.5 seconds whether to keep watching

Every 1.5 seconds there is a micro-moment where the viewer's brain asks "is this still worth my time?" If two of those moments miss, they're gone. That's why the hook is everything, why you never rest on a static visual for more than ~4 seconds, why you always end a scene on momentum into the next scene, and why narration sentences should fall like dominoes, not plod at a lecture's pace.

This is also why the clip-narration tier match matters so much. If the narrator says "Sam Altman" and the visual is a generic office interior, the viewer's brain registers the mismatch, and that counts as one of the missed moments.

3. The best narration comes from writing what you would say to a smart friend

Not "Today we will explore the fascinating history of artificial intelligence." That's classroom prose. A smart friend talking to you about this at a bar would say something like: "Seventy years ago a handful of people decided machines could think — and then for sixty years almost nothing happened." Same information, radically different energy.

The test: read the narration out loud at conversational pace. If you stumble or feel silly, rewrite. If it feels like something you'd actually say, it's probably right.

4. A video is a tension curve, not a timeline

The story-writer's job is to build tension, release it, build higher tension, release, land. A montage that just lists facts in order ("first this happened, then that, then that") is a list. A montage that organizes those same facts around tension is a story.

For the Evolution of AI video the tension curve is:

    │                                  ╭─── GPT-4 / ChatGPT / agents
    │                                 ╱
    │                    ╭── Hinton ╱
    │                   ╱      ╲   ╱   ← the explosion
    │      Dartmouth   ╱        ╲ ╱
    │     ╲          ╱           ╳    ← AlphaGo / Transformer
    │      ╲________╱           ╱ ╲
    │         ← winter ←      ╱   ╲──── closing reflection
    │__________________________╱____________▶ time

The dip (the AI winter) is load-bearing — without it, the rise doesn't feel like a rise, it feels like an inevitability. Tension needs contrast.

5. Brand integration is an OUTCOME, not an INTRUSION

This is the Zavis rule. The worst video has the narrator saying "and that's why we made Zavis" as a bolt-on at the end, like an ad break. The best video has Zavis emerge from the story as the natural next beat — the listener thinks "oh, of course" when the logo lands, not "oh, commercial."

In the canonical sample, the final beats are:

  1. "Seventy years of research. Five years of explosion. What was science fiction is now infrastructure."
  2. "What happens next is being built by the people who use this every day."
  3. [end card] "The story has barely begun" / "zavis is what happens next"

The Zavis connection is made by the STORY ARGUMENT, not by the narration explicitly saying "Zavis does X." The video is about the inevitability of who builds what comes next. Zavis is asserted as one of them — not because the video says so, but because the story's logic makes it feel that way.

This is the non-negotiable brand integration rule: Zavis lands BECAUSE of the argument, not IN SPITE of it.

6. The best clips are either iconic or atmospheric — never generic

A specific clip of Lee Sedol playing AlphaGo is iconic. A generic "neural network visualization" clip is garbage. A snow-covered empty street as a metaphor for the AI winter is atmospheric. A stock-footage-aggregator clip with "Click Here" embedded in the corner is garbage. There's no middle ground. If you can't tell what a clip is trying to say in the first 1.5 seconds, it's garbage.


Part 2 — How to craft the script prompt

Below is the actual prompt structure Claude should use when starting from scratch. Adapt it to the specific topic.

Step 1 — Extract the core argument (not the topic)

Ask: "What is the argument this video makes about the topic?"

The topic is "the evolution of AI." The argument is "It took seventy years of quiet work and then five explosive ones — and we're still at the beginning. The people building what comes next are already here."

The topic is trivia. The argument is the video. Everything else — beat selection, pacing, clip choice, narration phrasing, end card — serves the argument.

If you can't state the argument in one sentence, the script will feel aimless. Stop and iterate until you can.

Step 2 — Build the tension curve as a beat sheet

Write 12-18 single-line beats in tension order, not time order. Example for Evolution of AI:

Hook: "Seventy years ago, a handful of people decided machines could think."
A1: Dartmouth 1956 (the idea)
A2: The AI winter (the dip — 60 years of nothing)
A3: Hinton + AlexNet 2012 (the spark)
A4: DeepMind AlphaGo 2016 (the proof point, still abstract)
A5: Transformer paper 2017 (the architectural turn)
A6: OpenAI scales it: GPT 1, 2, 3 (the ramp)
A7: ChatGPT Nov 2022 (the inflection)
A8: 100 million users (the scale)
A9: GPT-4 multimodal (the breadth)
A10: Midjourney / DALL-E (the creative side)
A11: Agents (the doing, not just answering)
Closing: "Seventy years of research, five years of explosion."
Land: "What happens next is being built by the people who use this every day."
End card: "The story has barely begun" / zavis wordmark

Each beat should imply a visual AND imply a narration line. If you can't picture the visual OR you can't hear the narration, the beat isn't clear yet.

Step 3 — Write the hook last, not first

Paradox: the hook is the most important line, but you should write it LAST. Writing it first means you'll over-polish it before you know what the video actually is. Write the body beats first, then the landing, then go back and write the hook so it rhymes with the landing.

Good hooks for this template are:

  • A specific time marker that feels impossible (Seventy years ago…)
  • A specific number that lands hard (100 million people in two months.)
  • A contrarian or counterintuitive frame (They called it the AI winter.)
  • A cold-open question (What if the future wasn't new?)

Bad hooks:

  • "Today we're going to talk about…" (classroom voice)
  • "Have you ever wondered…" (cliché)
  • "Let's dive in…" (sounds like a podcast intro)

The test: does the hook make the next sentence inevitable? If yes, keep it. If the next sentence could be anything, rewrite.

Step 4 — For each beat, write the narration as full sentences, not fragments

v4's narration was written like: "First came. G P T one. Then three. Each one closer to language itself." and the cloned voice stuttered at every period. v5 rewrote it as full sentences with intentional fragment accents only for list rhythm:

A small lab in San Francisco called OpenAI started scaling it up.
GPT one. GPT two. GPT three.

The first sentence is a complete thought. The three fragments that follow are a RHYTHMIC LIST — three beats that ride the momentum of the first sentence. This is the only context where period-splits are allowed.

Rule of thumb: if a fragment stands alone with a period, it must be part of a list OR it must be the most important sentence in the beat. If it's just pacing, use commas or rewrite.

Step 5 — For each beat, write the visual query with the two-tier strategy

For every beat, ask: "Is this a named entity the viewer will expect to recognize?"

  • Yes → Tier 1. Query = entity name + disambiguator (year, event, "interview", "launch", "keynote", "demo", "press conference"). Target real news/launch footage.
  • No → Tier 2. Query = shot description (subject + action + composition + lighting/mood). Target atmospheric b-roll.

Write the query, then IMAGINE the first YouTube result. If it would be a lecture, a talking head about the topic, or generic noise, rewrite. If it would be actual footage of the thing, keep it.
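
As a concrete side-by-side (the object shape is illustrative only, not the ScriptDocument schema), the two tiers look like this, using queries from the canonical video:

    // Illustrative objects only; these are not the real field names.
    const tier1Scene = {
      tier: 1,
      // entity name + year + event word, aimed at real news/launch footage
      query: "DeepMind AlphaGo Lee Sedol match 2016 press conference",
    };

    const tier2Scene = {
      tier: 2,
      // subject + composition + mood, kept loose so the relevance gate can match
      query: "snow winter empty street cinematic slow",
    };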

Step 6 — Validate with the parallel-narrative check

For each beat, place the narration next to the caption + the visual. Ask: "If I removed the narration, would the viewer still understand what this beat IS?"

  • If yes (visual + caption carry the content) → narration is free to add context/consequence/contrast. Good.
  • If no (visual + caption don't say what's happening) → narration is doing the heavy lifting AND the visual isn't earning its place. Rework the beat.

Concrete example:

  • Visual: empty snow-covered street
  • Caption: "THE AI WINTER"
  • Narration: "For most of the next sixty years almost nothing happened. The hype died, the funding froze, and they called it the AI winter."

With the narration removed, the viewer would still get "AI winter = cold/empty time" from the visual + caption. So the narration is free to add the specifics (sixty years, hype, funding, naming). This is how the two channels work together instead of stepping on each other.


Part 3 — How to integrate Zavis (the brand landing rule)

The three-beat integration

Zavis should enter the narrative in exactly three places, and nowhere else:

  1. Beat N-2 (the closing reflection) — articulates the state of play: "Seventy years of research. Five years of explosion. What was science fiction is now infrastructure." This beat doesn't mention Zavis but sets up the logical space Zavis occupies.
  2. Beat N-1 (the CTA / integrated tagline) — makes an argument about who builds what comes next: "What happens next is being built by the people who use this every day." This beat doesn't say "Zavis" either — it makes a general claim, and Zavis' presence in the next beat is the evidence.
  3. End card — the logo lands, a single integrated tagline extends the argument: "The story has barely begun / zavis is what happens next." The connection is implied, not asserted.

What NOT to do

  • Do not have the narrator say "Zavis is a [product category]" anywhere in the video. The narration never describes Zavis. The video describes the WORLD, and Zavis is shown, not defined.
  • Do not have the narrator read a feature list. Features are for product demos, not montages.
  • Do not have a logo bumper in the middle of the video.
  • Do not use the word "Zavis" in a caption except on the end card.
  • Do not render the word "ZAVIS" as a big text element anywhere — the logo asset IS the wordmark (it has a green dot). Rendering "ZAVIS" as text creates a duplicate wordmark, which the EndCard component will auto-suppress, but don't test it.

Why this works

People trust a brand that EARNS its place in a story. They tune out a brand that INTERRUPTS a story. Every time we've violated the three-beat rule in internal tests, engagement dropped. Every time we've respected it, the "this felt like something real" comments went up.


Part 4 — How to source clips (the full strategy)

The sourcing pipeline

For every youtube-clip scene in the script, the orchestrator does this:

search (12 candidates)
    │
    ▼
blacklist filter
  - channel name contains "stock footage" / "free footage" / "no copyright"
    / "free download" / "click to download" / "pexels" / "mixkit" / etc.
  - title contains the same
    │
    ▼
relevance gate (≥20% query-token overlap)
  - tokenize query (stop-word filtered)
  - tokenize title + channel (same)
  - compute overlap as matched_query_tokens / total_query_tokens
  - reject if < 20%
    │
    ▼
score = log10(views) + relevance * 15 + trusted_source_boost + duration_sanity
    │
    ▼
rank desc, try top results in order
    │
    ▼
download top candidate
    │
    ▼
OCR watermark check (tesseract, graceful skip)
  - extract 4 frames
  - OCR each
  - reject if any word appears in ≥60% of frames
    │
    ▼
use clip OR try next candidate OR fall through
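
A minimal sketch of the relevance gate and the scoring step, for orientation only. The stop-word list, the trust boost, and the duration-sanity term are stand-ins; the real implementation is packages/pipeline/src/youtube/source-quality.ts.

    // Sketch only. Stop words, boost and duration-sanity values are stand-ins.
    type Candidate = { title: string; channel: string; views: number };

    const STOP_WORDS = new Set(["the", "a", "an", "of", "and", "in", "to", "for"]);

    function tokens(text: string): string[] {
      return text.toLowerCase().split(/\W+/).filter((t) => t.length > 0 && !STOP_WORDS.has(t));
    }

    // Relevance = matched query tokens / total query tokens (stop words removed).
    function relevance(query: string, c: Candidate): number {
      const q = tokens(query);
      const have = new Set(tokens(`${c.title} ${c.channel}`));
      const matched = q.filter((t) => have.has(t)).length;
      return q.length === 0 ? 0 : matched / q.length;
    }

    // score = log10(views) + relevance * 15 + trusted_source_boost + duration_sanity
    function score(c: Candidate, rel: number, trustBoost: number, durationSanity: number): number {
      return Math.log10(Math.max(c.views, 1)) + rel * 15 + trustBoost + durationSanity;
    }

    // Gate at 20% overlap, then rank descending and try candidates in order.
    function rank(
      query: string,
      candidates: Candidate[],
      trustBoost: (c: Candidate) => number,
      durationSanity: (c: Candidate) => number,
    ): Candidate[] {
      return candidates
        .map((c) => ({ c, rel: relevance(query, c) }))
        .filter(({ rel }) => rel >= 0.2)
        .map(({ c, rel }) => ({ c, s: score(c, rel, trustBoost(c), durationSanity(c)) }))
        .sort((a, b) => b.s - a.s)
        .map(({ c }) => c);
    }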

The query craft rules

  1. Tier 1 queries target real footage. Include the entity name + a year/event disambiguator + a context word ("interview", "launch", "keynote"). Do NOT include "stock" / "b-roll" / "cinematic" — those actively filter OUT the news clips you want.
  2. Tier 2 queries target stock b-roll. Include shot composition words ("close up" / "macro" / "aerial" / "slow motion" / "time lapse") + mood words ("cinematic" / "dramatic" / "neon" / "archival"). These bias results toward shorter, clip-shaped videos.
  3. Avoid over-specific Tier 2 queries. The v5.1 failure mode: "abandoned empty computer laboratory vintage film grain dusty" — six substantive tokens, no candidate ever matches all of them, the relevance gate rejects everything. Fix: "snow winter empty street cinematic slow" — three substantive tokens, much higher match rate.
  4. Avoid abstract nouns. "Innovation", "future", "intelligence" return mostly watermarked stock slop. Use the visual thing the abstract concept LOOKS LIKE.
  5. Do not use the literal word "AI" in a Tier 2 query. It's too generic and pulls talking-head lectures. Use what AI visually LOOKS like: code editors, server racks, robotic arms, dashboards, data centers.

The trusted source list (boost)

The source-quality filter adds +15 to +40 score for these channels. If your Tier 1 query lands on one of these, you've almost certainly got a clean clip:

  • News (+20-30): BBC News, Bloomberg, WSJ, Reuters, CNBC, FT, Economist, PBS NewsHour, NBC, CBS, ABC, Sky News, DW News, Associated Press
  • Tech press (+15-20): The Verge, TechCrunch, Wired, MKBHD
  • Official (+25-40): OpenAI, DeepMind, Anthropic, Google AI, Meta AI, Microsoft Research, NVIDIA, Stanford, MIT, TED, TEDx, World Economic Forum, Lex Fridman, Y Combinator
  • Documentary (+15-20): Vox, CNBC Make It, Bloomberg Originals, NOVA PBS

If you want to add to this list, edit packages/pipeline/src/youtube/source-quality.ts.

The blacklist (hard reject)

Any channel or title containing these patterns is hard-rejected:

  • "stock footage", "stock video", "free stock", "free footage"
  • "no copyright", "copyright free", "royalty free"
  • "free download", "download link", "link in description"
  • "click to download", "click here"
  • "pexels", "pixabay", "videvo", "videezy", "mixkit", "coverr"
  • "subscribe for more", "like and subscribe"
  • "ai generated", "midjourney v", "stable diffusion v"

If a channel you want is being blacklisted, either:

  1. The channel actually is watermarked — find a better source
  2. You need to add a specific override (not supported yet; edit the file)

Part 5 — Full case study: "The Evolution of AI"

This is the actual walkthrough of how the canonical video was built. Every decision is documented.

The initial brief

topic: "The evolution of AI from its 1956 roots to the present day"
duration: 90s
aspectRatio: 16:9

That's all we started with. From this we derived everything else by applying the first principles.

Step 1 — Find the argument

The topic is "evolution of AI." The topic alone is not enough.

Applying principle 4 (a video is a tension curve): we need a dip and a rise. The history gives us the dip naturally — the AI winter, sixty years of nothing happening. That's the setup. The rise is the 2012-present explosion.

The argument becomes: "Seventy years of quiet, five years of explosion, and we're still at the beginning."

Applying principle 5 (brand integration is an outcome): the argument creates logical space for Zavis. "Still at the beginning" implies more is being built, and "the people who use this every day" implies a specific kind of builder — which Zavis is. Zavis is not asserted, it emerges.

Step 2 — Build the tension curve

Using the argument as the spine, the tension curve becomes:

(baseline) Dartmouth seed → (dip) the winter → (spark) Hinton/AlexNet
→ (rising) AlphaGo + Transformer → (inflection) GPT 1-3 + ChatGPT
→ (peak) 100M users, GPT-4, image gen, agents → (reflection) closing
→ (integration) CTA → (landing) end card

16 scenes total. This is the structural decision that drives everything else. Change the scene list, and you're making a different video.

Step 3 — Write the beat sheet

For each of the 16 scenes, write one sentence about what the scene is. Not the narration yet — just the beat. Working file: packages/pipeline/src/script/evolution-of-ai-v5.ts.

Notice the pacing: the first 40 seconds cover 60 YEARS. The last 40 seconds cover 10 years. This is intentional — the dip makes up in emotional weight for what it lacks in story events, and the rise accelerates because the events accelerate.

Step 4 — Write the narration

For each scene, write the narration as full sentences (not period-split fragments). Target 160 words per minute total, so for a 95s video that's ~250 words of narration.
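
For reference, the arithmetic behind that figure:

    // 160 words per minute over a 95 second video:
    const narrationWordBudget = Math.round(160 * (95 / 60)); // ≈ 253 words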

Apply the parallel-narrative rule: the narrator adds context, not identity. The visual says "Sam Altman at OpenAI", the narrator says "A small lab in San Francisco called OpenAI started scaling it up."

Apply the hook-last rule: write beats 2-16 first, then go back to beat 1 and write the hook. The hook for v5 became: "Seventy years ago a handful of people decided that machines could think." That line earns the rest of the video because it sets up the implicit question — "and then what?" — which the video spends 95 seconds answering.

Step 5 — Write the visual queries with the two-tier strategy

For each scene, classify it as Tier 1 or Tier 2:

| Scene | Tier | Query | Why |
|---|---|---|---|
| Dartmouth | T2 | 1950s computer scientists laboratory archival footage black and white | No specific entity; atmospheric/historical |
| AI winter | T2 | snow winter empty street cinematic slow | Metaphor, not a literal event |
| Hinton | T1 | Geoffrey Hinton deep learning interview university Toronto | Named entity; viewer expects Hinton |
| AlphaGo | T1 | DeepMind AlphaGo Lee Sedol match 2016 press conference | Iconic specific event |
| Transformer | T2 | Google Brain research paper attention is all you need | Named paper, but abstract visual — the "paper" is the entity |
| GPT rise | T1 | Sam Altman OpenAI interview 2021 | Named entity |
| ChatGPT launch | T1 | ChatGPT launch November 2022 news coverage | Iconic specific event |
| Viral | T2 | smartphone users typing notification apps viral adoption | Abstract adoption phenomenon |
| GPT-4 | T1 | GPT-4 demo OpenAI launch March 2023 | Named event |
| Midjourney | T1 | Midjourney AI art generation demo | Named product |
| Agents | T2 | robotic arm assembly line precision factory automation | Abstract "doing" — robotics is the visual |
| Closing | T2 | modern data center servers blue lights aerial | Atmospheric "infrastructure" visual |

Step 6 — Run the orchestrator

cd zavis-video-os
npx tsx packages/pipeline/src/orchestrator/cli.ts --version=v5

The orchestrator runs the 6-stage pipeline. The first time it ran (v5 original), the source-quality filter had no relevance gate, and it accepted a Roblox video for "abandoned laboratory." I added the relevance gate. On v5.1 it rejected everything for that query. I broadened the query to "snow winter empty street cinematic slow". v5.2 converged.

Step 7 — Review

The deterministic review checks the output file for:

  • Audio clipping
  • Black frames (intentional fades only in the first 0.75s and last 0.75s)
  • FPS match
  • Resolution match
  • Duration within ±0.5s of script (usually auto-rebudget handles this)

For v5.2: deterministic review PASSED, 0 critical issues.
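
If you want to spot-check the mechanical half of those numbers yourself (duration, FPS, resolution), it can be approximated with ffprobe. This is a sketch, not the review code; audio clipping and black-frame detection are not covered here, and declaredSeconds comes from the script.

    import { execFileSync } from "node:child_process";

    // Probe the rendered file for resolution, fps and duration (sketch only).
    function probe(path: string) {
      const out = execFileSync("ffprobe", [
        "-v", "error",
        "-select_streams", "v:0",
        "-show_entries", "stream=width,height,r_frame_rate:format=duration",
        "-of", "json",
        path,
      ]).toString();
      const data = JSON.parse(out);
      const [num, den] = data.streams[0].r_frame_rate.split("/").map(Number);
      return {
        width: data.streams[0].width as number,
        height: data.streams[0].height as number,
        fps: num / den,
        durationSeconds: Number(data.format.duration),
      };
    }

    // ±0.5s tolerance from the checklist above.
    function durationWithinTolerance(outputPath: string, declaredSeconds: number): boolean {
      return Math.abs(probe(outputPath).durationSeconds - declaredSeconds) <= 0.5;
    }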

Step 8 — Human review

Watch the video end-to-end. For each scene, verify:

  1. Does the narration match the visual? (tier check)
  2. Is the caption in sync with the spoken word? (alignment check)
  3. Is there a visible watermark? (OCR check + eyes)
  4. Is the end card showing ONE Zavis wordmark? (composition check)

If anything fails, fix it and re-render. Max 3 iterations before escalating to the human.

Step 9 — Deliver

Once the review passes, the output is at generations/<runId>/output.mp4. Open it in QuickTime, share with the team.

What I'd do differently next time

  1. Start with the argument, not the topic. I started with "evolution of AI" and only crystallized the argument when I was already 2 iterations deep. If I'd started with the argument I'd have saved one iteration.
  2. Test Tier 2 queries with ≤3 substantive tokens. v5.1 failed because I wrote six-token Tier 2 queries that no candidate could match. Three tokens is usually enough.
  3. Pre-score the clips before downloading. Right now the orchestrator downloads and then checks. If I added a fast "vision check on the thumbnail" step before downloading, I'd catch more bad candidates earlier.
  4. The `act2-viral` scene is still weak. The query "smartphone users typing notification apps viral adoption" didn't have good matches. Better: rewrite as Tier 1 — "ChatGPT 100 million users news coverage 2023" which would pull actual news footage of the adoption milestone. I left it for a future pass.

Part 6 — The failure modes (so you can avoid them)

Failure mode #1 — The period-split stutter (v4)

Narration like "First came. G P T one. Then three." → cloned voice inserts a micro-pause at every period, sounds stuttery. Fix: full sentences. Save period-splits for intentional list rhythm only.

Failure mode #2 — The three-audios-clashing bug (v2, v3)

The old composition played narration for the full file length, the background video wasn't fully muted, and music was always on. Three audio sources overlapping. Fix: every Audio wrapped in a Sequence with durationInFrames; background video uses the muted prop AND volume={0}; preflight refuses to render if any two narrations overlap.
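
The pattern the fix enforces, assuming the composition is Remotion-based (the component and prop names below are Remotion's; the scene props shape is illustrative):

    import React from "react";
    import { Audio, OffthreadVideo, Sequence } from "remotion";

    type SceneProps = {
      from: number;               // first frame of this scene
      durationInFrames: number;   // scene length in frames
      clipSrc: string;            // sourced clip (local file after download)
      narrationSrc: string;       // per-scene narration audio
    };

    export const SceneLayer: React.FC<SceneProps> = ({ from, durationInFrames, clipSrc, narrationSrc }) => (
      <Sequence from={from} durationInFrames={durationInFrames}>
        {/* Background clip contributes picture only: muted AND volume 0. */}
        <OffthreadVideo src={clipSrc} muted volume={0} />
        {/* Narration is bounded by the Sequence, so it cannot bleed into the next scene. */}
        <Audio src={narrationSrc} />
      </Sequence>
    );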

Failure mode #3 — The caption-audio drift (v4)

Captions were distributed proportionally by character count over the scene duration, which doesn't match actual spoken timing. Fix: Captions component now consumes ElevenLabs per-character alignment data from alignment-slices.json and times each phrase to the exact spoken frame.
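
Roughly what "consumes the alignment data" means in practice. The slice shape below is an assumption about alignment-slices.json, not its documented format, and slice is whatever you loaded from that file:

    // Assumed slice shape: characters paired with their spoken start times in seconds.
    type AlignmentSlice = {
      characters: string[];
      startTimesSeconds: number[];
    };

    // Frame at which the caption phrase beginning at charOffset is actually spoken,
    // e.g. phraseStartFrame(slice, 42, 30) for a phrase starting at character 42 in a 30 fps comp.
    function phraseStartFrame(slice: AlignmentSlice, charOffset: number, fps: number): number {
      const seconds = slice.startTimesSeconds[charOffset] ?? 0;
      return Math.round(seconds * fps);
    }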

Failure mode #4 — The watermarked clip (v4)

Source-quality filter only blacklisted known aggregator channels; it didn't check for visible on-screen text. Fix: post-download OCR check via tesseract rejects any clip with persistent static text. Graceful skip if tesseract isn't installed (source filter is then the only defense).
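
The rejection rule itself is small. In sketch form, with frame extraction and the tesseract call abstracted behind an assumed ocrWords helper (the 4-frame / 60% numbers come from the sourcing pipeline in Part 4):

    // ocrWords is assumed to wrap tesseract and return the words found in one extracted frame.
    function hasPersistentText(framePaths: string[], ocrWords: (framePath: string) => string[]): boolean {
      const counts = new Map<string, number>();
      for (const frame of framePaths) {
        // Count each word at most once per frame.
        for (const word of new Set(ocrWords(frame))) {
          counts.set(word, (counts.get(word) ?? 0) + 1);
        }
      }
      // Persistent text = any single word visible in at least 60% of the sampled frames.
      return Array.from(counts.values()).some((n) => n / framePaths.length >= 0.6);
    }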

Failure mode #5 — The off-topic clip (v5)

Blacklist only caught aggregators; it didn't reject Roblox videos that happened to not have blacklist keywords. Fix: lexical relevance gate — reject candidates whose title+channel share less than 20% of the query's substantive tokens.

Failure mode #6 — The over-specific query (v5.1)

A six-token Tier 2 query that no candidate could match, relevance gate rejected everything, scene rendered black. Fix: rewrite query with ≤3 substantive tokens. Add an orchestrator warning when all candidates are rejected.

Failure mode #7 — The duplicate Zavis wordmark (v4)

EndCard rendered tagline "ZAVIS" as big text PLUS the logo image (which is itself the "zavis." wordmark). Fix: EndCard component auto-suppresses any tagline that equals "zavis" (case-insensitive, period-stripped).

Failure mode #8 — The 90-second hard cap (v4)

Auto-rebudget was forcing scene durations to fit a 90s total, which crunched the narration's breathing room. Fix: relaxed target to 80-110% of the script's declared duration. Total grows when narration needs it.
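
In code terms the relaxed rule is roughly the clamp below. The real rebudget distributes time per scene, so treat this as the shape of the constraint, not the implementation:

    // Shape of the constraint only; the real rebudget works scene by scene.
    function clampTotalSeconds(declared: number, needed: number): number {
      const min = declared * 0.8;  // 80% floor
      const max = declared * 1.1;  // 110% ceiling
      return Math.min(Math.max(needed, min), max);
    }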


Part 7 — Quick-reference checklist (pre-render gate)

Before running the orchestrator, verify:

  • [ ] The argument (not just the topic) is clear and in one sentence
  • [ ] The tension curve has a dip and a rise
  • [ ] The hook is under 8 words and sets up an implicit question
  • [ ] Every scene has a visual query classified as Tier 1 or Tier 2
  • [ ] Tier 1 queries contain the entity name + a year/context word
  • [ ] Tier 2 queries have ≤3 substantive tokens
  • [ ] No query contains "stock footage" / "free" / "no copyright"
  • [ ] Narration is written as full sentences (period-splits only for list rhythm)
  • [ ] No ellipses in any narration text
  • [ ] Parallel-narrative rule: no narration line duplicates a caption or on-screen text
  • [ ] Zavis enters the narrative only in the closing beat, CTA, and end card
  • [ ] End card tagline is NOT the word "Zavis" — it's an integrated message
  • [ ] Source-audio scenes have 1-3s durations, not more
  • [ ] The voice profile is zavis (single-pass, tempo 0.85)

If every box is checked, run npx tsx packages/pipeline/src/orchestrator/cli.ts --version=<id> and wait ~5-8 minutes.


Part 8 — When you're stuck

"The hook doesn't feel right"

Write the landing beat first, then write the hook to rhyme with it. If the landing is "The story has barely begun," the hook should imply the opposite of "barely begun" so the video can traverse the distance. Example: "Seventy years ago…" implies "and then what?" The landing answers it.

"The middle feels flat"

Add a contrast beat. If everything is building up, insert a "BUT wait" moment that temporarily reverses the direction. The AI winter in the canonical video is this — it's not a plot necessity, it's a tension-curve necessity.

"A clip doesn't match the narration"

Either (a) the query is wrong — rewrite it and re-source just that scene, or (b) the narration is too specific and the clip can't keep up — loosen the narration.

"The narration sounds robotic"

Check: are you using period-splits? Are there ellipses in the text? Is the voice profile zavis with stability 0.78? Did you write it as classroom prose instead of "talking to a friend"? Read it aloud.

"The video runs over target duration"

That's fine. The target is soft. Only panic if it's over 110% of target. If it is, cut the weakest scene (usually a non-Tier-1 body scene).

"The agent keeps picking bad clips"

Check your queries — are they topic descriptions instead of tier-appropriate queries? Rewrite the query in the script and re-run. If the query is fine, check the orchestrator log to see why each candidate was rejected — maybe your blacklist is too aggressive for this topic.

"The end card has two logos again"

Either the tagline is "ZAVIS" (it should be suppressed automatically, but check) or you've added a second <img> somewhere. Read the EndCard component source.


This playbook is a living document. If you find a failure mode not listed here, add it. If you discover a better first principle, propose it. The canonical sample will rot as the world moves on; the first principles won't.