# Zavis Clip Curator
## The two-tier rule (this is the most important thing in this skill)
Every scene with visual content falls into one of two tiers. The query strategy is different for each, and getting the tier wrong is the single biggest reason videos feel bad.
### Tier 1 — Named-entity scenes
The narration mentions a specific, recognizable person, product, event, or company. The viewer expects to see that specific thing. Showing generic b-roll here is "almost criminal" (quoting the user who kicked this skill into existence).
Rule: query with the entity name itself, plus a disambiguator for context.
| Narration mentions | Tier 1 query |
|---|---|
| "Sam Altman" | Sam Altman OpenAI interview 2021 |
| "Geoffrey Hinton" | Geoffrey Hinton deep learning interview university Toronto |
| "ChatGPT launch" | ChatGPT launch November 2022 news coverage |
| "GPT-4" | GPT-4 demo OpenAI launch March 2023 |
| "DeepMind AlphaGo / Lee Sedol" | DeepMind AlphaGo Lee Sedol match 2016 press conference |
| "Midjourney / DALL-E" | Midjourney AI art generation demo |
These queries target real news clips, official launch events, interviews, and conference talks. Thanks to the source-quality filter (`packages/pipeline/src/youtube/source-quality.ts`) they get boosted when they come from Bloomberg, Reuters, CNBC, The Verge, OpenAI's own channel, etc.
### Tier 2 — Concept / atmosphere scenes
The narration describes a feeling, a trend, a metaphor, or an abstract moment. The viewer has no specific expectation — they need a visual that MATCHES THE EMOTION.
Rule: query with a shot description — subject + action + composition + lighting.
| Narration says | Tier 2 query |
|---|---|
| "The AI winter" | abandoned empty computer laboratory vintage film grain dusty |
| "Dartmouth summer of 1956" | 1950s computer scientists laboratory archival footage black and white |
| "A hundred million users" | smartphone users typing notification apps viral adoption |
| "What was science fiction is now infrastructure" | modern data center servers blue lights aerial |
| "They started doing the work themselves" | robotic arm assembly line precision factory automation |
The shot version returns stock b-roll designed to be cut into other people's videos, and the filter prefers shorter, more atmospheric clips.
### How to decide which tier
Ask: "Is there a specific thing the viewer will feel cheated if they don't see?"
- If yes → Tier 1, query with the entity name.
- If no → Tier 2, query with a shot description.
When in doubt, err toward Tier 1 for recent, recognizable events (anything post-2015 with a named product or person) and toward Tier 2 for historical or abstract moments.
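The question compresses to a one-liner. A minimal sketch, assuming a hypothetical `Scene` shape (in practice this stays an editorial judgment call, not pipeline code):

```ts
// Hypothetical Scene shape -- the real pipeline has no such type; this just
// encodes the decision question as code.
interface Scene {
  narration: string;
  namedEntities: string[]; // specific people/products/events the narration names
}

type Tier = 1 | 2;

function pickTier(scene: Scene): Tier {
  // "Is there a specific thing the viewer will feel cheated if they don't see?"
  // Named entity present -> Tier 1 (query the entity); otherwise -> Tier 2 (shot description).
  return scene.namedEntities.length > 0 ? 1 : 2;
}
```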
## The sourcing pipeline (as of v5)

The pipeline lives in `packages/pipeline/src/youtube/` and does the following for each scene:

- Search — `searchYouTube(query, { maxResults: 12 })` pulls 12 candidates so the quality filter has room to work.
- Rank — `rankResults(raw)` applies the blacklist-then-boost scorer in `source-quality.ts`. Stock-aggregator channels and watermarked titles are HARD rejected. Trusted sources (BBC, Bloomberg, CNBC, Reuters, The Verge, TED, OpenAI, DeepMind, MIT, Stanford, Lex Fridman, etc.) get +15 to +40 score boosts. View count gives the baseline.
- Download — each top-ranked candidate is downloaded in order until one succeeds.
- Watermark check — `checkForWatermarks(localPath)` runs tesseract OCR on 4 sampled frames and rejects clips where the same word appears in ≥60% of frames (persistent static text = watermark/chyron/channel bug). If tesseract is not installed, it skips gracefully and relies on the source filter as the primary defense.
- Trim — mode-aware: `source-audio` clips trim around the keyword timestamp from the transcript; `muted-broll` clips take a middle slice.
- Copy — the clip is copied into the template's `public/` folder and the composition plays it.
If ALL candidates are blacklisted for a scene, the orchestrator warns and moves on — the scene renders with a black fallback. That's a signal to rewrite the query, not to proceed.
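Stitched together, the per-scene flow looks roughly like this. Only the function names (`searchYouTube`, `rankResults`, `checkForWatermarks`) come from the pipeline itself; the module paths, option shapes, and return types here are assumptions:

```ts
// Sketch of the per-scene sourcing loop -- not the real orchestrator.
import { searchYouTube, rankResults } from "./search";          // hypothetical path
import { checkForWatermarks } from "./watermark";               // hypothetical path
import { downloadClip, trimClip, copyToPublic } from "./clips"; // hypothetical helpers

type TrimMode = "source-audio" | "muted-broll";

async function sourceScene(query: string, mode: TrimMode): Promise<string | null> {
  const raw = await searchYouTube(query, { maxResults: 12 }); // room for the filter to work
  const ranked = rankResults(raw);                            // blacklist first, then boost

  for (const candidate of ranked) {
    const localPath = await downloadClip(candidate).catch(() => null);
    if (!localPath) continue;                                 // download failed: next candidate

    const hasWatermark = await checkForWatermarks(localPath); // OCR on 4 sampled frames
    if (hasWatermark) continue;                               // persistent static text: reject

    // Mode-aware trim: source-audio cuts around the transcript keyword timestamp;
    // muted-broll takes a middle slice.
    const trimmed = await trimClip(localPath, mode);
    return copyToPublic(trimmed);                             // into the template's public/ folder
  }

  // Every candidate rejected: warn and fall back to black. The real fix
  // is to rewrite the query.
  console.warn(`No usable clip for query: ${query}`);
  return null;
}
```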
## The shot vocabulary
Build your queries from these proven shot patterns:
### Subject + action + composition

`[subject] [action] [shot type]`

- "scientist writing on chalkboard close up"
- "robot arm welding factory wide shot"
- "person typing laptop keyboard macro"
### Subject + setting + mood
- "abandoned office snow falling outside window cinematic"
- "neon city street rainy night reflections"
- "sunrise over mountain ranges dramatic"
### Texture + macro
- "paint brush stroke canvas slow motion macro"
- "human eye iris extreme close up cinematic"
- "circuit board electricity flowing macro"
### Era markers
- For historical: add "1950s" / "1970s" / "vintage" / "archival" / "black and white"
- For futuristic: add "futuristic" / "sci-fi" / "neon" / "cyberpunk"
- For present: usually omit era markers — modern is the default
### Pacing markers
- "slow motion" — for emphasis moments
- "time lapse" — for "things changed quickly"
- "macro" — for zoom-in feel
- "aerial" / "drone" — for "vast scope"
## What to avoid in Tier 2 queries (concept scenes)
These rules apply to Tier 2 only. Tier 1 queries break these rules on purpose.
- Brand names in Tier 2: pulling "OpenAI" for a generic "AI is changing everything" scene returns their product demo reel, which clashes with your narration. Save brand names for Tier 1.
- Abstract concepts: "intelligence" / "innovation" / "future" — these return mostly stock footage with watermarks. Use what the concept VISUALLY LOOKS LIKE instead.
- The literal word "AI": too generic for Tier 2. Use servers, code, robotics, dashboards.
- Queries that return only news broadcasts in Tier 2: if the narration isn't about a specific news event, news footage will clash because the anchor's lips are moving.
## What to include in Tier 1 queries (named-entity scenes)
- The entity name — non-negotiable, this is the whole point.
- A year or event disambiguator: "Sam Altman 2021" or "GPT-4 demo March 2023" — narrows results to the right era.
- A context word: "interview" / "press conference" / "keynote" / "launch" / "demo" — tells YouTube what kind of footage you want.
- The organization or setting when relevant: "OpenAI" / "DeepMind" / "Stanford".
Do NOT add "stock footage" / "b-roll" / "cinematic" to Tier 1 queries — that actively filters OUT the news clips and interviews you want.
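The same idea as the Tier 2 builder, with the entity leading. Again hypothetical, for illustration only:

```ts
// Assembles a Tier 1 query: entity name first, then disambiguators.
interface EntityQuery {
  entity: string;       // "Sam Altman" -- non-negotiable
  org?: string;         // "OpenAI" | "DeepMind" | "Stanford"
  contextWord?: string; // "interview" | "keynote" | "launch" | "demo"
  yearOrEvent?: string; // "2021" | "March 2023"
}

function buildTier1Query(q: EntityQuery): string {
  // Deliberately no "stock footage" / "b-roll" / "cinematic" -- those filter
  // OUT the news clips and interviews Tier 1 wants.
  return [q.entity, q.org, q.contextWord, q.yearOrEvent].filter(Boolean).join(" ");
}

// buildTier1Query({ entity: "GPT-4", org: "OpenAI",
//                   contextWord: "demo", yearOrEvent: "March 2023" })
// -> "GPT-4 OpenAI demo March 2023"
```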
## How to write queries from a script

When you're writing a script and reach the visual direction for each scene, ask:
- What is the narrator saying at this moment?
- What is the EMOTION the viewer should feel?
- What is the most VISUALLY ARRESTING way to depict that emotion?
- How would a professional cinematographer shoot it?
Then write a query that describes that cinematographer's shot.
### Example walkthrough
Narration: "For sixty years. Almost nothing happened. The hype died. The funding froze. They called it the AI winter."
- Emotion: cold, abandoned, time passing slowly
- Most visually arresting depiction: a literal winter scene that mirrors the metaphor
- Cinematographer's shot: an empty office with snow falling outside the window, dust on equipment, low light
- Query: "abandoned office snow falling outside window cinematic"
NOT "AI winter history funding" — that returns documentaries.
## Quality filters (enforced automatically)

The pipeline enforces these automatically now — see `source-quality.ts`:
Hard blacklist (channel or title contains any of these):
- "stock footage", "free footage", "no copyright", "copyright free", "royalty free"
- "free download", "download link", "link in description", "click to download"
- "pexels", "pixabay", "videvo", "videezy", "mixkit", "coverr" (screen recordings of stock sites)
- "subscribe for more", "like and subscribe"
Trusted source boost (+15 to +40 on score):
- Tier 1 news: BBC News, Bloomberg, WSJ, Reuters, CNBC, FT, The Economist, PBS NewsHour
- Tier 2 tech press: The Verge, TechCrunch, Wired, MKBHD
- Tier 3 official: OpenAI, DeepMind, Anthropic, Google AI, Microsoft Research, Stanford, MIT, TED, Lex Fridman, Y Combinator, World Economic Forum
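The scorer's shape, roughly. The list contents mirror this section; the per-channel boost values and the log-scaled view-count baseline are assumptions:

```ts
// Sketch of the blacklist-then-boost scorer in source-quality.ts.
const BLACKLIST = [
  "stock footage", "free footage", "no copyright", "copyright free", "royalty free",
  "free download", "download link", "link in description", "click to download",
  "pexels", "pixabay", "videvo", "videezy", "mixkit", "coverr",
  "subscribe for more", "like and subscribe",
];

const TRUSTED_BOOSTS: Record<string, number> = {
  "Bloomberg": 40, "Reuters": 40, "BBC News": 40, // Tier 1 news
  "The Verge": 25, "TechCrunch": 25,              // Tier 2 tech press
  "OpenAI": 30, "DeepMind": 30, "TED": 20,        // Tier 3 official
};

interface Candidate { title: string; channel: string; viewCount: number; }

function scoreCandidate(c: Candidate): number | null {
  const haystack = `${c.channel} ${c.title}`.toLowerCase();
  // Hard blacklist first: no boost can rescue a rejected candidate.
  if (BLACKLIST.some((term) => haystack.includes(term))) return null;
  // View count gives the baseline; trusted sources stack +15..+40 on top.
  const boost = TRUSTED_BOOSTS[c.channel] ?? 0;
  return Math.log10(c.viewCount + 1) + boost;
}
```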
Post-download OCR check (if tesseract is installed):
- Extracts 4 frames evenly spaced in the clip
- Runs OCR on each
- Rejects if any word appears in ≥60% of frames (persistent static text = watermark)
- Also rejects on specific watermark keywords: "copyright", "subscribe", "download", "click", "free", "shutterstock", "gettyimages", "istock"
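The frequency check itself is simple. A sketch, assuming the 4 frames have already been extracted and OCR'd into one lowercase word set per frame:

```ts
// Persistent-text check: the same word in >= 60% of sampled frames is
// treated as static text (watermark/chyron/channel bug).
const WATERMARK_KEYWORDS = new Set([
  "copyright", "subscribe", "download", "click", "free",
  "shutterstock", "gettyimages", "istock",
]);

function isWatermarked(frameWords: Set<string>[]): boolean {
  const counts = new Map<string, number>();
  for (const words of frameWords) {
    for (const word of words) {
      if (WATERMARK_KEYWORDS.has(word)) return true;      // known watermark word
      counts.set(word, (counts.get(word) ?? 0) + 1);
    }
  }
  const threshold = frameWords.length * 0.6;              // >= 60% of frames
  return [...counts.values()].some((n) => n >= threshold);
}
```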
YouTube-only for the `youtube-montage` template. This is not a stock-footage template — the whole point is to use real footage from real sources. Do not add Pexels / Mixkit fallbacks to this template.
## How to review a video for clip mismatches
When watching a rendered video, ask for each scene:
- Does the visual match what I'm hearing? If the narrator says "neural networks" and I see a person walking on a street, that's a mismatch.
- Is the visual on-topic OR atmospherically aligned? Both are fine. A literal match is best, an atmospheric match is acceptable, no match is a critical issue.
- Does the visual contain its own competing narrative? If I see a news anchor's mouth moving and they're clearly saying words (even though muted), that's distracting and reads as wrong.
- Is the visual at the right pace? A static shot under fast narration feels slow. A fast-paced clip under slow narration feels chaotic.
For every clip mismatch you find, write a NEW query (using the rules above) and re-source that one scene only.
## The DOs and DON'Ts cheat sheet
DO:
- ✓ Write queries as visual shot descriptions
- ✓ Include shot type (close up / wide / macro / aerial / time lapse)
- ✓ Include lighting/mood (cinematic / dramatic / neon / golden hour)
- ✓ Write stock-style queries (the kind that would work on Pexels) for atmospheric Tier 2 scenes
- ✓ Re-source individual scenes that don't match instead of redoing everything
- ✓ Test queries by running them on YouTube manually and checking the top result
DON'T:
- ✗ Use the topic of the video as the query
- ✗ Use brand names in Tier 2 queries (save them for Tier 1)
- ✗ Use abstract nouns ("innovation," "future," "intelligence")
- ✗ Use queries that return news broadcasts or interviews for Tier 2 scenes (Tier 1 wants exactly those)
- ✗ Write queries longer than ~10 words
- ✗ Forget shot type — every query should suggest a CAMERA, not just a TOPIC