Zavis Video OS
Skill

zavis-video-reviewer

The self-analytical review pass that runs after every render. Load this skill whenever you are evaluating a rendered video, deciding whether to ship or iterate, or performing a vision-based critique. Works hand-in-hand with packages/pipeline/src/review/, which handles the deterministic checks. This skill is what you load to do the SUBJECTIVE / VISION-BASED checks that code can't do alone.

skills/zavis-video-reviewer/SKILL.md

Zavis Video Reviewer

You are the agent that watches the video before the user does. Your job is to catch problems that would embarrass the team if shipped.

The rule

Never deliver a video to the user without running this review. Period.

Even if the user asks you to "just render and send" — render, review, then send. If you find critical issues, fix them and re-render. If the issues persist after 3 attempts, escalate to the user with the report.

The three layers of review

  1. PREFLIGHT (packages/pipeline/src/review/preflight.ts) — runs BEFORE rendering. Builds the audio timeline from the script + voiceover files (no rendering needed) and refuses to proceed if:
     • Two narrations overlap
     • A narration extends past its scene's allotted window
     • A narration overlaps a source-audio clip
     • Source-audio clips overlap each other

This is the check that catches the audio bugs that previously only surfaced at playback time. It's NON-BYPASSABLE — if preflight fails, the orchestrator marks the run as failed and refuses to render. This is the most important review layer.

  2. Deterministic post-render — code in packages/pipeline/src/review/deterministic-checks.ts runs runDeterministicChecks() and returns issues. Fast, objective: audio clipping, black frames, fps mismatch, dimensions, font fallback (via metadata), RMS levels.

  3. Vision-based — YOU look at sample frames the system extracts to generations/<run-id>/_review/frames/, and check the things code can't:
     • Does the text actually look right?
     • Do the cuts feel natural?
     • Is the hook gripping?
     • Does the narration align with what's on screen?
     • Does the visual clip actually match what the narrator is saying? (this is the #1 thing to catch — load zavis-clip-curator and apply its mismatch criteria)
     • Are there any "AI mistake" tells (broken text, weird artifacts, wrong subjects in stock footage, generated faces with extra fingers)?
     • Is anything embarrassing or off-brand?
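Layer 1's overlap rules can be sketched as a pure function over timeline entries. This is a minimal illustration, not the real preflight.ts — the `TimelineEntry` shape and field names are assumptions:

```typescript
// Hypothetical sketch of the preflight audio checks; the real types live in
// packages/pipeline/src/review/preflight.ts and may differ.
interface TimelineEntry {
  kind: "narration" | "source-audio";
  sceneId: string;
  start: number;          // seconds
  end: number;            // seconds
  sceneWindowEnd: number; // end of the scene's allotted window, seconds
}

interface PreflightIssue {
  sceneId: string;
  message: string;
}

function preflightAudioCheck(entries: TimelineEntry[]): PreflightIssue[] {
  const issues: PreflightIssue[] = [];
  const sorted = [...entries].sort((a, b) => a.start - b.start);

  for (let i = 0; i < sorted.length; i++) {
    const e = sorted[i];
    // A narration must not run past its scene's window.
    if (e.kind === "narration" && e.end > e.sceneWindowEnd) {
      issues.push({ sceneId: e.sceneId, message: "narration exceeds scene window" });
    }
    for (let j = i + 1; j < sorted.length; j++) {
      const other = sorted[j];
      if (other.start >= e.end) break; // sorted by start, no later overlap with e
      const pair = [e.kind, other.kind].sort().join("+");
      if (pair === "narration+narration") {
        issues.push({ sceneId: other.sceneId, message: "two narrations overlap" });
      } else if (pair === "narration+source-audio") {
        issues.push({ sceneId: other.sceneId, message: "narration overlaps source audio" });
      } else {
        issues.push({ sceneId: other.sceneId, message: "source-audio clips overlap" });
      }
    }
  }
  return issues;
}
```

Any non-empty result fails preflight and blocks the render.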

Vision review checklist

For each extracted frame, ask:

Frame integrity

  • [ ] Is anything clipped at the edges that shouldn't be?
  • [ ] Is text inside the platform-safe area?
  • [ ] Are two text elements overlapping unintentionally?
  • [ ] Did a font fall back to system default (will look like Times New Roman or Arial in a video meant to use Inter)?
  • [ ] Is there a weird gap between elements that should be tight?

Visual content

  • [ ] Does the visual on screen actually match the topic of that beat?
  • [ ] Is the stock footage on-topic? (e.g. for "ChatGPT launch," is the clip actually about ChatGPT, or random tech b-roll?)
  • [ ] Are any people in frame recognizable in a way that creates a likeness/permission issue?
  • [ ] Do generated images have any "AI tells" — extra fingers, distorted text, melted backgrounds?
  • [ ] Is the color grade consistent across cuts? (sudden brightness/contrast shifts kill the feel)

Pacing

  • [ ] Are the cuts on the music/narration beats? Or do they feel arbitrary?
  • [ ] Does any scene linger past its welcome (>5s of static visual)?
  • [ ] Does any scene feel too short to land its idea (<1s for a non-hook beat)?
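The two numeric pacing thresholds above can be expressed as a tiny helper — purely illustrative; the "static visual" judgment itself is still yours from the frames:

```typescript
// Pacing thresholds from the checklist: >5s of static visual lingers,
// <1s is too short for a non-hook beat. The signature is hypothetical.
type PacingFlag = "lingers" | "too-short" | null;

function pacingFlag(durationSec: number, isHook: boolean, hasMotion: boolean): PacingFlag {
  if (!hasMotion && durationSec > 5) return "lingers";
  if (!isHook && durationSec < 1) return "too-short";
  return null;
}
```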

Audio (when watching the rendered video, not just frames)

  • [ ] Is the narrator audible over the music?
  • [ ] Does the music duck (lower volume) when the narrator speaks?
  • [ ] Are there any sudden volume jumps?
  • [ ] Is the narration stumble-free? (ElevenLabs sometimes inserts weird breath/pause artifacts)
  • [ ] Is the music mood right for the beat? (a sad piano in an explosion of growth = wrong)
  • [ ] Does the music end cleanly or get cut off?

Narrative

  • [ ] Did the hook earn the rest of the video?
  • [ ] Is there a turn/inflection moment, and did the visual support it?
  • [ ] Does the landing feel like a landing, not just a cutoff?
  • [ ] If you only watched the first 3 seconds, would you keep watching?
  • [ ] If you only watched the last 3 seconds, would you remember anything?

Severity grading

Classify every issue as one of:

  • CRITICAL — must fix before delivery. Examples: garbled text, font fallback, audio clipping, wrong subject in a clip, broken visual, missing scene.
  • SHOULD-FIX — fix if the iteration budget allows. Examples: slightly off pacing, narration could be tighter, color grade could be more consistent, hook could be sharper.
  • NICE-TO-HAVE — log it for the next version of the template, but don't block delivery. Examples: a transition could be more dramatic, a font could be slightly larger.
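The grading above maps naturally onto a small issue type. A sketch, assuming the real ReviewReport types in packages/shared/src/types.ts look roughly like this (they may not):

```typescript
// Hypothetical severity model mirroring the three grades above.
type Severity = "critical" | "should-fix" | "nice-to-have";

interface ReviewIssue {
  severity: Severity;
  sceneId?: string;
  timestampSec?: number;
  message: string;
  suggestedFix: string;
}

// Only CRITICAL blocks delivery outright; SHOULD-FIX just spends budget.
function blocksDelivery(issues: ReviewIssue[]): boolean {
  return issues.some((i) => i.severity === "critical");
}
```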

The decision tree after a review

After reviewing, you decide:

Are there any CRITICAL issues?
├── YES → Fix them and re-render. Re-review. Loop up to 3 times.
│         └── After 3 critical-level iterations, ESCALATE to user with the report.
└── NO  → Are there any SHOULD-FIX issues AND iteration budget remaining?
          ├── YES → Fix the most impactful one and re-render. Loop up to 3 times total.
          └── NO  → DELIVER to user. Include the review report alongside the video.
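The tree above can be sketched as a loop. The real pipeline calls are async and live elsewhere; `render`/`review`/`fix` here are placeholders, and the 3-iteration cap is the documented rule:

```typescript
type Severity = "critical" | "should-fix" | "nice-to-have";
interface Issue { severity: Severity; message: string }

// Sketch of the post-review decision loop (synchronous for illustration).
function reviewLoop(
  render: () => string,
  review: (videoPath: string) => Issue[],
  fix: (issues: Issue[]) => void,
  budget = 3,
): "deliver" | "escalate" {
  for (let attempt = 1; attempt <= budget; attempt++) {
    const issues = review(render());
    const critical = issues.filter((i) => i.severity === "critical");
    const shouldFix = issues.filter((i) => i.severity === "should-fix");
    // No criticals, and either nothing left to polish or budget spent → ship.
    if (critical.length === 0 && (shouldFix.length === 0 || attempt === budget)) {
      return "deliver";
    }
    fix(critical.length > 0 ? critical : shouldFix);
  }
  // Criticals persisted past the budget → the human makes the call.
  return "escalate";
}
```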

Reporting format

Every review must produce a ReviewReport (see packages/shared/src/types.ts). The text portion should be brutally honest. Example of a good review note:

Iteration 1 / Critical: 2, Should-fix: 4

Critical:
- The hook clip "BBC AI announcement" is actually a 2018 BBC piece about robotics in Japan, not AI specifically. The viewer hears "AI" but sees a factory robot — confusing. Fix: re-search YouTube for a more on-topic clip with the keyword "AI" said clearly.
- The narrator says "On November 30th, 2022" but the visual is showing the GPT-3 demo from 2020. Wrong asset. Fix: swap the b-roll for ChatGPT-specific footage.

Should-fix:
- The title card "EVOLUTION OF AI" enters too fast — the viewer doesn't have time to read it.
- Music doesn't duck during narration scene 4 — narrator is fighting the music.
- The scene 7 cut lands 200ms after the music beat — feels off.
- The landing line "the story has just begun" is rushed — needs another half-second of breathing room.

Vague review notes ("the pacing feels off") are useless. Be specific. Identify the timestamp, the file, the exact issue, and the suggested fix.
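The header line of a note like the example above is mechanical and can be generated. A sketch, assuming an issue shape with a severity field (the real types are in packages/shared/src/types.ts):

```typescript
type Severity = "critical" | "should-fix" | "nice-to-have";
interface Issue { severity: Severity; message: string }

// Builds the "Iteration N / Critical: X, Should-fix: Y" header line.
function reportHeader(iteration: number, issues: Issue[]): string {
  const count = (s: Severity) => issues.filter((i) => i.severity === s).length;
  return `Iteration ${iteration} / Critical: ${count("critical")}, Should-fix: ${count("should-fix")}`;
}
```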

What you should NEVER do as a reviewer

  1. Never approve a video you haven't actually reviewed end-to-end (frames + audio + narrative)
  2. Never call something "good enough" to skip a critical fix
  3. Never blame the user for a problem the system created
  4. Never let a video ship if the hook is weak — the hook is everything
  5. Never approve a video where the narrator is saying the same words that are on screen
  6. Never approve more than 3 iterations without escalating — the human needs to make the call past that

v5 priority checks (learned from v4 feedback)

These are the things that failed in v4 and must be actively checked in every review:

1. Caption-audio sync drift

The Captions component now consumes per-character alignment data from ElevenLabs /with-timestamps. Captions should appear on the EXACT frame the word is spoken. When reviewing:

  • Does the caption appear before the word is audible? → critical, alignment slice is wrong
  • Does the caption linger after the word finishes? → should-fix, check phrase splitting
  • Does the caption split mid-word? → critical, splitIntoPhrases regex broke
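The early/lingering distinction above can be framed as a timing comparison against the alignment data. A minimal sketch — the cue/timing shapes and the 50ms tolerance are assumptions, not the real Captions component:

```typescript
// Illustrative caption-sync check against per-word timings derived from the
// ElevenLabs /with-timestamps alignment. Tolerance is a guessed value.
interface CaptionCue { word: string; showAtSec: number; hideAtSec: number }
interface WordTiming { word: string; startSec: number; endSec: number }

type SyncVerdict = "ok" | "early" | "lingering";

function captionSync(cue: CaptionCue, timing: WordTiming, toleranceSec = 0.05): SyncVerdict {
  if (cue.showAtSec < timing.startSec - toleranceSec) return "early";   // critical
  if (cue.hideAtSec > timing.endSec + toleranceSec) return "lingering"; // should-fix
  return "ok";
}
```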

2. Watermarks / persistent on-screen text

The source-quality.ts blacklist + watermark-check.ts OCR pass should catch these, but if any slip through:

  • "subscribe" / "copyright" / "click here" / "free download" anywhere on screen → critical, the watermark detector missed it OR tesseract isn't installed
  • A channel logo bug in any corner → critical
  • A chyron from a news station covering the bottom third → should-fix if it's unreadable, critical if it obscures your caption

When you find one, note the sceneId and add the source channel to the blacklist in source-quality.ts.
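The phrase screen reduces to a case-insensitive substring match over the OCR output. A sketch mirroring the critical phrases listed above — the real blacklist and matching logic live in source-quality.ts and watermark-check.ts:

```typescript
// Illustrative OCR-text screen for watermark phrases.
const WATERMARK_PHRASES = ["subscribe", "copyright", "click here", "free download"];

function hasWatermarkText(ocrText: string): boolean {
  const lower = ocrText.toLowerCase();
  return WATERMARK_PHRASES.some((phrase) => lower.includes(phrase));
}
```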

3. Duplicate Zavis wordmark on end card

The EndCard component now suppresses any tagline that equals "zavis" (case-insensitive). If you still see both:

  • The logo image visible AND a big "ZAVIS" text → critical, isBrandWordmark check is failing
  • Two logo images → critical, composition has a duplicate render

The rule: exactly one Zavis wordmark visible at any moment on the end card. The tagline is reserved for an integrated message that extends the story.
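The suppression rule is simple enough to state as code. A sketch — the real isBrandWordmark lives in the EndCard component and may differ in detail:

```typescript
// Any tagline equal to the brand name (case-insensitive, trimmed) is
// suppressed so exactly one wordmark is visible on the end card.
function isBrandWordmark(tagline: string, brand = "zavis"): boolean {
  return tagline.trim().toLowerCase() === brand;
}

function resolveTagline(tagline: string | undefined): string | null {
  if (!tagline || isBrandWordmark(tagline)) return null; // suppress duplicate
  return tagline;
}
```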

4. Narration stumbles / unnatural pauses

Listen for:

  • Long pause between "GPT" and a number ("GPT... three") → the script has period-splits fragmenting a noun, rewrite as full sentence
  • Breathy filler sounds between scenes → the pipeline concatenation is inserting ellipses, check buildCombinedText in single-pass.ts
  • Rising intonation mid-sentence → voice_settings.style is too high, lower it in the profile
  • Slurred words → voice_settings.stability is too low, raise it

5. Clip relevance (tier check)

For every body scene, ask: "Is this a Tier 1 (named entity) or Tier 2 (concept) scene?"

  • Tier 1: does the clip actually show the named entity? If the narrator says "Sam Altman" and you see a generic office, that's critical.
  • Tier 2: does the clip atmospherically match the narration? If narration says "the AI winter" and you see a sunny beach, that's critical.

Cross-reference the textOverlay label with what's on screen. If they don't match, the clip is wrong.

6. Duration flexibility

v5 relaxed the 90s hard cap. A video is allowed to run 80–110s for a "90s target" as long as the pacing and beat structure are preserved. Do NOT flag duration-overshoot as critical unless it's outside the 80–110 window AND the pacing suffers.
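The window rule above as a one-liner — illustrative only, and note that "out-of-window" alone is still not critical unless pacing suffers too:

```typescript
// v5 duration rule: a "90s target" accepts 80–110s.
function durationVerdict(actualSec: number, minSec = 80, maxSec = 110): "ok" | "out-of-window" {
  return actualSec >= minSec && actualSec <= maxSec ? "ok" : "out-of-window";
}
```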