Zavis Video OS

You are working inside zavis-video-os, an opinionated video pipeline that combines:

A Remotion primitives kit (packages/remotion-kit/) — brand DNA, motion utilities, reusable components
A library of templates (packages/templates/) — one folder per video archetype
A pipeline layer (packages/pipeline/) — script writing, YouTube sourcing, ElevenLabs voiceover, render orchestration, and a self-analytical review loop
Skills (this directory + .claude/skills/) — the agentic intelligence layer

Core principles (NON-NEGOTIABLE)

Never hardcode colors, fonts, sizes, or motion configs. Always import from @zavis/remotion-kit/brand. If a value you need isn't there, propose an addition to brand DNA — don't bypass it.

Brand DNA is single-source-of-truth.

Once a template version ships, it doesn't mutate. Edits create a new version. Every template must accept an aspect ratio and recompose itself accordingly — no fixed dimensions.

Templates are immutable, versioned, and aspect-ratio-aware.
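A minimal sketch of both halves of this rule. It assumes nothing about the real template contract: `AspectRatio`, `frameSizeFor`, `TemplateVersion`, and `publishNextVersion` are hypothetical names, and the 1080-pixel short edge is only an illustrative default.

```typescript
// Hypothetical names throughout; the real template contract may differ.
type AspectRatio = "16:9" | "9:16" | "1:1" | "4:5";

interface FrameSize {
  width: number;
  height: number;
}

// Derive the frame size from the ratio and a short-edge length instead of fixing dimensions.
function frameSizeFor(ratio: AspectRatio, shortEdge: number = 1080): FrameSize {
  const [w, h] = ratio.split(":").map(Number);
  const scale = shortEdge / Math.min(w, h);
  // Round to even numbers, because yuv420p H.264 output requires even dimensions.
  const toEven = (n: number): number => Math.round(n / 2) * 2;
  return { width: toEven(w * scale), height: toEven(h * scale) };
}

interface TemplateVersion {
  readonly name: string;
  readonly version: number;
  readonly notes: string;
}

// An "edit" appends a new frozen version; shipped versions are never touched.
function publishNextVersion(
  shipped: readonly TemplateVersion[],
  notes: string,
): readonly TemplateVersion[] {
  if (shipped.length === 0) throw new Error("no shipped version to build on");
  const latest = shipped[shipped.length - 1];
  const next: TemplateVersion = Object.freeze({
    name: latest.name,
    version: latest.version + 1,
    notes,
  });
  return [...shipped, next];
}
```

The caller passes only a ratio; width and height are always computed, so a template has nothing to hardcode.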

Every render goes through:

- Preflight (packages/pipeline/src/review/preflight.ts): runs BEFORE rendering. Builds the audio timeline from script + voiceover files and refuses to render if any narration overlaps with another or overflows its scene window. This is non-bypassable.
- Deterministic post-render: checks audio peaks, RMS levels, black frames, fps, dimensions, duration.
- Vision review: extracts sample frames and checks for clip mismatches, broken text, font fallback (this is the "watch the video and tell me what's wrong" pass).

If any critical issue is found, the orchestrator iterates (max 3 attempts) by patching the script + re-sourcing the offending scene + re-rendering. You should NEVER deliver a video to the user without running this loop and reporting the result.

Self-analytical review is MANDATORY and runs as a real loop.
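The preflight rule (no narration overlap, no overflow past the scene window) can be sketched as a pure check over a timeline. This is an illustration under assumptions, not the contents of preflight.ts: `NarrationCue`, `PreflightIssue`, and `preflightSketch` are hypothetical names, and the real types in packages/shared/src/types.ts may look different.

```typescript
// Hypothetical shapes for illustration only.
interface NarrationCue {
  sceneId: string;
  startSec: number;    // where the narration starts on the global timeline
  durationSec: number; // measured length of the voiceover clip
  sceneEndSec: number; // end of the scene window that owns this cue
}

interface PreflightIssue {
  kind: "overlap" | "overflow";
  sceneId: string;
  detail: string;
}

function preflightSketch(cues: NarrationCue[]): PreflightIssue[] {
  const issues: PreflightIssue[] = [];
  const sorted = [...cues].sort((a, b) => a.startSec - b.startSec);
  let latestEnd = -Infinity; // furthest narration end seen so far
  let latestId = "";
  for (const cue of sorted) {
    const end = cue.startSec + cue.durationSec;
    if (cue.startSec < latestEnd) {
      issues.push({ kind: "overlap", sceneId: cue.sceneId, detail: `starts before ${latestId} ends` });
    }
    if (end > cue.sceneEndSec) {
      issues.push({ kind: "overflow", sceneId: cue.sceneId, detail: `runs ${(end - cue.sceneEndSec).toFixed(2)}s past its scene` });
    }
    if (end > latestEnd) {
      latestEnd = end;
      latestId = cue.sceneId;
    }
  }
  return issues;
}

// Refuse to render when any issue exists.
function assertRenderable(cues: NarrationCue[]): void {
  const issues = preflightSketch(cues);
  if (issues.length > 0) {
    throw new Error(`preflight failed: ${issues.map((i) => `${i.kind}@${i.sceneId}`).join(", ")}`);
  }
}
```

Tracking the furthest end seen so far, rather than only the previous cue, catches a long clip that runs across several later scenes.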

Every video, regardless of template, must earn its viewer in seconds 0-3. Load the zavis-hook-engineer skill whenever you're writing or reviewing the opening of a video.

Hooks live in the first 3 seconds.

If the narrator is saying something and a text overlay is showing the same thing, one of them is wasted. Load zavis-narration-director for any narration work.

Narration vs visual must not say the same thing.

Use the zavis narrator profile (cloned voice). The pipeline generates the entire script's narration in ONE ElevenLabs call, then ffmpeg-cuts it into per-scene clips using the alignment data. This produces continuous, natural-sounding narration with one API call instead of N. See packages/pipeline/src/voiceover/single-pass.ts.

Single-pass narration by default.
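A rough sketch of how one narration file could be cut into per-scene clips from character-level alignment. It is not the code in single-pass.ts. The `CharAlignment` field names are modeled on the alignment object of the ElevenLabs with-timestamps response and should be checked against the API docs; `sliceByScene`, `ffmpegCutArgs`, and the single-space separator are assumptions made for the example.

```typescript
// Field names modeled on the ElevenLabs alignment object (assumption, verify before use).
interface CharAlignment {
  characters: string[];
  character_start_times_seconds: number[];
  character_end_times_seconds: number[];
}

interface SceneSlice {
  sceneIndex: number;
  startSec: number;
  endSec: number;
}

// Assumes the full script was built by joining scene texts with `separator`,
// and that alignment.characters matches that joined string character by character.
function sliceByScene(
  sceneTexts: string[],
  alignment: CharAlignment,
  separator: string = " ",
): SceneSlice[] {
  const slices: SceneSlice[] = [];
  let offset = 0;
  sceneTexts.forEach((text, sceneIndex) => {
    const first = offset;
    const last = offset + text.length - 1;
    if (text.length === 0 || last >= alignment.characters.length) {
      throw new Error(`scene ${sceneIndex} does not fit the alignment data`);
    }
    slices.push({
      sceneIndex,
      startSec: alignment.character_start_times_seconds[first],
      endSec: alignment.character_end_times_seconds[last],
    });
    offset = last + 1 + separator.length;
  });
  return slices;
}

// Arguments for one ffmpeg cut; -ss and -to select the time range of the output.
function ffmpegCutArgs(input: string, slice: SceneSlice, output: string): string[] {
  return ["-i", input, "-ss", slice.startSec.toFixed(3), "-to", slice.endSec.toFixed(3), output];
}
```

Each scene clip starts at its first character's start time and ends at its last character's end time, so pauses between scenes fall outside every clip.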

- Tier 1 (named entities): when the narration mentions a specific recognizable person/product/event, query WITH THE ENTITY NAME. Example: narration says "Sam Altman" → query = "Sam Altman OpenAI interview 2021". The viewer expects to see Sam Altman.
- Tier 2 (concepts/atmosphere): when narration describes a feeling, metaphor, or abstract moment, query with a visual SHOT description. Example: narration says "the AI winter" → query = "abandoned empty computer laboratory vintage film grain dusty".

The source-quality filter (packages/pipeline/src/youtube/source-quality.ts) blacklists stock-footage aggregators and boosts trusted sources (Bloomberg, Reuters, OpenAI, DeepMind, TED, Lex Fridman, etc.). Load zavis-clip-curator whenever writing or reviewing visual queries.

Clip queries use the two-tier strategy.
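One way to picture the two tiers and the source filter together. Everything named here is hypothetical (`QuerySpec`, `buildQuery`, `rankCandidates`, the boost value, and the placeholder blocked channel); only the trusted source names come from the text above. The real source-quality.ts may score differently.

```typescript
// Tier 1 carries an entity name; tier 2 carries a visual shot description.
type QuerySpec =
  | { tier: 1; entity: string; context: string }
  | { tier: 2; shot: string };

function buildQuery(spec: QuerySpec): string {
  return spec.tier === 1 ? `${spec.entity} ${spec.context}`.trim() : spec.shot;
}

const TRUSTED_CHANNELS = ["Bloomberg", "Reuters", "OpenAI", "DeepMind", "TED", "Lex Fridman"];
const BLOCKED_CHANNELS = ["Placeholder Stock Aggregator"]; // illustrative entry, not a real blacklist
const TRUST_BOOST = 0.5; // illustrative weight

interface SourceCandidate {
  channel: string;
  relevance: number; // 0..1 score from search, assumed to be given
}

// Drop blocked channels, boost trusted ones, and sort best first.
function rankCandidates(candidates: SourceCandidate[]): SourceCandidate[] {
  const blocked = new Set(BLOCKED_CHANNELS.map((c) => c.toLowerCase()));
  const trusted = new Set(TRUSTED_CHANNELS.map((c) => c.toLowerCase()));
  const score = (c: SourceCandidate): number =>
    c.relevance + (trusted.has(c.channel.toLowerCase()) ? TRUST_BOOST : 0);
  return candidates
    .filter((c) => !blocked.has(c.channel.toLowerCase()))
    .sort((a, b) => score(b) - score(a));
}
```

Making the tier part of the type forces a tier 1 query to carry an entity name, which is the point of the rule.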

Only HOOK / TITLE-CARD / CTA / OUTRO scenes use HeroCard (big text on dark background). Body scenes use ClipScene (full-bleed video) with a small LowerThird corner legend (year/chapter marker). Big text fighting a video is the wrong default.

Body scenes are clip-primary, not text-primary.
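The mapping from scene kind to component can be written down as a small lookup. The component names come from the text above; the `SceneKind` union (including the "BODY" label), `SceneLayout`, and `layoutFor` are hypothetical.

```typescript
type SceneKind = "HOOK" | "TITLE-CARD" | "CTA" | "OUTRO" | "BODY";

interface SceneLayout {
  primary: "HeroCard" | "ClipScene";
  overlay: "LowerThird" | null;
}

// Body scenes get full-bleed video plus a small corner legend; all other kinds get the text card.
function layoutFor(kind: SceneKind): SceneLayout {
  if (kind === "BODY") {
    return { primary: "ClipScene", overlay: "LowerThird" };
  }
  return { primary: "HeroCard", overlay: null };
}
```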

The single-pass voiceover module exports per-scene alignment slices (alignment-slices.json) — the character-level timing from ElevenLabs /with-timestamps. The Captions component consumes this to time each phrase to the exact frame the word is spoken. Never let captions drift off the audio; if you see drift, the alignment wiring is broken.

Captions sync to character timestamps, not scene boundaries.
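To make the timing rule concrete, here is a sketch that turns character timings into frame-accurate caption cues. It does not describe the actual Captions component: `TimedChar`, `CaptionCue`, `captionCues`, and the `maxWords` grouping are assumptions for illustration.

```typescript
interface TimedChar {
  char: string;
  startSec: number;
  endSec: number;
}

interface CaptionCue {
  text: string;
  fromFrame: number; // first frame on which the phrase is visible
  toFrame: number;   // frame at which the phrase ends
}

function captionCues(chars: TimedChar[], fps: number, maxWords: number = 3): CaptionCue[] {
  // Step 1: group characters into words, keeping each word's start and end time.
  const words: { text: string; startSec: number; endSec: number }[] = [];
  let current: { text: string; startSec: number; endSec: number } | null = null;
  for (const c of chars) {
    if (c.char.trim() === "") {
      if (current) {
        words.push(current);
        current = null;
      }
      continue;
    }
    if (!current) {
      current = { text: c.char, startSec: c.startSec, endSec: c.endSec };
    } else {
      current.text += c.char;
      current.endSec = c.endSec;
    }
  }
  if (current) words.push(current);

  // Step 2: group words into phrases and convert seconds to frames.
  const cues: CaptionCue[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const group = words.slice(i, i + maxWords);
    cues.push({
      text: group.map((w) => w.text).join(" "),
      fromFrame: Math.floor(group[0].startSec * fps),
      toFrame: Math.ceil(group[group.length - 1].endSec * fps),
    });
  }
  return cues;
}
```

Because every cue is derived from spoken-character times, a caption cannot drift relative to the audio unless the alignment data itself is wrong.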

The EndCard component auto-suppresses any tagline that equals "zavis" to prevent rendering a duplicate wordmark alongside the logo image. The tagline should be an INTEGRATED message that extends the video's argument, not a generic brand shout.

One Zavis wordmark per end card.

The orchestrator calls checkForWatermarks() after each clip downloads, rejecting any clip with persistent static text (OCR via tesseract). If no clean candidate exists for a scene, the orchestrator warns and renders a black fallback — that's a signal to rewrite the query, not to proceed.

Watermark defense runs on every YouTube download.
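The signature of checkForWatermarks() is not shown here, so the following is only a sketch of the idea behind it: text that persists across most sampled frames is treated as a watermark. OCR is assumed to have already run (tesseract in the real pipeline), and `hasPersistentText`, `pickCleanClip`, and the 0.8 threshold are hypothetical.

```typescript
// OCR output for one sampled frame: the strings found in it.
type FrameText = string[];

// True when the same normalized string shows up in at least `minShare` of the frames.
function hasPersistentText(ocrFrames: FrameText[], minShare: number = 0.8): boolean {
  if (ocrFrames.length === 0) return false;
  const counts = new Map<string, number>();
  ocrFrames.forEach((frame) => {
    const unique = Array.from(
      new Set(frame.map((t) => t.trim().toLowerCase()).filter((t) => t.length > 0)),
    );
    unique.forEach((t) => counts.set(t, (counts.get(t) ?? 0) + 1));
  });
  return Array.from(counts.values()).some((n) => n / ocrFrames.length >= minShare);
}

interface ClipCandidate {
  id: string;
  ocrFrames: FrameText[];
}

// Returns the first clean candidate, or null when every candidate is watermarked.
// A null result corresponds to the black-fallback case: rewrite the query.
function pickCleanClip(candidates: ClipCandidate[]): ClipCandidate | null {
  return candidates.find((c) => !hasPersistentText(c.ocrFrames)) ?? null;
}
```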

The auto-rebudget loop extends scene durations to fit the actual measured narration length. A "90-second video" is allowed to run 80-110 seconds as long as pacing is preserved. Never enforce a hard duration cap.

Duration targets are flexible (80-110% of target).
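A minimal sketch of the rebudget idea: scenes grow to fit measured narration and no hard cap is applied. `SceneBudget`, `rebudget`, `totalSec`, and the 0.25-second pad are hypothetical; the real loop may also rebalance pacing.

```typescript
interface SceneBudget {
  sceneId: string;
  plannedSec: number;   // duration from the script
  narrationSec: number; // measured length of the scene's voiceover clip
}

interface SceneDuration {
  sceneId: string;
  durationSec: number;
}

// Extend (never shrink) each scene so its narration fits, plus a small breathing pad.
function rebudget(scenes: SceneBudget[], padSec: number = 0.25): SceneDuration[] {
  return scenes.map((s) => ({
    sceneId: s.sceneId,
    durationSec: Math.max(s.plannedSec, s.narrationSec + padSec),
  }));
}

function totalSec(durations: SceneDuration[]): number {
  return durations.reduce((sum, d) => sum + d.durationSec, 0);
}
```

The total is reported, not clamped; a run that lands far from the target is a signal to revisit the script.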

Skill loading order

When the user says "make a video about X," load skills in this order:

`zavis-master` (global) — Zavis brand context, voice, audience
`zavis-video-os` (this skill) — system orientation
`zavis-taste-dna` — visual aesthetic discipline (X& reference)
`zavis-template-{name}` — the template you'll use (if you know which one)
`zavis-story-writer` — narrative structure
`zavis-hook-engineer` — first-3-seconds craft
`zavis-engagement-engineer` — retention budget, escalation, CTA integration
`zavis-narration-director` — voiceover writing (single-pass discipline)
`zavis-clip-curator` — visual SHOT queries, not topic queries
`zavis-video-reviewer` — what the review loop checks for

If you don't know which template fits, load `zavis-template-router` and ask the user clarifying questions.

How to actually make a video

The end-to-end flow is:

Brief — gather user intent (topic, duration, aspect ratio, intent, references)
Script — produce a ScriptDocument (see packages/shared/src/types.ts). For the first iteration of a new template, hand-author the script directly. Once the template stabilizes, an LLM step produces the script from the brief.
Asset sourcing — for YouTube montage templates, source clips via packages/pipeline/src/youtube/. For talking-head templates (future), generate via Hedra/D-ID.
Voiceover — generate via packages/pipeline/src/voiceover/ (ElevenLabs)
Render — invoke the template with the brief + script + assets
Review — packages/pipeline/src/review/ analyzes the output frame-by-frame and produces a ReviewReport
Iterate — if the report has critical issues, fix them and re-render. Repeat up to 3 times.
Deliver — only show the user the result AFTER review passes (or after escalation to human).
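The Render, Review, Iterate, and Deliver steps can be sketched as a bounded loop. This is an outline under assumptions, not the orchestrator's code: `ReviewReportSketch`, `LoopResult`, and `runReviewLoop` are hypothetical, and the steps are synchronous here for brevity, whereas real render and review steps would be asynchronous.

```typescript
interface ReviewReportSketch {
  criticalIssues: string[];
}

interface LoopResult {
  attempts: number;
  passed: boolean;
  report: ReviewReportSketch;
}

function runReviewLoop(
  render: (attempt: number) => string,               // returns the rendered file path
  review: (videoPath: string) => ReviewReportSketch, // analyzes the render
  patch: (report: ReviewReportSketch) => void,       // patches the script / re-sources scenes
  maxAttempts: number = 3,
): LoopResult {
  let report: ReviewReportSketch = { criticalIssues: [] };
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    report = review(render(attempt));
    if (report.criticalIssues.length === 0) {
      return { attempts: attempt, passed: true, report };
    }
    // Only patch when another attempt is still available.
    if (attempt < maxAttempts) patch(report);
  }
  // Out of attempts: do not deliver silently, escalate to a human with the last report.
  return { attempts: maxAttempts, passed: false, report };
}
```

The caller delivers only when `passed` is true; otherwise the last report travels with the escalation.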

File layout to remember

Brand DNA: packages/remotion-kit/src/brand/index.ts
Reusable components: packages/remotion-kit/src/components/
Motion utilities: packages/remotion-kit/src/motion/index.ts
Templates: packages/templates/<name>/src/
Shared types: packages/shared/src/types.ts — read this any time you touch a script, brief, or render manifest
Pipeline modules: packages/pipeline/src/{youtube,voiceover,script,review,render,orchestrator}/
Run outputs: generations/<run-id>/
Skills: skills/<skill-name>/SKILL.md

What to NEVER do

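Never use CSS animations or transitions in Remotion components (Remotion renders each frame independently of wall-clock time, so time-based CSS animation flickers in the rendered output; drive animation from the frame number instead)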
Never use useFrame() from @react-three/fiber (it causes flickering — use Remotion's useCurrentFrame() instead)
Never bypass the review loop
Never hardcode an aspect ratio in a template
Never ship a video where the narrator and on-screen text say the exact same thing
Never mutate an existing template version — create a new version
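As an illustration of the frame-driven alternative to CSS animation, the helper below computes an opacity purely from a frame number. `fadeInOpacity` is a hypothetical, dependency-free stand-in: inside a component the frame would come from Remotion's useCurrentFrame(), and Remotion's own interpolate() with clamped extrapolation would normally do this job.

```typescript
// Linear 0-to-1 fade, clamped at both ends, driven only by the frame number.
function fadeInOpacity(frame: number, startFrame: number, durationInFrames: number): number {
  if (durationInFrames <= 0) {
    // Degenerate duration: treat the fade as an instant cut.
    return frame >= startFrame ? 1 : 0;
  }
  const progress = (frame - startFrame) / durationInFrames;
  return Math.min(1, Math.max(0, progress));
}
```

The same frame always yields the same opacity, so frames can be rendered in any order without flicker.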