The Script is All You Need: An Agentic Framework for Long-Horizon Dialogue-to-Cinematic Video Generation
- Tencent Hunyuan introduces end-to-end agentic framework translating raw dialogue into coherent, cinematic-quality videos.
- Framework utilizes ScripterAgent for scriptwriting and DirectorAgent for cross-scene video model orchestration.
- Research debuts ScriptBench dataset and Visual-Script Alignment metric to measure long-horizon narrative consistency.
Current AI video generators often excel at short, visually striking clips but falter when asked to sustain a narrative thread over longer durations. To close this "semantic gap" between a creative idea and its cinematic execution, researchers from Tencent Hunyuan have unveiled an agentic framework that treats filmmaking as a multi-step orchestration process rather than a single prompt-to-video task.

The system centers on two components: ScripterAgent and DirectorAgent. ScripterAgent acts as the screenwriter, transforming raw dialogue into detailed, executable cinematic scripts. These scripts serve as the blueprint for DirectorAgent, which orchestrates state-of-the-art video models using a cross-scene continuous generation strategy. The strategy is meant to keep characters, settings, and lighting consistent across multiple scenes, a capability known as long-horizon coherence that standalone models have struggled to deliver.

To support this work, the team developed ScriptBench, a large-scale benchmark designed to evaluate how well AI systems align visual output with complex narrative scripts, and introduced a Visual-Script Alignment (VSA) metric to quantify narrative faithfulness. Their findings highlight a critical trade-off: some models produce high-fidelity "spectacle" but often drift away from the original script’s instructions. The research marks a significant step toward fully automated cinematic production.
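
The two-stage flow described above (a scriptwriting agent feeding an orchestration agent) can be pictured with a toy sketch. This is not the authors' implementation: the names `Shot`, `Clip`, `scripter_agent`, and `director_agent` are hypothetical, no video model is called, and the continuity mechanism shown here (handing each clip a reference to the previous clip's final frame) is an assumption made purely for illustration, since the summary does not say how the cross-scene continuous generation strategy works.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shot:
    """One entry of a hypothetical shot-level script."""
    scene_id: int
    speaker: str
    line: str
    setting: str


@dataclass
class Clip:
    """Stand-in for a generated clip; no real video model is involved."""
    shot: Shot
    conditioned_on: Optional[str]  # assumed: reference to the previous clip's last frame


def scripter_agent(dialogue, setting):
    """Toy scriptwriting stage: turn (speaker, line) pairs into one shot each."""
    return [
        Shot(scene_id=i, speaker=speaker, line=line, setting=setting)
        for i, (speaker, line) in enumerate(dialogue)
    ]


def director_agent(script):
    """Toy orchestration stage: generate clips in order, carrying state forward.

    Each clip after the first is conditioned on a reference to the previous
    clip's final frame (an illustrative assumption, not the paper's method).
    """
    clips = []
    last_frame = None
    for shot in script:
        clips.append(Clip(shot=shot, conditioned_on=last_frame))
        last_frame = f"scene{shot.scene_id}_last_frame"
    return clips


if __name__ == "__main__":
    script = scripter_agent([("Ana", "We leave at dawn."), ("Ben", "Then we pack tonight.")], "harbor")
    for clip in director_agent(script):
        print(clip.shot.scene_id, clip.shot.speaker, clip.conditioned_on)
```

The point of the sketch is the shape of the pipeline: the script is produced once, up front, and the orchestration loop threads state from one scene to the next instead of generating each clip from an isolated prompt.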