SAMA Framework Improves Instruction-Guided Video Editing Precision
- SAMA factorizes video editing into semantic anchoring and motion alignment, balancing faithful edits with preserved motion.
- The framework achieves state-of-the-art performance among open-source models, rivaling top commercial video systems.
- A novel two-stage training scheme enables zero-shot editing without initial paired video-instruction data.
Current AI video editing often struggles with a persistent tug-of-war between following user instructions and preserving the original motion. When a user asks an AI to change a character's clothing, the model often inadvertently alters the character's movement or destabilizes the background. SAMA (Factorized Semantic Anchoring and Motion Alignment) resolves this tension by factorizing the editing task into two distinct, specialized sub-processes that handle appearance and movement independently.
The first stage, Semantic Anchoring, acts as a structural planner for the edit. It identifies key visual anchors across sparse frames to ensure the new content fits the scene's logic without getting lost in complex backgrounds. By predicting semantic tokens, the model establishes a reliable blueprint for the requested modification before any motion is even considered. This provides a stable foundation that prevents the 'warping' effects common in less sophisticated video tools.
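The anchoring idea can be sketched in a few lines. This is an illustrative toy, not SAMA's published implementation: the function names, the evenly spaced frame-selection heuristic, and the discrete token representation are all assumptions made for clarity.

```python
def select_anchor_frames(num_frames: int, num_anchors: int) -> list[int]:
    """Pick evenly spaced sparse frames to serve as semantic anchors.

    A real system would likely pick frames by content; even spacing is a
    stand-in heuristic for this sketch.
    """
    if num_anchors >= num_frames:
        return list(range(num_frames))
    step = (num_frames - 1) / (num_anchors - 1)
    return [round(i * step) for i in range(num_anchors)]


def predict_semantic_tokens(frame_indices: list[int], instruction: str) -> list[dict]:
    """Placeholder for the anchoring model: map (frame, instruction) pairs to
    discrete semantic tokens that act as the edit's blueprint. The hash is a
    dummy stand-in for a learned tokenizer."""
    return [{"frame": i, "token": hash((i, instruction)) % 1024}
            for i in frame_indices]


# Plan an edit over a 120-frame clip using 5 sparse anchors.
anchors = select_anchor_frames(num_frames=120, num_anchors=5)
blueprint = predict_semantic_tokens(anchors, "change the jacket to red")
```

The key design point this sketch captures is that the blueprint is computed over a handful of sparse frames before any motion modeling happens, which is what keeps the edit plan stable against complex backgrounds.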
The second stage, Motion Alignment, focuses entirely on how things move. The researchers pre-train the model on motion-centric tasks such as filling in missing video segments (inpainting) and adjusting playback speed, which lets the AI internalize how objects move naturally in the real world. By separating these functions, SAMA achieves high-fidelity results that rival commercial heavyweights like Kling-Omni while offering the transparency and accessibility of an open-source framework.
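The motion-centric pre-training tasks mentioned above can be illustrated by how training pairs might be constructed. This is a hedged sketch under stated assumptions: the masking ratio, contiguous-span masking, and stride-based speed resampling are choices made for the example, not details from the SAMA paper.

```python
import random


def make_inpainting_sample(frames: list, mask_ratio: float = 0.3, seed: int = 0):
    """Mask a contiguous temporal span; the model learns to fill it back in.

    Returns (input with a hole, full target). The 0.3 ratio is an assumption.
    """
    rng = random.Random(seed)
    span = max(1, int(len(frames) * mask_ratio))
    start = rng.randrange(len(frames) - span + 1)
    masked = [None if start <= i < start + span else f
              for i, f in enumerate(frames)]
    return masked, frames


def make_speed_sample(frames: list, factor: int = 2):
    """Resample playback speed by keeping every `factor`-th frame; recovering
    the original forces the model to learn the motion between frames."""
    return frames[::factor], frames


# Toy 10-frame clip, represented here just by frame indices.
frames = list(range(10))
masked_input, inpaint_target = make_inpainting_sample(frames)
fast_input, speed_target = make_speed_sample(frames)
```

Both tasks are self-supervised: the "labels" are just the original frames, which is why this stage needs no paired video-instruction data and why the two-stage recipe can yield zero-shot editing.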