HERMES: KV Cache as Hierarchical Memory for Efficient Streaming Video Understanding
- HERMES architecture enables real-time streaming video understanding without any additional model training.
- The system achieves a 10x faster Time to First Token and reduces video token usage by up to 68%.
- A hierarchical memory framework keeps GPU memory usage constant regardless of video length, preventing out-of-memory crashes.
Modern Multimodal Large Language Model (MLLM) systems excel at analyzing static video files but often stumble when forced to process live video streams in real time. These models typically struggle to balance high accuracy against the heavy memory demands and low latency required for continuous input.

HERMES addresses this bottleneck by reimagining the model's internal storage, specifically the KV cache (a specialized buffer that stores the keys and values from previous attention computations), as a tiered, hierarchical memory system. Instead of treating all video data equally, HERMES categorizes information into sensory, working, and long-term memory across different layers of the neural network. This mimics human cognitive processing: shallow layers capture immediate events, while deeper layers anchor long-term semantic meaning. Because the approach is training-free, it can be deployed plug-and-play, letting developers enhance existing models without the costly and time-consuming process of retraining them from scratch.

The results are striking: HERMES delivers a 10x improvement in Time to First Token (TTFT), the speed at which a model begins generating its response, and maintains high accuracy even when discarding up to 68% of redundant video tokens. By keeping GPU memory usage constant regardless of video duration, HERMES effectively eliminates the out-of-memory errors that frequently plague long-form video processing, paving the way for more responsive AI assistants in live-stream environments.
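To make the tiered-memory idea concrete, here is a minimal toy sketch (not the paper's actual implementation) of a per-layer KV cache with a fixed budget: shallow "sensory" layers keep only the most recent tokens, middle "working" layers keep salient recent tokens, and deep "long-term" layers retain high-salience anchor tokens. The tier boundaries, budget size, and salience score are all illustrative assumptions; the point is that total entries stay bounded no matter how long the stream runs.

```python
# Toy sketch of a layer-wise tiered KV cache with a constant budget.
# Tier rules, budgets, and the salience score are illustrative assumptions,
# not the actual HERMES policy.

class TieredKVCache:
    def __init__(self, budget, tier):
        self.budget = budget          # max entries kept in this layer
        self.tier = tier              # "sensory" | "working" | "longterm"
        self.entries = []             # list of (timestamp, salience, kv)

    def add(self, timestamp, salience, kv=None):
        self.entries.append((timestamp, salience, kv))
        if len(self.entries) > self.budget:
            self._evict()

    def _evict(self):
        if self.tier == "sensory":
            # Shallow layers: sliding window over the most recent tokens.
            self.entries = self.entries[-self.budget:]
        else:
            # Deeper layers: keep the highest-salience tokens,
            # then restore chronological order for attention.
            self.entries.sort(key=lambda e: e[1])
            self.entries = sorted(self.entries[-self.budget:])

def simulate(num_layers=12, frames=1000, tokens_per_frame=8, budget=64):
    """Stream `frames` of video tokens through every layer's cache and
    report the largest cache size observed (should never exceed budget)."""
    def tier_for(layer):
        if layer < num_layers // 3:
            return "sensory"
        if layer < 2 * num_layers // 3:
            return "working"
        return "longterm"

    layers = [TieredKVCache(budget, tier_for(i)) for i in range(num_layers)]
    for t in range(frames):
        for tok in range(tokens_per_frame):
            # Stand-in salience score; a real system would derive this
            # from attention weights or token redundancy.
            salience = ((t * 37 + tok) % 100) / 100
            for cache in layers:
                cache.add(t, salience)
    return max(len(c.entries) for c in layers)
```

Because eviction runs on every insertion past the budget, cache size (and thus GPU memory in a real system) plateaus at the budget instead of growing with stream length, which is the property that prevents out-of-memory failures on long videos.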