CoPE-VideoLM
Codec Primitives For Efficient Video Language Models
Sayan Deb Sarkar 1,2*
Rémi Pautrat 2
Ondrej Miksik 2
Marc Pollefeys 2,3
Iro Armeni 1
Mahdi Rad 2
Mihai Dusmanu 2
1 Stanford University
2 Microsoft Spatial AI Lab
3 ETH Zurich
* Part of work done at Microsoft · Equal supervision
CoPE-VideoLM Teaser
TL;DR: Replace dense per-frame image embeddings with video codec primitives to reduce TTFT by up to 86% and token usage by up to 93% while maintaining video understanding performance.
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
Token Efficiency vs. Video QA Accuracy

Performance of LLaVA-Video (7B) at different numbers of keyframes per GOP, as well as in its default setup of selecting 64 keyframes regardless of video length, compared to CoPE-VideoLM.

Benchmark Performance

Comprehensive evaluation across multiple video benchmarks, covering four categories: (i) general video question answering, (ii) temporal reasoning and motion understanding, (iii) long-form and instruction-following tasks, and (iv) spatial scene understanding.

Primary comparison with LLaVA-Video-7B (base model) alongside several closely related open-source approaches; all evaluations are conducted using lmms-eval to ensure consistency.

Interactive Runtime Comparison

User chat experience comparing LLaVA-Video and CoPE-VideoLM with 1 FPS video input.

Interactive demo panels: input video, LLaVA-Video chat, CoPE-VideoLM chat, and chat completion progress.
Runtime and Memory
Runtime Comparison

Time-to-first-token (TTFT) and end-to-end latency (E2EL) for generating 64 text tokens at several keyframe densities, compared to the 64-keyframe baseline at 1 FPS.
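
TTFT and E2EL can be measured generically around any streaming decode loop. The helper below is a minimal, model-agnostic Python sketch (not the evaluation harness used for these numbers); the dummy token generator stands in for a VideoLM's streamed output.

import time
from typing import Iterable, Tuple

def measure_latency(token_stream: Iterable) -> Tuple[float, float]:
    """Return (TTFT, E2EL) in seconds for any iterable that yields generated tokens.

    TTFT = time until the first token is produced.
    E2EL = time until the stream is exhausted (all tokens generated).
    """
    start = time.perf_counter()
    ttft = None
    for i, _token in enumerate(token_stream):
        if i == 0:
            ttft = time.perf_counter() - start  # first token arrived
    e2el = time.perf_counter() - start
    return ttft, e2el

# Usage with a dummy generator standing in for a VideoLM decode loop:
if __name__ == "__main__":
    def dummy_tokens(n=64, delay=0.01):
        for _ in range(n):
            time.sleep(delay)
            yield 0

    ttft, e2el = measure_latency(dummy_tokens())
    print(f"TTFT: {ttft:.3f}s, E2EL: {e2el:.3f}s")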

Token Budget vs. Video Length

Theoretical plot showing token efficiency across configurations, enabling scaling to longer videos without exceeding context limits.
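
For intuition, the token budget admits a simple closed form: one set of dense tokens per keyframe plus a small set of Δ-tokens per P-frame. The sketch below uses illustrative placeholder constants (tokens per keyframe, Δ-tokens per P-frame, GOP length) that are assumptions rather than the paper's exact configuration.

def token_budget(video_seconds, fps=1, gop_len=8,
                 tokens_per_keyframe=196, delta_tokens_per_pframe=16):
    """Rough token count for a video tokenized per GOP.

    Assumes one keyframe (I-frame) per GOP and Delta-tokens for every
    remaining P-frame; all constants here are illustrative placeholders.
    """
    n_frames = int(video_seconds * fps)
    n_gops = max(1, n_frames // gop_len)
    n_keyframes = n_gops
    n_pframes = n_frames - n_keyframes
    return n_keyframes * tokens_per_keyframe + n_pframes * delta_tokens_per_pframe

# Dense baseline vs. codec-aware tokenization for a 5-minute video at 1 FPS:
dense = 300 * 196                # every frame encoded as a full image
codec_aware = token_budget(300)  # I-frame tokens + Delta-tokens
print(dense, codec_aware)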

Methodology

Overview of the method

Given a video in its raw codec representation, our framework leverages the GOP structure for efficient, codec-aware tokenization. I-frames are processed by a standard frozen vision encoder (φRGB) to produce dense RGB tokens. P-frames, however, bypass full RGB decoding. Their raw components, motion vectors and residuals, are instead fed into our lightweight Δ-Encoder (φΔ) to generate a small set of highly compact Δ-tokens. The final token stream, an interleaved sequence of I-frame tokens and Δ-tokens, is consumed by the LLM, enabling dense temporal coverage at a fraction of the standard token count and runtime.
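
The control flow above can be summarized in a short sketch. The two encoders below are random-tensor placeholders for a frozen vision encoder and the Δ-Encoder, and the token counts are assumed for illustration; only the interleaving logic (dense I-frame tokens followed by compact Δ-tokens per P-frame, concatenated in temporal order) follows the description.

import torch

def tokenize_gop(i_frame, p_frames, phi_rgb, phi_delta):
    """Tokenize one GOP: dense tokens for the I-frame, Delta-tokens for each P-frame."""
    tokens = [phi_rgb(i_frame)]                # (N_rgb, D) dense image tokens
    for mv, res in p_frames:                   # motion vectors + residuals per P-frame
        tokens.append(phi_delta(mv, res))      # (N_delta, D) with N_delta << N_rgb
    return torch.cat(tokens, dim=0)            # interleaved stream fed to the LLM

# Placeholder encoders: a real system would use a frozen ViT and the Delta-Encoder.
D = 1152
phi_rgb = lambda img: torch.randn(196, D)      # dense I-frame tokens (e.g. a 14x14 patch grid)
phi_delta = lambda mv, res: torch.randn(16, D) # compact Delta-tokens per P-frame

gop_tokens = tokenize_gop(i_frame=None, p_frames=[(None, None)] * 7,
                          phi_rgb=phi_rgb, phi_delta=phi_delta)
print(gop_tokens.shape)                        # (196 + 7*16, D)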

Delta Encoder architecture

Δ-Encoder processes motion vectors and residuals through two lightweight branches designed to extract and compress codec-domain information. The resulting motion and residual tokens are concatenated to form the Δ-tokens used for P-frames, providing an efficient representation that is projected to the RGB token space during pre-training.
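
A minimal PyTorch sketch of such a two-branch encoder is given below. Input shapes (2-channel motion-vector fields, 3-channel residual images), layer choices, and token counts are assumptions made for illustration; this is a structural sketch of the described design, not the released implementation.

import torch
import torch.nn as nn

class DeltaEncoder(nn.Module):
    """Two lightweight branches compress motion vectors and residuals into Delta-tokens."""

    def __init__(self, dim=1152, n_tokens=8, patch=16):
        super().__init__()
        # Branch 1: motion vectors (2 channels: dx, dy) -> patch embeddings.
        self.motion_embed = nn.Conv2d(2, dim, kernel_size=patch, stride=patch)
        # Branch 2: residuals (3 channels) -> patch embeddings.
        self.residual_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.motion_branch = nn.TransformerEncoder(layer, num_layers=2)
        self.residual_branch = nn.TransformerEncoder(layer, num_layers=2)
        # Learned queries that pool each branch down to a few compact tokens.
        self.motion_queries = nn.Parameter(torch.randn(n_tokens, dim))
        self.residual_queries = nn.Parameter(torch.randn(n_tokens, dim))
        self.pool = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # projection toward the RGB token space

    def _branch(self, x, embed, encoder, queries):
        tokens = embed(x).flatten(2).transpose(1, 2)        # (B, N_patches, dim)
        tokens = encoder(tokens)
        q = queries.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, n_tokens, dim)
        pooled, _ = self.pool(q, tokens, tokens)
        return pooled

    def forward(self, motion_vectors, residuals):
        m = self._branch(motion_vectors, self.motion_embed, self.motion_branch, self.motion_queries)
        r = self._branch(residuals, self.residual_embed, self.residual_branch, self.residual_queries)
        return self.proj(torch.cat([m, r], dim=1))          # (B, 2*n_tokens, dim) Delta-tokens

enc = DeltaEncoder()
delta_tokens = enc(torch.randn(1, 2, 224, 224), torch.randn(1, 3, 224, 224))
print(delta_tokens.shape)  # torch.Size([1, 16, 1152])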

Training Paradigm. First, the Δ-Encoder is pre-trained to align its output with the frozen vision encoder: its features are matched to ground-truth image tokens via a patch-wise MSE loss, enforcing spatially consistent alignment across patches. After pre-training, the Δ-Encoder is integrated into the VideoLM for end-to-end fine-tuning; the reference-conditioned branches from pre-training are dropped, so no RGB reference frames are processed for P-frames. This yields a substantial compute and memory reduction while keeping the standard instruction-tuning objective unchanged.
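
The alignment objective itself reduces to a patch-wise MSE between predicted tokens and the frozen encoder's output. The sketch below assumes the pre-training head emits one predicted token per image patch; shapes and the toy training step are placeholders.

import torch
import torch.nn.functional as F

def patchwise_alignment_loss(pred_tokens, target_tokens):
    """Patch-wise MSE between predicted tokens and frozen vision-encoder tokens.

    pred_tokens:   (B, N_patches, D) features predicted from codec primitives
                   (assumed to be produced per patch during pre-training).
    target_tokens: (B, N_patches, D) frozen encoder output for the decoded P-frame.
    """
    return F.mse_loss(pred_tokens, target_tokens.detach())  # target encoder stays frozen

# Illustrative pre-training step (shapes and tensors are placeholders):
pred = torch.randn(4, 196, 1152, requires_grad=True)
target = torch.randn(4, 196, 1152)
loss = patchwise_alignment_loss(pred, target)
loss.backward()
print(loss.item())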

Acknowledgements

We would like to thank (in alphabetical order): Isar Meijer, Kevin Qu and Krzysztof Waraksa from Microsoft for help with training pipeline setup; Tao Sun and Jianhao Zheng from Stanford for feedback at different stages of the project.
Website template inspired by GuideFlow3D.

Citation

If you find our work useful, please consider citing:

@misc{cope_videolm,
  title={CoPE-VideoLM: Codec Primitives For Efficient Video Language Models},
  author={Sayan Deb Sarkar and Rémi Pautrat and Ondrej Miksik and Marc Pollefeys and Iro Armeni and Mahdi Rad and Mihai Dusmanu},
  year={2026},
  eprint={2602.13191},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.13191},
}
