CoPE-VideoLM
Codec Primitives For Efficient Video Language Models
Sayan Deb Sarkar 1,2*
Rémi Pautrat 2
Ondrej Miksik 2
Marc Pollefeys 2,3
Iro Armeni 1
Mahdi Rad 2
Mihai Dusmanu 2
1 Stanford University
2 Microsoft Spatial AI Lab
3 ETH Zurich
* Part of work done at Microsoft · Equal supervision
CoPE-VideoLM Teaser
TL;DR: Replace dense per-frame image embeddings with video codec primitives to reduce TTFT by up to 86% and token usage by up to 93% while maintaining video understanding performance.
Video Language Models (VideoLMs) empower AI systems to understand temporal dynamics in videos. To fit within the maximum context window, current methods rely on keyframe sampling, which can miss both macro-level events and micro-level details due to sparse temporal coverage. Furthermore, processing full images and their tokens for each frame incurs substantial computational overhead. To address these limitations, we propose to leverage video codec primitives (specifically motion vectors and residuals), which natively encode video redundancy and sparsity without requiring expensive full-image encoding for most frames. To this end, we introduce lightweight transformer-based encoders that aggregate codec primitives and align their representations with image encoder embeddings through a pre-training strategy that accelerates convergence during end-to-end fine-tuning. Our approach reduces the time-to-first-token by up to 86% and token usage by up to 93% compared to standard VideoLMs. Moreover, by varying the keyframe and codec primitive densities, we are able to maintain or exceed performance on 14 diverse video understanding benchmarks spanning general question answering, temporal reasoning, long-form understanding, and spatial scene understanding.
Token Efficiency vs. Video QA Accuracy

Performance of LLaVA-Video (7B) at different numbers of keyframes per GOP, as well as in its default setup of selecting 64 keyframes regardless of video length, compared to CoPE-VideoLM.

Benchmark Performance

Comprehensive evaluation across multiple video benchmarks, covering four categories: (i) general video question answering, (ii) temporal reasoning and motion understanding, (iii) long-form and instruction-following tasks, and (iv) spatial scene understanding.

Primary comparison with LLaVA-Video-7B (base model) alongside several closely related open-source approaches; all evaluations are conducted using lmms-eval to ensure consistency.

Interactive Runtime Comparison

User chat experience comparing LLaVA-Video and CoPE-VideoLM with 1 FPS video input.

Interactive demo panels: input video, LLaVA-Video chat, CoPE-VideoLM chat, and chat completion progress.
Runtime and Memory
Runtime Comparison

Time-to-first-token (TTFT) and end-to-end latency (E2EL) for generating 64 text tokens at several keyframe densities, compared to the 64-keyframe baseline at 1 FPS.
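
TTFT and E2EL can be measured generically around any streaming decode loop. The helper below is a minimal, model-agnostic Python sketch (not the evaluation harness used for these numbers); the dummy token generator stands in for a VideoLM's streamed output.

import time
from typing import Iterable, Tuple

def measure_latency(token_stream: Iterable) -> Tuple[float, float]:
    """Return (TTFT, E2EL) in seconds for any iterable that yields generated tokens.

    TTFT = time until the first token is produced.
    E2EL = time until the stream is exhausted (all tokens generated).
    """
    start = time.perf_counter()
    ttft = None
    for i, _token in enumerate(token_stream):
        if i == 0:
            ttft = time.perf_counter() - start  # first token arrived
    e2el = time.perf_counter() - start
    return ttft, e2el

# Usage with a dummy generator standing in for a VideoLM decode loop:
if __name__ == "__main__":
    def dummy_tokens(n=64, delay=0.01):
        for _ in range(n):
            time.sleep(delay)
            yield 0

    ttft, e2el = measure_latency(dummy_tokens())
    print(f"TTFT: {ttft:.3f}s, E2EL: {e2el:.3f}s")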

Token Budget vs. Video Length

Theoretical plot showing token efficiency across configurations, enabling scaling to longer videos without exceeding context limits.
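
For intuition, the token budget admits a simple closed form: one set of dense tokens per keyframe plus a small set of Δ-tokens per P-frame. The sketch below uses illustrative placeholder constants (tokens per keyframe, Δ-tokens per P-frame, GOP length) that are assumptions rather than the paper's exact configuration.

def token_budget(video_seconds, fps=1, gop_len=8,
                 tokens_per_keyframe=196, delta_tokens_per_pframe=16):
    """Rough token count for a video tokenized per GOP.

    Assumes one keyframe (I-frame) per GOP and Delta-tokens for every
    remaining P-frame; all constants here are illustrative placeholders.
    """
    n_frames = int(video_seconds * fps)
    n_gops = max(1, n_frames // gop_len)
    n_keyframes = n_gops
    n_pframes = n_frames - n_keyframes
    return n_keyframes * tokens_per_keyframe + n_pframes * delta_tokens_per_pframe

# Dense baseline vs. codec-aware tokenization for a 5-minute video at 1 FPS:
dense = 300 * 196                # every frame encoded as a full image
codec_aware = token_budget(300)  # I-frame tokens + Delta-tokens
print(dense, codec_aware)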

Methodology

Overview of the method

Given a video in its raw codec representation, our framework leverages the GOP structure for efficient, codec-aware tokenization. I-frames are processed by a standard frozen vision encoder (φRGB) to produce dense RGB tokens. P-frames, however, bypass full RGB decoding. Their raw components, motion vectors and residuals, are instead fed into our lightweight Δ-Encoder (φΔ) to generate a small set of highly compact Δ-tokens. The final token stream, an interleaved sequence of I-frame tokens and Δ-tokens, is consumed by the LLM, enabling dense temporal coverage at a fraction of the standard token count and runtime.
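
The control flow above can be summarized in a short sketch. The two encoders below are random-tensor placeholders for a frozen vision encoder and the Δ-Encoder, and the token counts are assumed for illustration; only the interleaving logic (dense I-frame tokens followed by compact Δ-tokens per P-frame, concatenated in temporal order) follows the description.

import torch

def tokenize_gop(i_frame, p_frames, phi_rgb, phi_delta):
    """Tokenize one GOP: dense tokens for the I-frame, Delta-tokens for each P-frame."""
    tokens = [phi_rgb(i_frame)]                # (N_rgb, D) dense image tokens
    for mv, res in p_frames:                   # motion vectors + residuals per P-frame
        tokens.append(phi_delta(mv, res))      # (N_delta, D) with N_delta << N_rgb
    return torch.cat(tokens, dim=0)            # interleaved stream fed to the LLM

# Placeholder encoders: a real system would use a frozen ViT and the Delta-Encoder.
D = 1152
phi_rgb = lambda img: torch.randn(196, D)      # dense I-frame tokens (e.g. a 14x14 patch grid)
phi_delta = lambda mv, res: torch.randn(16, D) # compact Delta-tokens per P-frame

gop_tokens = tokenize_gop(i_frame=None, p_frames=[(None, None)] * 7,
                          phi_rgb=phi_rgb, phi_delta=phi_delta)
print(gop_tokens.shape)                        # (196 + 7*16, D)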

Delta Encoder architecture

Δ-Encoder processes motion vectors and residuals through two lightweight branches designed to extract and compress codec-domain information. The resulting motion and residual tokens are concatenated to form the Δ-tokens used for P-frames, providing an efficient representation that is projected to the RGB token space during pre-training.
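
A minimal PyTorch sketch of such a two-branch encoder is given below. Input shapes (2-channel motion-vector fields, 3-channel residual images), layer choices, and token counts are assumptions made for illustration; this is a structural sketch of the described design, not the released implementation.

import torch
import torch.nn as nn

class DeltaEncoder(nn.Module):
    """Two lightweight branches compress motion vectors and residuals into Delta-tokens."""

    def __init__(self, dim=1152, n_tokens=8, patch=16):
        super().__init__()
        # Branch 1: motion vectors (2 channels: dx, dy) -> patch embeddings.
        self.motion_embed = nn.Conv2d(2, dim, kernel_size=patch, stride=patch)
        # Branch 2: residuals (3 channels) -> patch embeddings.
        self.residual_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.motion_branch = nn.TransformerEncoder(layer, num_layers=2)
        self.residual_branch = nn.TransformerEncoder(layer, num_layers=2)
        # Learned queries that pool each branch down to a few compact tokens.
        self.motion_queries = nn.Parameter(torch.randn(n_tokens, dim))
        self.residual_queries = nn.Parameter(torch.randn(n_tokens, dim))
        self.pool = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)  # projection toward the RGB token space

    def _branch(self, x, embed, encoder, queries):
        tokens = embed(x).flatten(2).transpose(1, 2)        # (B, N_patches, dim)
        tokens = encoder(tokens)
        q = queries.unsqueeze(0).expand(x.size(0), -1, -1)  # (B, n_tokens, dim)
        pooled, _ = self.pool(q, tokens, tokens)
        return pooled

    def forward(self, motion_vectors, residuals):
        m = self._branch(motion_vectors, self.motion_embed, self.motion_branch, self.motion_queries)
        r = self._branch(residuals, self.residual_embed, self.residual_branch, self.residual_queries)
        return self.proj(torch.cat([m, r], dim=1))          # (B, 2*n_tokens, dim) Delta-tokens

enc = DeltaEncoder()
delta_tokens = enc(torch.randn(1, 2, 224, 224), torch.randn(1, 3, 224, 224))
print(delta_tokens.shape)  # torch.Size([1, 16, 1152])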

Training Paradigm. First, the Δ-Encoder is pre-trained to align its output with the frozen vision encoder: its features are matched to ground-truth image tokens via a patch-wise MSE loss, enforcing spatially consistent alignment across patches. After pre-training, the Δ-Encoder is integrated into the VideoLM for end-to-end fine-tuning; the reference-conditioned branches from pre-training are dropped, so no RGB reference frames are processed for P-frames. This yields a substantial compute and memory reduction while keeping the standard instruction-tuning objective unchanged.
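
The alignment objective itself reduces to a patch-wise MSE between predicted tokens and the frozen encoder's output. The sketch below assumes the pre-training head emits one predicted token per image patch; shapes and the toy training step are placeholders.

import torch
import torch.nn.functional as F

def patchwise_alignment_loss(pred_tokens, target_tokens):
    """Patch-wise MSE between predicted tokens and frozen vision-encoder tokens.

    pred_tokens:   (B, N_patches, D) features predicted from codec primitives
                   (assumed to be produced per patch during pre-training).
    target_tokens: (B, N_patches, D) frozen encoder output for the decoded P-frame.
    """
    return F.mse_loss(pred_tokens, target_tokens.detach())  # target encoder stays frozen

# Illustrative pre-training step (shapes and tensors are placeholders):
pred = torch.randn(4, 196, 1152, requires_grad=True)
target = torch.randn(4, 196, 1152)
loss = patchwise_alignment_loss(pred, target)
loss.backward()
print(loss.item())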

Acknowledgements

We would like to thank (in alphabetical order): Isar Meijer, Kevin Qu and Krzysztof Waraksa from Microsoft for help with training pipeline setup; Tao Sun and Jianhao Zheng from Stanford for feedback at different stages of the project.
Website template inspired by GuideFlow3D.

Citation

If you find our work useful, please consider citing:

@misc{cope_videolm,
  title={CoPE-VideoLM: Codec Primitives For Efficient Video Language Models},
  author={Sayan Deb Sarkar and Rémi Pautrat and Ondrej Miksik and Marc Pollefeys and Iro Armeni and Mahdi Rad and Mihai Dusmanu},
  year={2026},
  eprint={2602.13191},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.13191},
}
