🐍 MambaVF 🐍

MambaVF: State Space Model for Efficient Video Fusion

1 ETH Zürich  2 Xi'an Jiaotong University
3 Nanyang Technological University  4 Tsinghua University

Compared with UniVF (NeurIPS'25), MambaVF attains state-of-the-art performance on VF-Bench while requiring only 7.75% of the parameters and 11.21% of the FLOPs, and runs 2.1× faster.

Abstract

Video fusion is a fundamental technique in various video processing tasks. However, existing video fusion methods rely heavily on optical flow estimation and feature warping, resulting in severe computational overhead and limited scalability. This paper presents MambaVF, an efficient video fusion framework based on state space models (SSMs) that performs temporal modeling without explicit motion estimation. First, by reformulating video fusion as a sequential state update process, MambaVF captures long-range temporal dependencies with linear complexity while significantly reducing computation and memory costs. Second, MambaVF introduces a lightweight SSM-based fusion module that replaces conventional flow-guided alignment with a spatio-temporal bidirectional scanning mechanism, enabling efficient information aggregation across frames. Extensive experiments on multiple benchmarks demonstrate that MambaVF achieves state-of-the-art performance in multi-exposure, multi-focus, infrared-visible, and medical video fusion tasks. Moreover, MambaVF is highly efficient, reducing parameters by up to 92.25% and FLOPs by up to 88.79% while delivering a 2.1× speedup over existing methods.
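To make the idea of flow-free temporal modeling concrete, below is a minimal, illustrative sketch (not the authors' implementation) of a bidirectional state-space scan over per-frame features. The module name `BidirectionalSSMFusion`, the diagonal decay parameterization, and all layer choices are assumptions for illustration only; they show how a linear-time recurrence can aggregate information across frames and fuse two source videos without optical flow.

```python
# Illustrative sketch only (NOT the MambaVF code): a linear-time,
# bidirectional state-space scan over video frame features, assuming
# per-frame features of shape (B, C, H, W) and a diagonal recurrence.
import torch
import torch.nn as nn


class BidirectionalSSMFusion(nn.Module):
    """Hypothetical fusion block: merges two modalities frame-wise,
    scans the sequence forward and backward with a per-channel decay,
    then combines the two passes. All hyperparameters are assumed."""

    def __init__(self, channels: int):
        super().__init__()
        # Per-channel decay in (0, 1), parameterized via a sigmoid.
        self.decay_logit = nn.Parameter(torch.zeros(channels))
        self.in_proj = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.out_proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def _scan(self, frames):  # frames: (T, B, C, H, W)
        decay = torch.sigmoid(self.decay_logit).view(1, -1, 1, 1)
        state = torch.zeros_like(frames[0])
        outputs = []
        for x in frames:  # linear in T, no motion estimation
            state = decay * state + (1.0 - decay) * x
            outputs.append(state)
        return torch.stack(outputs)

    def forward(self, feats_a, feats_b):  # each (T, B, C, H, W)
        # Fuse the two source modalities per frame, then scan both
        # temporal directions and merge the passes.
        fused = self.in_proj(
            torch.cat([feats_a, feats_b], dim=2).flatten(0, 1)
        ).view_as(feats_a)
        fwd = self._scan(fused)
        bwd = self._scan(fused.flip(0)).flip(0)
        out = self.out_proj(torch.cat([fwd, bwd], dim=2).flatten(0, 1))
        return out.view_as(feats_a)


if __name__ == "__main__":
    T, B, C, H, W = 8, 1, 16, 32, 32
    a, b = torch.randn(T, B, C, H, W), torch.randn(T, B, C, H, W)
    print(BidirectionalSSMFusion(C)(a, b).shape)  # torch.Size([8, 1, 16, 32, 32])
```

Because the recurrence touches each frame once per direction, cost grows linearly with sequence length, in contrast to flow-guided alignment, which must estimate and apply a dense warp between frame pairs.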

Video Fusion Gallery


How MambaVF Works

Method

Detailed illustration of our MambaVF architecture.

Qualitative Video Comparison


Quantitative Comparison with Other Methods

Quantitative evaluation results for the Multi-Exposure Fusion (540p) and Multi-Focus Fusion (480p) tasks. The red and blue highlights indicate the highest and second-highest scores, respectively.

Comparison table 1

Quantitative evaluation results for the Infrared-Visible Fusion and Medical Video Fusion tasks. The red and blue highlights indicate the highest and second-highest scores, respectively.

Comparison table 2

Refer to the main paper linked above for more details on qualitative, quantitative, and ablation studies.

Qualitative Image Comparison

VF-Bench Gallery Image

Multi-Exposure Video Fusion Comparison

Citation


  @article{zhao2026mambavfstatespacemodel,
    title={MambaVF: State Space Model for Efficient Video Fusion},
    author={Zixiang Zhao and Yukun Cui and Lilun Deng and Haowen Bai and Haotong Qin and Tao Feng and Konrad Schindler},
    journal={arXiv preprint arXiv:2602.06017},
    year={2026},
  }