| Title: |
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding |
| Authors: |
Fu, Chaoyou; Yuan, Haozhi; Dong, Yuhao; Zhang, Yi-Fan; Shen, Yunhang; Hu, Xiaoxing; Li, Xueying; Su, Jinsen; Long, Chengwu; Xie, Xiaoyao; Xie, Yongkang; Zheng, Xiawu; Yang, Xue; Cao, Haoyu; Wu, Yunsheng; Liu, Ziwei; Sun, Xing; Shan, Caifeng; He, Ran |
| Publication Year: |
2026 |
| Collection: |
ArXiv.org (Cornell University Library) |
| Subject Terms: |
Computer Vision and Pattern Recognition |
| Description: |
With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, through temporal dynamics modeling, to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs. Homepage: https://video-mme-v2.netlify.app/
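To make the group-based scoring idea concrete, the following is a minimal sketch of an all-or-nothing group rule: a group of related questions earns credit only if every question in it is answered correctly, which penalizes isolated lucky guesses. The abstract does not specify the exact non-linear formula Video-MME-v2 uses, so the grouping granularity and the names here (QuestionResult, group_based_accuracy) are hypothetical illustrations, not the paper's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class QuestionResult:
    group_id: str   # related questions (e.g., one reasoning chain) share a group
    correct: bool   # per-question correctness


def group_based_accuracy(results: list[QuestionResult]) -> float:
    """Score each group as 1 only if every question in it is correct.

    Unlike per-question accuracy, a single miss zeroes out the whole group,
    so fragmented or guess-based correctness earns no credit.
    """
    groups: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        groups[r.group_id].append(r.correct)
    if not groups:
        return 0.0
    return sum(all(answers) for answers in groups.values()) / len(groups)


# Example: one fully correct group, one group with a single miss.
results = [
    QuestionResult("vid1-chain1", True),
    QuestionResult("vid1-chain1", True),
    QuestionResult("vid2-chain1", True),
    QuestionResult("vid2-chain1", False),
]
print(group_based_accuracy(results))  # 0.5, vs. 0.75 under per-question accuracy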
| Document Type: |
text |
| Language: |
English
| Relation: |
http://arxiv.org/abs/2604.05015 |
| Availability: |
http://arxiv.org/abs/2604.05015 |
| Accession Number: |
edsbas.EBE659F |
| Database: |
BASE |