Video-MME v2

Towards the Next Stage in Video Understanding Evaluation


Introduction


In 2024, we released the Video-MME benchmark, which focuses on cross-modal video understanding under different video lengths and has become a standard evaluation set for many frontier models, including Gemini and GPT. However, as model capabilities have rapidly improved, we observe that scores on Video-MME and similar benchmarks are gradually saturating, while there remains a clear gap between user experience and leaderboard performance. This suggests that current evaluation paradigms still fall short of fully capturing models' true video understanding abilities, and that the community urgently needs a new, more comprehensive benchmark to measure capability more reliably. After nearly a year of continuous refinement, we now introduce Video-MME-v2. At this key moment in the evolution of video understanding, we aim to share our thinking on the next generation of evaluation paradigms and to help drive higher-quality technical iteration for video understanding models.

🚀 First Principles

Video-MME-v2 redesigns its evaluation system starting from two fundamental questions:

  • What exactly should video understanding evaluate?
  • How can we evaluate it sufficiently and reliably?

Our answer is reflected in two core designs:

1. Progressive Multi-Level evaluation dimensions
Multi-point information aggregation → temporal understanding → temporal complex reasoning: a three-level progression from Finding Information, to Modeling Time, to Cross-Temporal Reasoning.

2. Grouped Non-Linear evaluation mechanism
Questions are organized into groups along Capability Consistency and Reasoning Coherence. Each group contains 4 interrelated questions, and we adopt a non-linear scoring scheme where scores depend not only on individual accuracy but also on overall consistency and the completeness of the reasoning loop within each group.

1. Progressive Multi-Level Evaluation Dimensions

Traditional video benchmarks often focus on specific tasks, making it difficult to form a complete, systematic picture of model capabilities. In Video-MME-v2, we combine prior research with real-world deployment experience and design three progressive evaluation dimensions to systematically decompose models' video understanding capabilities:

Level 1

Multi-Point Information Aggregation

Examines how well models retrieve, extract, and integrate multimodal cues (frames, audio, subtitles) that are scattered throughout a video. This forms the foundational perception layer of video understanding.

Level 2
Level 1 + Temporal

Temporal Understanding

On top of multi-point aggregation, this level focuses on dynamic evolution and causal relations in videos, requiring models to accurately capture state changes, action sequences, and event logic — i.e., strong temporal associations.

Level 3
Level 2 + Complex Reasoning

Temporal Complex Reasoning

Building on temporal understanding, this level targets higher-order reasoning. Models must combine multimodal temporal information with external priors such as world knowledge and social commonsense to perform multi-step reasoning and handle highly complex real-world scenarios.

2. Grouped Non-Linear Evaluation Mechanism

Many prior benchmarks adopt a Scatter-Shot evaluation paradigm where each question is scored independently, often ignoring dependencies between questions and weakening evaluation efficiency and robustness. To address this, we introduce a group-based evaluation mechanism with two key task group types:

🎯

Capability Consistency

Grouping strategy
This group type probes the true mastery of a specific capability. For a single capability, we construct 4 questions at different levels within a group around different facets of that capability. For example, in a counting scenario, we may evaluate: the number of players in a single frame, the number of action types within a short clip, the number of times a given action appears across clips, and the total number of segments in the full video. By moving from local to global and from static to cross-temporal, we more effectively distinguish between Lucky Guesses and genuinely robust capability.

Scoring method
We count the number of correctly answered questions N within each group and define the group score as (N/4)². This non-linear gain suppresses random benefits from isolated correct answers while rewarding stable, group-wide consistency.
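As a sketch, the (N/4)² rule can be written directly (the function name is ours; the input is a list of per-question correctness flags):

```python
def consistency_group_score(correct: list[bool]) -> float:
    """Non-linear score for one Capability Consistency group.

    N is the number of correct answers among the group's 4 questions;
    the group scores (N/4)**2, so an isolated hit earns little while
    answering the whole group correctly is rewarded disproportionately.
    """
    assert len(correct) == 4, "each group contains exactly 4 questions"
    n = sum(correct)
    return (n / 4) ** 2

# 2 of 4 correct yields 0.25 rather than the linear 0.5:
print(consistency_group_score([True, False, True, False]))  # 0.25
```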

🔗

Reasoning Coherence

Grouping strategy
This group type emphasizes the tightness of the reasoning chain. Instead of only checking the final answer, we introduce intermediate supervision by constructing 4 progressive questions around key logical nodes in the same reasoning chain. For example, when a character in a video Fakes Death to deceive others, we evaluate whether the model can: identify direct visual cues of apparent death, detect anomalous details that deviate from normal patterns, infer the purpose behind the fake death, and, under these constraints, arrive at the final conclusion. This layered Clue Localization → Anomaly Verification → Goal Inference → Conclusion Closure process helps distinguish whether a model truly relies on video evidence for coherent reasoning.

Scoring method
On top of non-linear scoring, we introduce a First-Error Truncation mechanism: during evaluation, we treat the first error within a group as the boundary of effective reasoning and only count the number of consecutive correct answers before that point. This suppresses Pseudo-Correct answers derived from incorrect premises and enforces stricter validity of the reasoning chain.
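Under the same (N/4)² gain, First-Error Truncation counts only the streak of correct answers before the first mistake. A minimal sketch (function name ours):

```python
def coherence_group_score(correct: list[bool]) -> float:
    """Non-linear score with First-Error Truncation for one
    Reasoning Coherence group of 4 progressive questions.

    Only the consecutive correct answers *before* the first error
    count as effective reasoning; the (N/4)**2 gain is then applied.
    """
    assert len(correct) == 4, "each group contains exactly 4 questions"
    n = 0
    for ok in correct:
        if not ok:
            break  # first error: later answers rest on a broken premise
        n += 1
    return (n / 4) ** 2

# Correct Q3/Q4 after a wrong Q2 earn nothing extra:
print(coherence_group_score([True, False, True, True]))  # 0.0625
```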

3. Data Annotation and Quality Control

The evaluation system of Video-MME-v2 places extremely high demands on annotation quality. To this end, we established a comprehensive and rigorous data annotation and quality control process. After investing 3,300 human-hours, we ultimately collected 800 videos, each paired with 4 questions and 8 answer options per question.

🎯

Data Annotation

We assembled an annotation team of 12 human experts responsible for video data collection and full-pipeline annotation, with strict quality control through standardized procedures + cross-validation: from video selection and question design to option construction, each stage strictly adheres to unified standards. Three rounds of cross-review and revision were conducted, with each sample carefully refined—especially with regard to capability consistency and logical coherence constraints. Including time spent incorporating two rounds of quality-control feedback, approximately 2,200 human-hours were invested.

🔗

Data Quality Control

We assembled an independent quality-control team of 50 experts to minimize subjective bias from human annotators and the influence of large model priors on data quality. Each video sample was independently reviewed by at least 2 quality-control personnel, who conducted two comprehensive rounds of review covering video content, question wording, option design, and results from Gemini-3-Pro text-only testing. The team then collaborated with the annotation team to review and confirm all revisions. In addition, the quality-control team completed all dataset questions themselves to calculate human accuracy rates. This process consumed approximately 1,100 human-hours.


Leaderboard


📊 Model Performance Rankings
Table definitions: w. sub = with subtitles; wo sub = without subtitles. For Omni models, wo sub denotes performance without audio (muted) and w. sub denotes results with audio input (non-text subtitles); likewise, for the Human Expert baseline, w. sub indicates that human evaluators had access to the audio. Avg Acc = average accuracy across the five dimensions. Non-Lin Score = group-level metric from the Grouped Non-Linear evaluation. Due to API limitations, Gemini models are tested on extracted video frames compressed to 60M, while GPT-5 is tested with 50 input frames. We argue that the group-based non-linear score (Non-Lin Score) reflects true model performance more faithfully than the traditional per-question average accuracy (Avg Acc).
Model | Frames | Non-Lin Score | Level 1 | Level 2 | Level 3 | Capability Consistency | Reasoning Coherence | Avg Acc
(each score column is reported as w. sub / wo sub)

Dataset Examples


Each level contains examples from two evaluation dimensions.

Consistency — 4 questions test the same capability across different dimensions and granularities
Coherence — 4 questions form a logical chain, each depending on prior answers
L1

Level 1: Information Retrieval & Aggregation


Basic visual perception — object recognition, counting, attribute judgment, scene understanding


L2

Level 2: Temporal Understanding


Action recognition, state change tracking, temporal ordering, dynamic perception


L3

Level 3: Complex Reasoning


Causal reasoning, intention understanding, physical world reasoning, social behavior inference


Dataset and Annotation


To comprehensively evaluate the video understanding capabilities of multimodal large models, Video-MME-v2 adopts a progressive, hierarchical capability taxonomy. Rather than flat task stacking, we divide the capability dimensions into three cognitive stages: basic information retrieval and aggregation; dynamic capture of temporal sequences, actions, and changes; and, finally, complex reasoning about plots, the physical world, and social behavior.

📊 Three-Level Capability Hierarchy

Fig. 1 below shows the three levels and their capability dimensions. Each level contains several categories; each category groups related sub-dimensions as follows.

Level 1 — Retrieval & Aggregation

Frame-Only (3 types) — Visual Recognition (object/attribute/scene); Basic Counting; Numerical Calculation (rates, comparisons).
Frames & Audio (4 types) — Cross-Modal Semantic Consistency (tone–mood alignment); Audio-Guided Visual Description; Vision-Guided Audio Description; Visual-Audio Collaborative Reasoning.

Level 2 — Level 1 + Temporal Understanding

Action & Motion (5 types) — Fine-Grained Action Recognition; Repetitive Action Counting; Temporal Action Localization; Motion Trajectory Estimation; Motion Properties Analysis.
Order (3 types) — Object Appearance Ordering; Event Sequence Ordering; Temporal Periodicity Detection.
Change (3 types) — Entity Existence Change Detection; Entity Attribute Change Detection; Scene Transformation Detection.
Temporal Reasoning (2 types) — Causal Reasoning (why/what-if); Future Event Prediction.

Level 3 — Level 2 + Complex Reasoning

Complex Plot Comprehension (4 types) — Narrative Turning Point Detection; Narrative Cloze Inference; Symbolic / Metaphorical Interpretation; High-Order Narrative Deconstruction.
Video-Based Knowledge Acquisition (2 types) — Professional Knowledge Acquisition; General Skills Acquisition.
Social Behavior Analysis (3 types) — Individual Social Cognition; Dyadic Interaction Dynamics; Collective Dynamics Analysis.
Physical World Reasoning (4 types) — Entity Persistence Tracking; Spatial Understanding; Counterfactual Reasoning; Counterintuitive Comprehension.
Video-MME-v2 Three-Level Capability Hierarchy
Figure 1: Video-MME-v2 three-level capability hierarchy — capability dimensions and their distribution across Level 1 (Retrieval & Aggregation), Level 2 (Temporal Understanding), and Level 3 (Complex Reasoning).

🔧 Annotation Pipeline

✓ Fully Human Expert-Led · Rigorous Multi-Stage Quality Assurance
1 Video Selection
Video source: Over 80% of the videos are YouTube uploads from 2025 onward, ensuring temporal freshness and reducing contamination risk.
Diversity coverage: A taxonomy with 4 top-level categories and 31 subcategories guarantees broad coverage of topics and visual styles.
Content quality control: View-count thresholds (about 85% of videos exceed 10,000 views) filter out low-quality, noisy samples at the source.
Manual decontamination: Classic films and flagship videos from top creators are manually removed to minimize evaluation bias from model memorization effects.
2 Question Design
Question group annotation: A team of 12 human experts annotates question groups, ensuring broad coverage for capability consistency and sufficient depth for reasoning coherence.
Rigor check: During annotation, Gemini-3-Pro is used in real time to test and verify question wording and answer settings, ensuring precision and robustness.
3 Option Design
High-confusion options: 8-option multiple-choice design improves discriminative power and evaluation strength.
Strong distractor design: Beyond regular distractors, each question includes at least one additional, carefully crafted distractor targeted around the correct answer and refined by human annotators to test fine-grained discrimination.
4 Quality Control
a. Text-Only Check

Use Gemini-3-Pro in text-only mode as a baseline to remove questions that can be solved without visual information, strictly controlling language priors and ensuring the necessity of multimodal perception.

b. Cross-Review

Conduct multiple rounds of cross review: each question is reviewed in three rounds by different annotators to eliminate semantic ambiguity, patch potential flaws and refine option design.

c. External Validation

Introduce 50 independent reviewers who did not participate in the original annotation; each video question is checked in at least two fine-grained passes to reduce subjective bias.

d. Re-Validation

Establish a revision–retest loop: any modified question is re-run under the text-only baseline and independently re-validated to ensure each round of changes yields controlled quality improvements.
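Step (a) above can be sketched as a simple filter over repeated text-only trials. The repeated-trial setup and the 0.5 threshold are our illustrative assumptions, not details from the actual pipeline:

```python
def passes_text_only_check(text_only_correct: list[bool],
                           max_solve_rate: float = 0.5) -> bool:
    """Keep a question only if a text-only model (no frames or audio)
    fails to solve it reliably, i.e. the answer cannot be recovered
    from language priors alone.

    `text_only_correct` holds outcomes of repeated text-only trials;
    the 0.5 threshold is an assumption for illustration.
    """
    solve_rate = sum(text_only_correct) / len(text_only_correct)
    return solve_rate <= max_solve_rate

# Solved in 3 of 3 text-only trials -> leaks through language priors:
print(passes_text_only_check([True, True, True]))  # False
```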

📈 Data Statistics

1
Video length: As shown in Fig. 2, the average length is about 10.4 minutes. 99% are 20 minutes or shorter; 53% are 10 minutes or shorter. The distribution is relatively uniform and diverse.
2
Video category: As shown in Fig. 3, video types cover 4 major categories and 31 subcategories, including Sports & Competition (e.g. basketball, soccer), Lifestyle & Entertainment (e.g. variety shows, digital), Art & Literature (e.g. film, comics), and Knowledge & Education (e.g. AI, humanities & history).
3
Video publication time: As shown in Fig. 4, more than 80% of the videos were published after 2025, with nearly 40% published after October 2025.
4
Video view count: As shown in Fig. 5, the mean and median view counts are 4.83 million and 355 thousand respectively. 84.3% of videos exceed 10,000 views, and 94.4% exceed 1,000 views.
5
Questions & answers: As shown in Fig. 2, the average length of questions and answers increases from Q1 to Q4. This aligns with our Reasoning Coherence design: later questions in the sequence are harder and typically require more contextual description and more detailed answers.
6
Options distribution: As shown in Fig. 2, the mean word count across the 8 options is highly consistent.
Video-MME-v2 Data Statistics
Figure 2: Distributions of video length, question length, and option length
Video-MME-v2 video category distribution (sunburst)
Figure 3: Video category distribution (4 top-level categories, 31 subcategories)
Video-MME-v2 Video Publication Time Distribution
Figure 4: Video publication time distribution
Video-MME-v2 Video View Count Distribution
Figure 5: Video view count distribution


Experiments & Analysis


We conducted systematic evaluation on a number of leading video multimodal large models; results are shown in the leaderboard above. Building on this, we summarize several representative experimental findings below.


📊 Advantage of Non-Linear Scoring

We compare two metrics: group-based non-linear score (Non-Lin Score) and per-question average accuracy (Avg Acc).


1. Within-model comparison: Gemini-3-Pro and Gemini-3-Flash reach average accuracy of 66.1% and 61.1% respectively—well above passing level. Under our group-based non-linear scoring, however, their scores are 49.4% and 42.5%. This shows that even SOTA models rarely answer all related questions in a group correctly. By explicitly leveraging the group structure, our nonlinear scoring is less sensitive to isolated correct predictions and instead emphasizes consistency across related queries, thereby providing a more faithful assessment of true model capability.


2. Cross-model comparison: The ratio Non-Lin Score/Avg Acc reflects how much a model drops from single-question correctness to group-stable correctness, and thus indicates robustness. For example, Gemini-3-Pro achieves a ratio of approximately 75%, followed by Doubao-Seed-2.0-Pro-260215 at around 72%, and InternVL3-5-241B-A28B-Instruct at about 56%, while the smaller model LLaVA-Video-7B achieves only around 40%. A lower ratio means the model more often gets only some questions right within a group—weaker stability and robustness. Non-linear scoring thus better reflects true capability and reveals model robustness.


Avg Acc vs. Non-Lin Score — selected models
Model | Avg Acc (%) | Non-Lin Score (group-level metric) | Non-Lin Score / Avg Acc
Gemini-3-Pro 66.1% 49.4% ~75%
Gemini-3-Flash 61.1% 42.5% ~70%
Doubao-Seed-2.0-Pro-260215 60.5% 43.3% ~72%
Qwen3.5-397B-A17B-Think (512) 55.9% 39.1% ~70%
MiMo-v2-Omni 56.1% 38.6% ~69%
Qwen3.5-397B-A17B-Think (64) 48.9% 30.6% ~63%
InternVL3-5-241B-A28B-Instruct 41.4% 23.1% ~56%
LLaVA-Video-7B 24.0% 9.7% ~40%
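The robustness ratio in the last column can be reproduced directly from the two metrics (values copied from the table above):

```python
# Robustness ratio = Non-Lin Score / Avg Acc.
# A lower ratio means the model more often gets only part of a group right.
results = {  # model: (Avg Acc %, Non-Lin Score %), from the table above
    "Gemini-3-Pro": (66.1, 49.4),
    "InternVL3-5-241B-A28B-Instruct": (41.4, 23.1),
    "LLaVA-Video-7B": (24.0, 9.7),
}
for model, (avg_acc, non_lin) in results.items():
    print(f"{model}: {non_lin / avg_acc:.0%}")  # ~75%, ~56%, ~40%
```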

📉 Capability Consistency and Reasoning Coherence Analysis

1. Overall Q1→Q4 accuracy trend: We report overall accuracy from Q1 to Q4 for five models in both Capability Consistency and Reasoning Coherence question groups, and analyze from both data and model perspectives.


(1) Data perspective: In Capability Consistency groups, accuracy across Q1–Q4 is similar for all models, indicating that difficulty is well balanced across question indices. In Reasoning Coherence groups, accuracy consistently decreases from Q1 to Q4 for all models, indicating that difficulty increases along the sequence, consistent with our design.


(2) Model perspective: In Capability Consistency groups, Gemini-3-Pro and GPT-5 exhibit only marginal fluctuations in accuracy from Q1 to Q4, indicating stronger stability. In Reasoning Coherence groups, stronger models decline smoothly in accuracy from Q1 to Q4 as question difficulty increases, whereas weaker models show more irregular patterns. One possible explanation is that stronger models are more sensitive to incremental changes in question difficulty, degrading more uniformly as reasoning depth increases, while weaker models exhibit higher stochasticity and thus unstable performance on progressively harder questions.


2. Mean and variance in Capability Consistency groups: We further report mean and variance of overall Q1–Q4 accuracy for eight models in Capability Consistency groups. As shown in the rightmost plot, the horizontal and vertical axes represent average performance and result stability respectively, jointly characterizing both performance and robustness of video understanding. We have the following observations:


(1) Gemini-3-Pro achieves the highest mean accuracy, indicating the strongest overall performance; at the same time, it exhibits the smallest variance, demonstrating the best stability. GPT-5 and Kimi-K2.5 follow closely in stability, also showing strong robustness.
(2) Overall, commercial models generally outperform open-source models, yet all models still remain substantially below human performance, indicating a significant gap to close.


Q1-Q4 accuracy and mean/variance in capability consistency
Figure 6: Overall Q1→Q4 accuracy in Capability Consistency and Reasoning Coherence groups, and mean & variance in the Capability Consistency group.

🧠 Effect of Thinking Mode on Video-MME-v2

We compare how instruction-tuned baseline models change after enabling Thinking mode, under both with- and without-subtitle conditions; the figure shows each model's Instruct baseline together with the gain or regression from switching to the stronger reasoning configuration. For Gemini-3-Flash, due to model constraints, the comparison is between Minimal_Thinking and the standard Thinking configuration, both at 1 fps.


1. Text modality helps unlock reasoning: Overall, enabling Thinking with subtitles tends to yield more stable gains, while without subtitles the benefit is often smaller or can even turn negative. For example, Qwen3.5-122B-A10B gains +3.8 with no subtitle and +5.8 with subtitle on overall score. This suggests that explicit semantic cues from text make it easier for the model’s Thinking ability to be fully utilized.


2. Current Thinking mode can also cause regression: Beyond the general pattern that subtitles help Thinking, we still observe clear regressions in some settings, especially without subtitles. For example, Qwen3-VL-8B loses 0.6 points on the overall score without subtitles, while KimiVL-16B drops by 3.3 points both without and with subtitles; on Level 3, where Thinking matters most, it drops further, by 4.0/3.9 points (without/with subtitles). This shows that the current Thinking mechanism in video MLLMs does not always bring positive benefit on video understanding tasks and still has substantial room for improvement.


Effect of Thinking with and without subtitle by level and overall
Figure 7: Score by level and overall under Thinking mode (with/without subtitle)

🧠 Overall Model Performance Analysis on Video-MME-v2

Around the three-level task framework of Video-MME-v2, we abstract three key underlying capabilities: omni-modal information aggregation (C1), long-range temporal / long-context understanding (C2), and complex reasoning (C3). Based on these, we profile and group existing models and compare their scores.


Model Capability Profiles and Scores


Model Name                     | Non-Lin Score (w. sub) | Capabilities
Gemini-3-Pro                   | 49.4                   | C1 C2 C3
Gemini-3-Flash                 | 42.5                   | C1 C2 C3
Qwen3.5-397B-A17B-Think (512)  | 39.1                   | C2 C3
MiMo-v2-Omni                   | 38.6                   | C1 C2 C3
Qwen3.5-397B-A17B-Think (64)   | 30.6                   | C2 C3
Qwen3-VL-235B-A22B-Think       | 28.1                   | C2 C3
Qwen3-Omni-30B-A3B-Think       | 19.5                   | C1 C2 C3
Qwen3-Omni-30B-A3B-Instruct    | 17.1                   | C1 C2
Capability Legend:
  • C1 Omni-modal — omni-modal information aggregation
  • C2 Long-context — long-range temporal / long-context understanding (ability to process extended inputs)
  • C3 Thinking — complex reasoning

1. Synergy of core capabilities: Scores tend to correlate with how complete the capability profile is: models with C1+C2+C3 together generally perform better. For example, Gemini-3-Pro has a relatively complete profile and scores 49.4; Gemini-3-Flash follows with 42.5. This suggests that in complex video understanding, the synergy of omni-modal perception, long-horizon temporal modeling, and deep reasoning is an important factor for overall performance.


2. Model scale and capability compensation: Besides capability combination, results show that scale has a significant effect on base performance: larger parameter count can partly compensate for missing capabilities. For example, Qwen3.5-397B-A17B-Think mainly has long-context ability (C2) and complex reasoning (C3), yet reaches 39.1—higher than MiMo-v2-Omni (38.6), which has omni-modal capability (C1) and complex reasoning (C3). This shows that when scale increases substantially, the model’s overall capability can partly offset the impact of missing individual capabilities on the score.


3. Impact of frame count on performance: For the same model, increasing frame count can significantly improve performance. For example, Qwen3.5-397B-A17B-Think with 512 frames scores 39.1, while with 64 frames it scores only 30.6—an 8.5-point improvement. This highlights the importance of long-context processing capability (C2) for complex video understanding tasks.


🎯 Capability Radar

We compare selected models on the capability dimensions defined by Video-MME-v2. From the radar chart, three main observations can be drawn:


1. Significant gain from audio: On the Frames & Audio dimension, Gemini-3-Pro shows a relatively high peak, indicating stronger cross-modal alignment and integration when processing synchronized visual and audio information. In contrast, models that rely more on visual frames (e.g. GPT-5 and the Qwen family) are relatively weaker, reflecting differences in deep multimodal fusion.


2. Long-horizon temporal reasoning advantage: On capabilities such as Order and Video-Based Knowledge Acquisition, which rely on long-horizon temporal modeling and cross-segment reasoning, Gemini-3-Pro also maintains a large lead, indicating more robust long-context and temporal modeling and a stronger ability to integrate and reason over information scattered across segments of long videos.


3. Clear room for improvement: Overall, even as a SOTA model, Gemini-3-Pro still has significant room for improvement on each dimension. In particular, on Action & Motion and Physical World Reasoning, scores remain below 30, reflecting that current models still need to strengthen fine-grained action semantics and physical-world reasoning.


Level 1: Retrieval & Aggregation
Level 2: Temporal Understanding
Level 3: Complex Reasoning
Figure 8: Capability radar (Click on the model names in the legend to show/hide specific models)

Citation


@article{videommev2_2026,
  title={Video-MME-v2: Evaluating True Understanding and Reasoning in Video MLLMs},
  author={Video-MME Team},
  journal={arXiv preprint},
  year={2026},
  url={https://github.com/Video-MME/Video-MME-v2}
}