In 2024, we released the Video-MME benchmark, which focuses on cross-modal video understanding across different video lengths and has become a standard evaluation set for many frontier models, including Gemini and GPT. However, as model capabilities have rapidly improved, scores on Video-MME and similar benchmarks are gradually saturating, while a clear gap remains between user experience and leaderboard performance. This suggests that current evaluation paradigms still fall short of capturing models' true video understanding abilities, and that the community urgently needs a new, more comprehensive benchmark to measure capability more reliably. After nearly a year of continuous refinement, we now introduce Video-MME-v2. At this key moment in the evolution of video understanding, we aim to share our thinking on the next generation of evaluation paradigms and to help drive higher-quality technical iteration for video understanding models.
Video-MME-v2 redesigns its evaluation system starting from two fundamental questions:
- What exactly should video understanding evaluate?
- How can we evaluate it sufficiently and reliably?
Our answer is reflected in two core designs:
1. Progressive Multi-Level evaluation dimensions
Multi-point information aggregation → temporal understanding → temporal complex reasoning: a three-level progression from finding information, to modeling time, to cross-temporal reasoning.
2. Grouped Non-Linear evaluation mechanism
Questions are organized into groups of two types, Capability Consistency and Reasoning Coherence. Each group contains 4 interrelated questions, and we adopt a non-linear scoring scheme in which a group's score depends not only on per-question accuracy but also on overall consistency and the completeness of the reasoning loop within the group.
1. Progressive Multi-Level Evaluation Dimensions
Traditional video benchmarks often focus on specific tasks, making it difficult to form a complete, systematic picture of model capabilities. In Video-MME-v2, we combine prior research with real-world deployment experience and design three progressive evaluation dimensions to systematically decompose models' video understanding capabilities:
Multi-Point Information Aggregation
Examines how well models retrieve, extract, and integrate multimodal cues (frames, audio, subtitles) that are scattered throughout a video. This forms the foundational perception layer of video understanding.
Level 1 + Temporal
Temporal Understanding
On top of multi-point aggregation, this level focuses on dynamic evolution and causal relations in videos, requiring models to accurately capture state changes, action sequences, and event logic, i.e., information with strong temporal associations.
Level 2 + Complex Reasoning
Temporal Complex Reasoning
Building on temporal understanding, this level targets higher-order reasoning. Models must combine multimodal temporal information with external priors such as world knowledge and social commonsense to perform multi-step reasoning and handle highly complex real-world scenarios.
2. Grouped Non-Linear Evaluation Mechanism
Many prior benchmarks adopt a scatter-shot evaluation paradigm in which each question is scored independently. This ignores dependencies between questions and weakens both evaluation efficiency and robustness. To address this, we introduce a group-based evaluation mechanism with two key types of task groups:
Capability Consistency
Grouping strategy
This group type probes a model's true mastery of a specific capability. For a single capability, we construct a group of 4 questions at different levels, each targeting a different facet of that capability. For example, in a counting scenario, we may evaluate: the number of players in a single frame, the number of action types within a short clip, the number of times a given action appears across clips, and the total number of segments in the full video. By moving from local to global and from static to cross-temporal, we more effectively distinguish lucky guesses from genuinely robust capability.
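As an illustration, the counting group above could be laid out as follows. This is a hypothetical schema; the field names and question phrasings are ours, not the benchmark's actual annotation format.

```python
# Hypothetical layout of one capability-consistency group; field names
# and question wordings are illustrative, not the benchmark's schema.
counting_group = {
    "capability": "counting",
    "questions": [
        {"scope": "single_frame", "ask": "How many players are visible in this frame?"},
        {"scope": "short_clip",   "ask": "How many distinct action types occur in this clip?"},
        {"scope": "cross_clip",   "ask": "How many times does the given action appear across clips?"},
        {"scope": "full_video",   "ask": "How many segments does the full video contain?"},
    ],
}

# Every group contains exactly 4 interrelated questions,
# widening from a single frame out to the full video.
assert len(counting_group["questions"]) == 4
```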
Scoring method
We count the number of correctly answered questions N within each group and define the group score as (N/4)². This non-linear gain suppresses random benefits from isolated correct answers while rewarding stable, group-wide consistency.
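A minimal sketch of this scoring rule (the function name and boolean-list input format are our own illustration):

```python
def group_score(correct_flags):
    """Non-linear group score (N/4)^2, where N is the number of
    correctly answered questions in a 4-question group."""
    assert len(correct_flags) == 4, "each group contains exactly 4 questions"
    n = sum(correct_flags)
    return (n / 4) ** 2

# Answering 2 of 4 questions yields 0.25 rather than a linear 0.5,
# so isolated lucky hits contribute little:
print(group_score([True, True, False, False]))  # 0.25
# Only a fully consistent group earns the full score:
print(group_score([True, True, True, True]))    # 1.0
```

The quadratic gain is what makes the scheme non-linear: each additional correct answer in a group is worth more than the previous one.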
Reasoning Coherence
Grouping strategy
This group type emphasizes the tightness of the reasoning chain. Instead of only checking the final answer, we introduce intermediate supervision by constructing 4 progressive questions around key logical nodes in the same reasoning chain. For example, when a character in a video fakes death to deceive others, we evaluate whether the model can: identify direct visual cues of apparent death, detect anomalous details that deviate from normal patterns, infer the purpose behind the faked death, and, under these constraints, arrive at the final conclusion. This layered clue localization → anomaly verification → goal inference → conclusion closure process helps distinguish whether a model truly relies on video evidence for coherent reasoning.
Scoring method
On top of non-linear scoring, we introduce a first-error truncation mechanism: the first error within a group is treated as the boundary of effective reasoning, and only the consecutive correct answers before that point are counted. This suppresses pseudo-correct answers derived from incorrect premises and enforces stricter validity of the reasoning chain.
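The truncated variant can be sketched as follows, assuming the group's 4 questions are ordered along the reasoning chain (function name and input format are again illustrative):

```python
def coherence_group_score(correct_flags):
    """First-error truncation: count only the consecutive correct
    answers before the first error, then apply the (N/4)^2 gain."""
    assert len(correct_flags) == 4, "each group contains exactly 4 questions"
    n = 0
    for ok in correct_flags:
        if not ok:
            break  # first error ends the effective reasoning chain
        n += 1
    return (n / 4) ** 2

# A correct final answer reached after a broken chain earns nothing extra:
print(coherence_group_score([True, False, True, True]))  # 0.0625 (N = 1)
# An error at the very first node zeroes out the whole group:
print(coherence_group_score([False, True, True, True]))  # 0.0
```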
3. Data Annotation and Quality Control
The evaluation system of Video-MME-v2 places extremely high demands on annotation quality. To this end, we established a comprehensive and rigorous data annotation and quality control process. After investing 3,300 human-hours, we ultimately collected 800 videos, each paired with 4 questions and 8 answer options per question.
Data Annotation
We assembled an annotation team of 12 human experts responsible for video data collection and full-pipeline annotation, with strict quality control through standardized procedures and cross-checking: from video selection and question design to option construction, each stage strictly adheres to unified standards. Three rounds of cross-review and revision were conducted, with each sample carefully refined, especially with regard to the capability-consistency and logical-coherence constraints. Including the time spent incorporating two rounds of quality-control feedback, approximately 2,200 human-hours were invested.
Data Quality Control
We assembled an independent quality-control team of 50 experts to minimize subjective bias from human annotators and the influence of large-model priors on data quality. Each video sample was independently reviewed by at least 2 quality-control personnel, who conducted two comprehensive rounds of review covering video content, question wording, option design, and results from Gemini-3-Pro text-only testing. The team then worked with the annotation team to review and confirm all revisions. In addition, the quality-control team answered every question in the dataset themselves to measure human accuracy. This process consumed approximately 1,100 human-hours.