Video-MME v2

Towards the Next Stage in Video Understanding Evaluation


Introduction


In 2024, we released the Video-MME benchmark, which focuses on cross-modal video understanding under different video lengths and has become a standard evaluation set for many frontier models, including Gemini and GPT. However, as model capabilities have rapidly improved, we observe that scores on Video-MME and similar benchmarks are gradually saturating, while there remains a clear gap between user experience and leaderboard performance. This suggests that current evaluation paradigms still fall short of fully capturing models' true video understanding abilities, and that the community urgently needs a new, more comprehensive benchmark to measure capability more reliably. After nearly a year of continuous refinement, we now introduce Video-MME-v2. At this key moment in the evolution of video understanding, we aim to share our thinking on the next generation of evaluation paradigms and to help drive higher-quality technical iteration for video understanding models.

🚀 First Principles

Video-MME-v2 redesigns its evaluation system starting from two fundamental questions:

  • What exactly should video understanding evaluate?
  • How can we evaluate it sufficiently and reliably?

Our answer is reflected in two core designs:

1. Progressive Multi-Level evaluation dimensions
Multi-point information aggregation → temporal understanding → temporal complex reasoning: a three-level progression from Finding Information, to Modeling Time, to Cross-Temporal Reasoning.

2. Grouped Non-Linear evaluation mechanism
Questions are organized into groups along Capability Consistency and Reasoning Coherence. Each group contains 4 interrelated questions, and we adopt a non-linear scoring scheme where scores depend not only on individual accuracy but also on overall consistency and the completeness of the reasoning loop within each group.

1. Progressive Multi-Level Evaluation Dimensions

Traditional video benchmarks often focus on specific tasks, making it difficult to form a complete, systematic picture of model capabilities. In Video-MME-v2, we combine prior research with real-world deployment experience and design three progressive evaluation dimensions to systematically decompose models' video understanding capabilities:

Level 1

Multi-Point Information Aggregation

Examines how well models retrieve, extract, and integrate multimodal cues (frames, audio, subtitles) that are scattered throughout a video. This forms the foundational perception layer of video understanding.

Level 2
Level 1 + Temporal

Temporal Understanding

On top of multi-point aggregation, this level focuses on dynamic evolution and causal relations in videos, requiring models to accurately capture state changes, action sequences, and event logic — i.e., strong temporal associations.

Level 3
Level 2 + Complex Reasoning

Temporal Complex Reasoning

Building on temporal understanding, this level targets higher-order reasoning. Models must combine multimodal temporal information with external priors such as world knowledge and social commonsense to perform multi-step reasoning and handle highly complex real-world scenarios.

2. Grouped Non-Linear Evaluation Mechanism

Many prior benchmarks adopt a Scatter-Shot evaluation paradigm where each question is scored independently, often ignoring dependencies between questions and weakening evaluation efficiency and robustness. To address this, we introduce a group-based evaluation mechanism with two key task group types:

🎯

Capability Consistency

Grouping strategy
This group type probes the true mastery of a specific capability. For a single capability, we construct 4 questions at different levels within a group around different facets of that capability. For example, in a counting scenario, we may evaluate: the number of players in a single frame, the number of action types within a short clip, the number of times a given action appears across clips, and the total number of segments in the full video. By moving from local to global and from static to cross-temporal, we more effectively distinguish between Lucky Guesses and genuinely robust capability.

Scoring method
We count the number of correctly answered questions N within each group and define the group score as (N/4)². This non-linear gain suppresses random benefits from isolated correct answers while rewarding stable, group-wide consistency.
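As a sketch, the (N/4)² rule can be written directly (the function name is ours; the input is a list of per-question correctness flags):

```python
def consistency_group_score(correct: list[bool]) -> float:
    """Non-linear score for one Capability Consistency group.

    N is the number of correct answers among the group's 4 questions;
    the group scores (N/4)**2, so an isolated hit earns little while
    answering the whole group correctly is rewarded disproportionately.
    """
    assert len(correct) == 4, "each group contains exactly 4 questions"
    n = sum(correct)
    return (n / 4) ** 2

# 2 of 4 correct yields 0.25 rather than the linear 0.5:
print(consistency_group_score([True, False, True, False]))  # 0.25
```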

🔗

Reasoning Coherence

Grouping strategy
This group type emphasizes the tightness of the reasoning chain. Instead of only checking the final answer, we introduce intermediate supervision by constructing 4 progressive questions around key logical nodes in the same reasoning chain. For example, when a character in a video Fakes Death to deceive others, we evaluate whether the model can: identify direct visual cues of apparent death, detect anomalous details that deviate from normal patterns, infer the purpose behind the fake death, and, under these constraints, arrive at the final conclusion. This layered Clue Localization → Anomaly Verification → Goal Inference → Conclusion Closure process helps distinguish whether a model truly relies on video evidence for coherent reasoning.

Scoring method
On top of non-linear scoring, we introduce a First-Error Truncation mechanism: during evaluation, we treat the first error within a group as the boundary of effective reasoning and only count the number of consecutive correct answers before that point. This suppresses Pseudo-Correct answers derived from incorrect premises and enforces stricter validity of the reasoning chain.
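Under the same (N/4)² gain, First-Error Truncation counts only the streak of correct answers before the first mistake. A minimal sketch (function name ours):

```python
def coherence_group_score(correct: list[bool]) -> float:
    """Non-linear score with First-Error Truncation for one
    Reasoning Coherence group of 4 progressive questions.

    Only the consecutive correct answers *before* the first error
    count as effective reasoning; the (N/4)**2 gain is then applied.
    """
    assert len(correct) == 4, "each group contains exactly 4 questions"
    n = 0
    for ok in correct:
        if not ok:
            break  # first error: later answers rest on a broken premise
        n += 1
    return (n / 4) ** 2

# Correct Q3/Q4 after a wrong Q2 earn nothing extra:
print(coherence_group_score([True, False, True, True]))  # 0.0625
```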

3. Data Annotation and Quality Control

The evaluation system of Video-MME-v2 places extremely high demands on annotation quality. To this end, we established a comprehensive and rigorous data annotation and quality control process. After investing 3,300 human-hours, we ultimately collected 800 videos, each paired with 4 questions and 8 answer options per question.

🎯

Data Annotation

We assembled an annotation team of 12 human experts responsible for video data collection and full-pipeline annotation, with strict quality control through standardized procedures + cross-validation: from video selection and question design to option construction, each stage strictly adheres to unified standards. Three rounds of cross-review and revision were conducted, with each sample carefully refined—especially with regard to capability consistency and logical coherence constraints. Including time spent incorporating two rounds of quality-control feedback, approximately 2,200 human-hours were invested.

🔗

Data Quality Control

We assembled an independent quality-control team of 50 experts to minimize subjective bias from human annotators and the influence of large model priors on data quality. Each video sample was independently reviewed by at least 2 quality-control personnel, who conducted two comprehensive rounds of review covering video content, question wording, option design, and results from Gemini-3-Pro text-only testing. The team then collaborated with the annotation team to review and confirm all revisions. In addition, the quality-control team completed all dataset questions themselves to calculate human accuracy rates. This process consumed approximately 1,100 human-hours.


Leaderboard


📊 Model Performance Rankings
Table definitions: w. sub = with subtitles; wo sub = without subtitles. For Omni models, wo sub denotes performance without audio (muted) and w. sub denotes results with audio input (non-text subtitles); likewise, for the Human Expert baseline, w. sub indicates that human evaluators had access to the audio. Avg Acc = average accuracy across the five dimensions. Non-Lin Score = group-level metric from the Grouped Non-Linear evaluation. Due to API limitations, Gemini models are tested on extracted video frames compressed to 60M, while GPT-5 is tested with 50 input frames. We argue that the group-based non-linear score (Non-Lin Score) reflects true model performance more faithfully than the traditional per-question average accuracy (Avg Acc).
Model | Frames | Non-Lin Score | Level 1 | Level 2 | Level 3 | Capability Consistency | Reasoning Coherence | Avg Acc
(each score column is reported as w. sub / wo sub)

Dataset Examples


Each level contains examples from two evaluation dimensions.

Consistency — 4 questions test the same capability across different dimensions and granularities
Coherence — 4 questions form a logical chain, each depending on prior answers
L1

Level 1: Information Retrieval & Aggregation


Basic visual perception — object recognition, counting, attribute judgment, scene understanding


L2

Level 2: Temporal Understanding


Action recognition, state change tracking, temporal ordering, dynamic perception


L3

Level 3: Complex Reasoning


Causal reasoning, intention understanding, physical world reasoning, social behavior inference


Dataset and Annotation


To comprehensively evaluate the video understanding capabilities of multimodal large models, Video-MME-v2 adopts a progressive, hierarchical capability taxonomy. Rather than flat task stacking, we divide the capability dimensions into three cognitive stages: basic information retrieval and aggregation; dynamic capture of temporal sequences, actions, and changes; and, finally, complex reasoning about plots, the physical world, and social behavior.

📊 Three-Level Capability Hierarchy

Fig. 1 below shows the three levels and their capability dimensions. Each level contains several categories; each category groups related sub-dimensions as follows.

Level 1 — Retrieval & Aggregation

Frame-Only (3 types) — Visual Recognition (object/attribute/scene); Basic Counting; Numerical Calculation (rates, comparisons).
Frames & Audio (4 types) — Cross-Modal Semantic Consistency (tone–mood alignment); Audio-Guided Visual Description; Vision-Guided Audio Description; Visual-Audio Collaborative Reasoning.

Level 2 — Level 1 + Temporal Understanding

Action & Motion (5 types) — Fine-Grained Action Recognition; Repetitive Action Counting; Temporal Action Localization; Motion Trajectory Estimation; Motion Properties Analysis.
Order (3 types) — Object Appearance Ordering; Event Sequence Ordering; Temporal Periodicity Detection.
Change (3 types) — Entity Existence Change Detection; Entity Attribute Change Detection; Scene Transformation Detection.
Temporal Reasoning (2 types) — Causal Reasoning (why/what-if); Future Event Prediction.

Level 3 — Level 2 + Complex Reasoning

Complex Plot Comprehension (4 types) — Narrative Turning Point Detection; Narrative Cloze Inference; Symbolic / Metaphorical Interpretation; High-Order Narrative Deconstruction.
Video-Based Knowledge Acquisition (2 types) — Professional Knowledge Acquisition; General Skills Acquisition.
Social Behavior Analysis (3 types) — Individual Social Cognition; Dyadic Interaction Dynamics; Collective Dynamics Analysis.
Physical World Reasoning (4 types) — Entity Persistence Tracking; Spatial Understanding; Counterfactual Reasoning; Counterintuitive Comprehension.
Video-MME-v2 Three-Level Capability Hierarchy
Figure 1: Video-MME-v2 three-level capability hierarchy — capability dimensions and their distribution across Level 1 (Retrieval & Aggregation), Level 2 (Temporal Understanding), and Level 3 (Complex Reasoning).

🔧 Annotation Pipeline

✓ Fully Human Expert-Led · Rigorous Multi-Stage Quality Assurance
1 Video Selection
Video source: Over 80% of the videos are YouTube uploads from 2025 onward, ensuring temporal freshness and reducing contamination risk.
Diversity coverage: A taxonomy with 4 top-level categories and 31 subcategories guarantees broad coverage of topics and visual styles.
Content quality control: View-count thresholds (about 85% of videos exceed 10,000 views) filter out low-quality, noisy samples at the source.
Manual decontamination: Classic films and flagship videos from top creators are manually removed to minimize evaluation bias from model memorization effects.
2 Question Design
Question group annotation: A team of 12 human experts annotates question groups, ensuring broad coverage for capability consistency and sufficient depth for reasoning coherence.
Rigor check: During annotation, Gemini-3-Pro is used in real time to test and verify question wording and answer settings, ensuring precision and robustness.
3 Option Design
High-confusion options: 8-option multiple-choice design improves discriminative power and evaluation strength.
Strong distractor design: Beyond regular distractors, each question includes at least one additional, carefully crafted distractor targeted around the correct answer and refined by human annotators to test fine-grained discrimination.
4 Quality Control
a. Text-Only Check

Use Gemini-3-Pro in text-only mode as a baseline to remove questions that can be solved without visual information, strictly controlling language priors and ensuring the necessity of multimodal perception.

b. Cross-Review

Conduct multiple rounds of cross review: each question is reviewed in three rounds by different annotators to eliminate semantic ambiguity, patch potential flaws and refine option design.

c. External Validation

Introduce 50 independent reviewers who did not participate in the original annotation; each video question is checked in at least two fine-grained passes to reduce subjective bias.

d. Re-Validation

Establish a revision–retest loop: any modified question is re-run under the text-only baseline and independently re-validated to ensure each round of changes yields controlled quality improvements.
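Step (a) above can be sketched as a simple filter over repeated text-only trials. The repeated-trial setup and the 0.5 threshold are our illustrative assumptions, not details from the actual pipeline:

```python
def passes_text_only_check(text_only_correct: list[bool],
                           max_solve_rate: float = 0.5) -> bool:
    """Keep a question only if a text-only model (no frames or audio)
    fails to solve it reliably, i.e. the answer cannot be recovered
    from language priors alone.

    `text_only_correct` holds outcomes of repeated text-only trials;
    the 0.5 threshold is an assumption for illustration.
    """
    solve_rate = sum(text_only_correct) / len(text_only_correct)
    return solve_rate <= max_solve_rate

# Solved in 3 of 3 text-only trials -> leaks through language priors:
print(passes_text_only_check([True, True, True]))  # False
```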

📈 Data Statistics

1
Video length: As shown in Fig. 2, the average length is about 10.4 minutes. 99% are 20 minutes or shorter; 53% are 10 minutes or shorter. The distribution is relatively uniform and diverse.
2
Video category: As shown in Fig. 3, video types cover 4 major categories and 31 subcategories, including Sports & Competition (e.g. basketball, soccer), Lifestyle & Entertainment (e.g. variety shows, digital), Art & Literature (e.g. film, comics), and Knowledge & Education (e.g. AI, humanities & history).
3
Video publication time: As shown in Fig. 4, more than 80% of the videos were published after 2025, with nearly 40% published after October 2025.
4
Video view count: As shown in Fig. 5, the mean and median view counts are 4.83 million and 355 thousand respectively. 84.3% of videos exceed 10,000 views, and 94.4% exceed 1,000 views.
5
Questions & answers: As shown in Fig. 2, the average length of questions and answers increases from Q1 to Q4. This aligns with our Reasoning Coherence design: later questions in the sequence are harder and typically require more contextual description and more detailed answers.
6
Options distribution: As shown in Fig. 2, the mean word count across the 8 options is highly consistent.
Video-MME-v2 Data Statistics
Figure 2: Distributions of video length, question length, and option length
Video-MME-v2 video category distribution (sunburst)
Figure 3: Video category distribution (4 top-level categories, 31 subcategories)
Video-MME-v2 Video Publication Time Distribution
Figure 4: Video publication time distribution
Video-MME-v2 Video View Count Distribution
Figure 5: Video view count distribution


Experiments & Analysis


We conducted systematic evaluation on a number of leading video multimodal large models; results are shown in the leaderboard above. Building on this, we summarize several representative experimental findings below.


📊 Advantage of Non-Linear Scoring

We compare two metrics: group-based non-linear score (Non-Lin Score) and per-question average accuracy (Avg Acc).


1. Within-model comparison: Gemini-3-Pro and Gemini-3-Flash reach average accuracy of 66.1% and 61.1% respectively—well above passing level. Under our group-based non-linear scoring, however, their scores are 49.4% and 42.5%. This shows that even SOTA models rarely answer all related questions in a group correctly. By explicitly leveraging the group structure, our nonlinear scoring is less sensitive to isolated correct predictions and instead emphasizes consistency across related queries, thereby providing a more faithful assessment of true model capability.


2. Cross-model comparison: The ratio Non-Lin Score/Avg Acc reflects how much a model drops from single-question correctness to group-stable correctness, and thus indicates robustness. For example, Gemini-3-Pro achieves a ratio of approximately 75%, followed by Doubao-Seed-2.0-Pro-260215 at around 72%, and InternVL3-5-241B-A28B-Instruct at about 56%, while the smaller model LLaVA-Video-7B achieves only around 40%. A lower ratio means the model more often gets only some questions right within a group—weaker stability and robustness. Non-linear scoring thus better reflects true capability and reveals model robustness.


Avg Acc vs. Non-Lin Score — selected models
Model | Avg Acc (%) | Non-Lin Score (group-level metric) | Non-Lin Score / Avg Acc
Gemini-3-Pro 66.1% 49.4% ~75%
Gemini-3-Flash 61.1% 42.5% ~70%
Doubao-Seed-2.0-Pro-260215 60.5% 43.3% ~72%
Qwen3.5-397B-A17B-Think (512) 55.9% 39.1% ~70%
MiMo-v2-Omni 56.1% 38.6% ~69%
Qwen3.5-397B-A17B-Think (64) 48.9% 30.6% ~63%
InternVL3-5-241B-A28B-Instruct 41.4% 23.1% ~56%
LLaVA-Video-7B 24.0% 9.7% ~40%
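The robustness ratio in the last column can be reproduced directly from the two metrics (values copied from the table above):

```python
# Robustness ratio = Non-Lin Score / Avg Acc.
# A lower ratio means the model more often gets only part of a group right.
results = {  # model: (Avg Acc %, Non-Lin Score %), from the table above
    "Gemini-3-Pro": (66.1, 49.4),
    "InternVL3-5-241B-A28B-Instruct": (41.4, 23.1),
    "LLaVA-Video-7B": (24.0, 9.7),
}
for model, (avg_acc, non_lin) in results.items():
    print(f"{model}: {non_lin / avg_acc:.0%}")  # ~75%, ~56%, ~40%
```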

📉 Capability Consistency and Reasoning Coherence Analysis

1. Overall Q1→Q4 accuracy trend: We report overall accuracy from Q1 to Q4 for five models in both Capability Consistency and Reasoning Coherence question groups, and analyze from both data and model perspectives.


(1) Data perspective: In Capability Consistency groups, accuracy across Q1–Q4 is similar for all models, indicating that difficulty is well balanced across question indices. In Reasoning Coherence groups, accuracy consistently decreases from Q1 to Q4 for all models, indicating that difficulty increases along the sequence, consistent with our design.


(2) Model perspective: In Capability Consistency groups, Gemini-3-Pro and GPT-5 exhibit only marginal fluctuations in accuracy from Q1 to Q4, indicating stronger stability. In Reasoning Coherence groups, stronger models decline smoothly in accuracy from Q1 to Q4 as question difficulty increases, whereas weaker models show more irregular patterns. One possible explanation is that stronger models are more sensitive to incremental changes in question difficulty, degrading more uniformly as reasoning depth increases, while weaker models exhibit higher stochasticity and thus unstable performance on progressively harder questions.


2. Mean and variance in Capability Consistency groups: We further report mean and variance of overall Q1–Q4 accuracy for eight models in Capability Consistency groups. As shown in the rightmost plot, the horizontal and vertical axes represent average performance and result stability respectively, jointly characterizing both performance and robustness of video understanding. We have the following observations:


(1) Gemini-3-Pro achieves the highest mean accuracy, indicating the strongest overall performance; at the same time, it exhibits the smallest variance, demonstrating the best stability. GPT-5 and Kimi-K2.5 follow closely in stability, also showing strong robustness.
(2) Overall, commercial models generally outperform open-source models, yet all models still remain substantially below human performance, indicating a significant gap to close.


Q1-Q4 accuracy and mean/variance in capability consistency
Figure 6: Overall Q1→Q4 accuracy in Capability Consistency and Reasoning Coherence groups, and mean & variance in the Capability Consistency group.

🧠 Effect of Thinking Mode on Video-MME-v2

We compare how instruction-tuned baseline models change after enabling Thinking mode, under both with- and without-subtitle conditions; the figure shows each model's Instruct baseline together with the gain or regression from switching to the stronger reasoning configuration. For Gemini-3-Flash, due to model constraints, the comparison is between Minimal_Thinking and the standard Thinking configuration, both at 1 fps.


1. Text modality helps unlock reasoning: Overall, enabling Thinking with subtitles tends to yield more stable gains, while without subtitles the benefit is often smaller or can even turn negative. For example, Qwen3.5-122B-A10B gains +3.8 with no subtitle and +5.8 with subtitle on overall score. This suggests that explicit semantic cues from text make it easier for the model’s Thinking ability to be fully utilized.


2. Current Thinking mode can also cause regression: Beyond the general pattern that subtitles help Thinking, we still observe clear regressions in some settings, especially without subtitles. For example, Qwen3-VL-8B loses 0.6 points on the overall score without subtitles, while KimiVL-16B drops by 3.3 points both without and with subtitles; on Level 3, where Thinking matters most, it drops further, by 4.0/3.9 points (without/with subtitles). This shows that the current Thinking mechanism in video MLLMs does not always bring positive benefit on video understanding tasks and still has substantial room for improvement.


Effect of Thinking with and without subtitle by level and overall
Figure 7: Score by level and overall under Thinking mode (with/without subtitle)

🧠 Overall Model Performance Analysis on Video-MME-v2

Around the three-level task framework of Video-MME-v2, we abstract three key underlying capabilities: omni-modal information aggregation (C1), long-range temporal / long-context understanding (C2), and complex reasoning (C3). Based on these, we profile and group existing models and compare their scores.


Model Capability Profiles and Scores


Model Name                     | Non-Lin Score (w. sub) | Capabilities
Gemini-3-Pro                   | 49.4                   | C1 C2 C3
Gemini-3-Flash                 | 42.5                   | C1 C2 C3
Qwen3.5-397B-A17B-Think (512)  | 39.1                   | C2 C3
MiMo-v2-Omni                   | 38.6                   | C1 C2 C3
Qwen3.5-397B-A17B-Think (64)   | 30.6                   | C2 C3
Qwen3-VL-235B-A22B-Think       | 28.1                   | C2 C3
Qwen3-Omni-30B-A3B-Think       | 19.5                   | C1 C2 C3
Qwen3-Omni-30B-A3B-Instruct    | 17.1                   | C1 C2
Capability Legend:
  • C1 Omni-modal — omni-modal information aggregation
  • C2 Long-context — long-range temporal / long-context understanding (ability to process extended inputs)
  • C3 Thinking — complex reasoning

1. Synergy of core capabilities: Scores tend to correlate with how complete the capability profile is: models with C1+C2+C3 together generally perform better. For example, Gemini-3-Pro has a relatively complete profile and scores 49.4; Gemini-3-Flash follows with 42.5. This suggests that in complex video understanding, the synergy of omni-modal perception, long-horizon temporal modeling, and deep reasoning is an important factor for overall performance.


2. Model scale and capability compensation: Besides capability combination, results show that scale has a significant effect on base performance: larger parameter count can partly compensate for missing capabilities. For example, Qwen3.5-397B-A17B-Think mainly has long-context ability (C2) and complex reasoning (C3), yet reaches 39.1—higher than MiMo-v2-Omni (38.6), which has omni-modal capability (C1) and complex reasoning (C3). This shows that when scale increases substantially, the model’s overall capability can partly offset the impact of missing individual capabilities on the score.


3. Impact of frame count on performance: For the same model, increasing frame count can significantly improve performance. For example, Qwen3.5-397B-A17B-Think with 512 frames scores 39.1, while with 64 frames it scores only 30.6—an 8.5-point improvement. This highlights the importance of long-context processing capability (C2) for complex video understanding tasks.


🎯 Capability Radar

We compare selected models on the capability dimensions defined by Video-MME-v2. From the radar chart, three main observations can be drawn:


1. Significant gain from audio: On the Frames & Audio dimension, Gemini-3-Pro shows a relatively high peak, indicating stronger cross-modal alignment and integration when processing synchronized visual and audio information. In contrast, models that rely more on visual frames (e.g. GPT-5 and the Qwen family) are relatively weaker, reflecting differences in deep multimodal fusion.


2. Long-horizon temporal reasoning advantage: On capabilities such as Order and Video-Based Knowledge Acquisition, which rely on long-horizon temporal modeling and cross-segment reasoning, Gemini-3-Pro also maintains a large lead, indicating more robust long-context and temporal modeling and a stronger ability to integrate and reason over information scattered across segments of long videos.


3. Clear room for improvement: Overall, even as a SOTA model, Gemini-3-Pro still has significant room for improvement on each dimension. In particular, on Action & Motion and Physical World Reasoning, scores remain below 30, reflecting that current models still need to strengthen fine-grained action semantics and physical-world reasoning.


Level 1: Retrieval & Aggregation
Level 2: Temporal Understanding
Level 3: Complex Reasoning
Figure 8: Capability radar (Click on the model names in the legend to show/hide specific models)

Citation


@article{videommev2_2026,
  title={Video-MME-v2: Evaluating True Understanding and Reasoning in Video MLLMs},
  author={Video-MME Team},
  journal={arXiv preprint},
  year={2026},
  url={https://github.com/Video-MME/Video-MME-v2}
}