PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction

1Peking University 2Mohamed bin Zayed University of Artificial Intelligence 3National University of Singapore 4University of North Carolina at Chapel Hill 5University of Science and Technology of China
6Manifold.AI 7Cornell University 8Hong Kong Polytechnic University 9Westlake University 10City University of Hong Kong
*Equal contribution · Corresponding author
PhysicsMind Dataset

PhysicsMind covers three canonical mechanics scenarios: Center of Mass, Lever Equilibrium, and Newton's First Law. Each is realized with diverse real tabletop experiments and controlled 2D simulations.

Abstract

Modern foundational Multimodal Large Language Models (MLLMs) and video world models have advanced significantly in mathematical, common-sense, and visual reasoning, but their grasp of the underlying physics remains underexplored. Existing benchmarks that attempt to measure physical understanding either rely on synthetic visual question-answering templates or focus on perceptual video quality, which is tangential to how well generated video abides by physical laws.

To address this gap, we introduce PhysicsMind, a unified benchmark spanning both real and simulated environments that evaluates both physical reasoning about, and physically plausible generation under, three canonical mechanics laws: center of mass, lever equilibrium, and Newton's first law.

PhysicsMind comprises two main tasks:

  • VQA tasks: Testing whether models can reason about and determine physical quantities from images or short videos.
  • Video Generation (VG) tasks: Evaluating whether predicted motion trajectories obey the same center-of-mass, torque, and inertial constraints as the ground truth.

A broad range of recent models is evaluated on PhysicsMind and found to rely on appearance heuristics while often violating basic mechanics. These gaps indicate that current scaling and training are still insufficient for robust physical understanding.

Framework & Analysis

PhysicsMind Overview

Figure 1. The PhysicsMind framework combines foundational models with physics-guided dataset construction, expert-verified annotations, and diverse controlled scenarios.


Canonical Mechanics Scenarios

Figure 2. Three canonical mechanics scenarios: Center of Mass, Lever Equilibrium, and Newton's First Law, tested in Real and Sim environments.
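The first scenario rests on the standard weighted-average definition of the center of mass. A minimal sketch of that computation, with hypothetical masses and positions that are not taken from the benchmark:

```python
# Center of mass of point masses: r_com = sum(m_i * r_i) / sum(m_i).
# A tabletop object stays upright only if r_com lies above its support
# region -- the property the Center of Mass scenario probes.

def center_of_mass(masses, positions):
    """Weighted average of 2D positions by mass."""
    total = sum(masses)
    x = sum(m * p[0] for m, p in zip(masses, positions)) / total
    y = sum(m * p[1] for m, p in zip(masses, positions)) / total
    return x, y

# Two blocks: 2 kg at (0, 0) and 1 kg at (3, 0).
print(center_of_mass([2.0, 1.0], [(0.0, 0.0), (3.0, 0.0)]))  # (1.0, 0.0)
```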

Video Generation Analysis

Qualitative comparisons of Video Generation models against Ground Truth.

Hanging Method

Comparing temporal stability and physical realism.

Rapid Paper Pull

Evaluating temporal consistency and contact.

Lever Equilibrium

Checking if the final state matches torque balance.
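The torque-balance condition behind this check can be sketched as follows. This is a simplified illustration with hypothetical masses and lever arms, not the benchmark's actual scoring code:

```python
# Lever equilibrium: the net torque about the pivot must vanish,
# i.e. sum of m_i * g * d_i over all weights (signed lever arm d_i) == 0.

def is_balanced(masses, arms, tol=1e-9):
    """Signed lever arms: negative = left of pivot, positive = right."""
    g = 9.81  # cancels out of the comparison, kept for clarity
    net_torque = sum(m * g * d for m, d in zip(masses, arms))
    return abs(net_torque) < tol

# 2 kg at 1 m left of the pivot vs 1 kg at 2 m right: balanced.
print(is_balanced([2.0, 1.0], [-1.0, 2.0]))  # True
print(is_balanced([2.0, 1.0], [-1.0, 1.0]))  # False
```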

Experimental Results

1. VQA Physics Evaluation

Model | Center of Mass (CoM): Position / Rotation / Overall | Lever Equilibrium (LE): Equilibrium / Bal. Adj. / Overall | Newton's First Law (NI): Obj. Pos. / Stability / Overall
GPT-5 [30] | 60.00 / 80.00 / 70.00 | 66.67 / 76.19 / 70.00 | 60.00 / 95.00 / 77.50
o4-mini [29] | 50.00 / 55.00 / 52.50 | 28.57 / 85.71 / 52.50 | 75.00 / 95.00 / 85.00
GPT-4o [27] | 40.00 / 35.00 / 37.50 | 42.86 / 54.76 / 47.50 | 45.00 / 85.00 / 65.00
Claude 4.5 Sonnet [3] | 45.00 / 20.00 / 32.50 | 61.90 / 61.90 / 59.52 | 40.00 / 90.00 / 65.00
Gemini 2.5 Pro [9] | 30.00 / 70.00 / 50.00 | 61.90 / 52.38 / 57.14 | 20.00 / 45.00 / 32.50
Qwen2.5-VL-72B [36] | 30.00 / 55.00 / 42.50 | 54.76 / 52.38 / 52.38 | 42.86 / 85.00 / 62.50

Table 1. Selected VQA Physics Evaluation Results. Values represent accuracy in percentage (%).

2. Video Generation Physics Evaluation

Model | Center of Mass: Mask IoU ↑ / Center Diff ↓ | Lever Eq.: Final Acc (%) ↑ | Newton's First Law: Traj. RMSE ↓ / Dir. Consistency ↑
Veo3.1 [13] | 0.019 / 108.39 | 35.0 | 1.384 / 0.5419
Sora-2 [31] | 0.167 / 121.42 | 40.0 | 0.380 / 0.5494
LTX-Video [14] | 0.005 / 76.37 | 4.76 | 0.406 / 0.5594
CogVideoX1.5 [42] | 0.014 / 223.70 | 38.1 | 0.414 / 0.4884
Pyramid Flow [17] | 0.012 / 322.97 | 47.6 | 0.381 / 0.6437
Cosmos-predict2 [24] | 0.009 / 217.33 | 42.9 | 0.350 / 0.4884

Table 2. Video generation physics evaluation. Metrics include Segmentation Mask IoU, Final Accuracy, and Direction Consistency (higher is better), and Center Difference and Trajectory RMSE (lower is better).
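The trajectory metrics in Table 2 can be sketched roughly as below. The exact definitions used by the benchmark (normalization, frame alignment, handling of stationary frames) are our assumptions here, not the paper's verbatim formulas:

```python
import math

def traj_rmse(pred, gt):
    """Root-mean-square error between matched 2D trajectory points."""
    sq = [(p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 for p, q in zip(pred, gt)]
    return math.sqrt(sum(sq) / len(sq))

def dir_consistency(pred, gt):
    """Mean cosine similarity between per-frame displacement vectors."""
    def steps(traj):
        return [(b[0] - a[0], b[1] - a[1]) for a, b in zip(traj, traj[1:])]
    sims = []
    for u, v in zip(steps(pred), steps(gt)):
        nu, nv = math.hypot(*u), math.hypot(*v)
        if nu == 0 or nv == 0:
            continue  # skip stationary frames
        sims.append((u[0] * v[0] + u[1] * v[1]) / (nu * nv))
    return sum(sims) / len(sims) if sims else 0.0

gt   = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
pred = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
print(traj_rmse(pred, gt))        # ~0.1 (pure vertical offset)
print(dir_consistency(pred, gt))  # 1.0 (same heading every frame)
```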

BibTeX

@article{PhysicsMind2026,
  title={PhysicsMind: Sim and Real Mechanics Benchmarking for Physical Reasoning and Prediction in Foundational VLMs and World Models},
  author={Mak, Chak-Wing and Zhu, Guanyu and Zhang, Boyi and Li, Hongji and Chi, Xiaowei and Zhang, Kevin and Wu, Yichen and He, Yangfan and Fan, Chun-Kai and Lu, Wentao and Ge, Kuangzhi and Fang, Xinyu and He, Hongyang and Lu, Kuan and Xu, Tianxiang and Zhang, Li and Ni, Yongxin and Li, Youhua and Zhang, Shanghang},
  journal={Under Review},
  year={2026}
}