TR2026-045

SPINBENCH: PERSPECTIVE AND ROTATION AS A LENS ON SPATIAL REASONING IN VLMS


    •  Zhang, Y., Corcodel, R., Hori, C., Cherian, A., Zhao, D., "SpinBench: 3D Rotation as a Lens on Spatial Reasoning in VLMs", International Conference on Learning Representations (ICLR) 2026, April 2026.
      @inproceedings{Zhang2026apr2,
        author = {Zhang, Yuyou and Corcodel, Radu and Hori, Chiori and Cherian, Anoop and Zhao, Ding},
        title = {{SpinBench: 3D Rotation as a Lens on Spatial Reasoning in VLMs}},
        booktitle = {International Conference on Learning Representations (ICLR) 2026},
        year = 2026,
        month = apr,
        url = {https://www.merl.com/publications/TR2026-045}
      }
  • Research Area: Robotics

Abstract:

We present SPINBENCH, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision-language models (VLMs). SPINBENCH is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SPINBENCH introduces a set of fine-grained diagnostic categories. These categories target translation, rotation, object relative pose, and viewpoint change, and are structured progressively so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 43 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time shows strong correlation with VLM accuracy, indicating that SPINBENCH captures spatial reasoning challenges shared across humans and VLMs. Together, our findings highlight the need for structured, cognitively inspired diagnostic tools to advance spatial reasoning in multimodal foundation models.

 

  • Related Publication

  •  Zhang, Y., Corcodel, R., Hori, C., Cherian, A., Zhao, D., "SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs", arXiv, September 2025.
    @article{Zhang2025sep3,
      author = {Zhang, Yuyou and Corcodel, Radu and Hori, Chiori and Cherian, Anoop and Zhao, Ding},
      title = {{SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs}},
      journal = {arXiv},
      year = 2025,
      month = sep,
      url = {https://arxiv.org/abs/2509.25390}
    }