Great work!
Since Qwen3-VL and Gemini Robotics included the benchmark in their evaluations, it is attracting more attention.
But I cannot reproduce the Qwen3-VL score, since I don't know the evaluation settings (such as the structure of the prompt). I'm pretty sure you guys know what I mean.
Any clues on how to reproduce their results?