Evaluation

Dear authors,

Could you please provide more details on how you evaluate the different benchmarks (CV-Bench, BLINK, RoboSpatial, etc.) for the different models (Qwen-2.5-VL-7B, SpaceLLaVA, RoboPoint, etc.)? I tried to reproduce the results in the paper but found large differences. For example, the results I got for SpaceLLaVA on RoboSpatial using the official eval code are much lower than what the paper reports. I would really appreciate your help.

Thanks!

Evaluation #16
