I have a few questions about the format for 3D grounding: Camera Intrinsics: Are camera intrinsics not required in the prompt? Does the model infer them? This is a different approach from Seed-VL, ...