I have a few questions about the format for 3D grounding:

Camera Intrinsics: Are camera intrinsics not required in the prompt? Does the model infer them? This is a different approach from Seed-VL, ...