Abstract: Advances in multi-modal large language models (MLLMs) have significantly enhanced their ability to process and understand diverse data types, integrating text, images, and other ...
TL;DR: Our Describe Anything Model (DAM) takes as input a region of an image or video, specified as points, boxes, scribbles, or masks, and outputs detailed descriptions to the ...