See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding
Paper • 2605.18018 • Published • 32
We advance the development of AGI and foster open source collaboration towards a smarter future.
MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation