I am a final-year Ph.D. student in the Multi-Media Lab at The Chinese University of Hong Kong, supervised by Prof. Dahua Lin. My research focuses on vision-language models, video grounding, and long-horizon agent evaluation.
I am currently a Research Scientist Intern at Meta Superintelligence Labs, Segment Anything Team, returning for my second internship with the team and working on video grounding with Jie Lei. Previously, I worked on SAM 3 with Nicolas Carion at Meta and on multimodal LLMs with Jiaqi Wang at Shanghai AI Laboratory.
I expect to graduate in Summer 2027 and am actively seeking Research Scientist opportunities in multimodal AI, video VLMs, and agent evaluation.
[May. 2026] Released WildClawBench, a real-world long-horizon agent benchmark with 60 human-authored multimodal tasks.
[May. 2026] SetCon is online, introducing set-level concept prediction for open-ended referring segmentation.
[Jan. 2026] Three papers have been accepted at ICLR 2026.
[Nov. 2025] We released SAM 3, a unified model for detection, segmentation, and tracking of objects in images and video using text, exemplar, and visual prompts.
[Aug. 2025] Invited talk at LSVOS workshop at ICCV 2025.
A real-world, long-horizon agent benchmark with 60 human-authored multimodal tasks across productivity, coding, search, social interaction, creative synthesis, and safety.
SAM 3: Segment Anything with Concepts
Nicolas Carion*, Laura Gustafson*, Yuan-Ting Hu*, Shoubhik Debnath*, Ronghang Hu*, Didac Suris*, Chaitanya Ryali*, Kalyan Vasudev Alwala*, Haitham Khedr*, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu°, Tsung-Han Wu°, Yu Zhou°, Liliane Momeni°, Rishi Hazra°, Shuangrui Ding°, Sagar Vaze°, Francois Porcher°, Feng Li°, Siyuan Li°, Aishwarya Kamath°, Ho Kei Cheng°, Piotr Dollar†, Nikhila Ravi†, Kate Saenko†, Pengchuan Zhang†, Christoph Feichtenhofer†
ICLR, 2026
Project Page /
paper /
blog /
demo /
code /
HF🤗
Unify detection, segmentation, and tracking of any concept in images and video using text and exemplar prompts.
SetCon: Towards Open-Ended Referring Segmentation via Set-Level Concept Prediction
Zhixiong Zhang, Yizhuo Li, Shuangrui Ding†, Yuhang Zang, Shengyuan Ding, Long Xing, Yibin Wang, Qiaosheng Zhang, Jiaqi Wang
arXiv, 2026
arXiv /
code /
HF model /
HF data
Reformulate open-ended referring segmentation as hierarchical set-level concept prediction, enabling interpretable mask-set decoding for image and video grounding.
Outperform SAM 2 by a large margin through a training-free memory tree.
SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation
Shuangrui Ding*, Zihan Liu*, Xiaoyi Dong, Pan Zhang, Rui Qian, Junhao Huang, Conghui He, Dahua Lin, Jiaqi Wang
ACL main, 2025
arXiv /
code /
invited talk /
demo page
A large language model that understands and generates melodies and lyrics in symbolic song representations.
Keynote Speaker, LSVOS Workshop at ICCV 2025, Honolulu, Hawaii. Talk: From Pixels to Meaning: Towards Reliable Video Object Segmentation across Frames.
Invited Speaker, InternLM Community Open Mic, Mar. 2024. Talk: When Songwriting Meets Large Models: SongComposer as an AI Musician.
Misc
1. My favorite sport is soccer. I was the captain of the UM-SJTU JI soccer team during the 2018 season. I am also a huge fan of Manchester City in the Premier League.
2. I am proud to have graduated from the competition class at Hangzhou No.2 High School, where I made friends with many talented students and inspiring teachers.
3. It is worth mentioning that Rui is my best friend and role model, and has kept me motivated for over ten years. Best wishes and good luck!