publications

Sorted by year.

2026

  1. Covering Human Action Space for Computer Use: Data Synthesis and Benchmark
    Miaosen Zhang, Xiaohan Zhao, Zhihong Tan, Huoshen Zhou, Yijia Fan, Yifan Yang, and 11 more authors
    arXiv preprint arXiv:2605.12501, 2026
  2. Towards On-Policy SFT: Distribution Discriminant Theory and its Applications in LLM Training
    Miaosen Zhang, Yishan Liu, Shuxia Lin, Xu Yang, Qi Dai, Chong Luo, and 5 more authors
    In ICML, 2026
  3. RE-TRAC: Recursive Trajectory Compression for Deep Search Agents
    Jialiang Zhu, Gongrui Zhang, Xiaolong Ma, Lin Xu, Miaosen Zhang, Ruiqi Yang, and 14 more authors
    In ICML, 2026
  4. Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding
    Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong, Mark A Hasegawa-Johnson, and 3 more authors
    In ICML, 2026
  5. AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
    Ziwei Zhou, Zeyuan Lai, Rui Wang, Yifan Yang, Zhen Xing, Yuqing Yang, and 3 more authors
    In ICML, 2026
  6. Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning
    Chendong Wang, Donglin Bai, Yifan Yang, Xiao Jin, Anlan Zhang, Rui Wang, and 8 more authors
    In ICML, 2026
  7. ArcFlow: Unleashing 2-Step Text-to-Image Generation via High-Precision Non-Linear Flow Distillation
    Zihan Yang, Shuyuan Tu, Licheng Zhang, Qi Dai, Yu-Gang Jiang, and Zuxuan Wu
    arXiv preprint arXiv:2602.09014, 2026
  8. High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding
    Ji Woo Hong, Hee Suk Yoon, Gwanhyeong Koo, Eunseop Yoon, SooHwan Eom, Qi Dai, and 2 more authors
    arXiv preprint arXiv:2603.13389, 2026
  9. Language-Conditioned World Modeling for Visual Navigation
    Yifei Dong, Fengyi Wu, Yilong Dai, Lingdong Kong, Guangyu Chen, Xu Zhu, and 7 more authors
    arXiv preprint arXiv:2603.26741, 2026
  10. Semantics Lead the Way: Harmonizing Semantic and Texture Modeling with Asynchronous Latent Diffusion
    Yueming Pan, Ruoyu Feng, Qi Dai, Yuqi Wang, Wenfeng Lin, Mingyu Guo, and 2 more authors
    In CVPR, 2026
  11. FlashPortrait: 6x Faster Infinite Portrait Animation with Adaptive Latent Prediction
    Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, and 3 more authors
    In CVPR, 2026
  12. FlashMotion: Few-Step Controllable Video Generation with Trajectory Guidance
    Quanhao Li, Zhen Xing, Rui Wang, Haidong Cao, Qi Dai, Daoguo Dong, and 1 more author
    In CVPR, 2026
  13. DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
    Wonjoon Jin, Jiyun Won, Janghyeok Han, Qi Dai, Chong Luo, Seung-Hwan Baek, and 1 more author
    In CVPR, 2026
  14. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
    Hee Suk Yoon, Eunseop Yoon, Ji Woo Hong, SooHwan Eom, Gwanhyeong Koo, Mark A Hasegawa-Johnson, and 3 more authors
    In CVPR, 2026
  15. GoViG: Goal-Conditioned Visual Navigation Instruction Generation via Multimodal Reasoning
    Fengyi Wu, Yifei Dong, Yilong Dai, Guangyu Chen, Qifeng Wu, Huiting Huang, and 4 more authors
    In ACL Findings, 2026
  16. MageBench: Bridging Large Multimodal Models to Agents
    Miaosen Zhang, Qi Dai, Yifan Yang, Jianmin Bao, Dongdong Chen, Kai Qiu, and 3 more authors
    In WACV, 2026
  17. A Comprehensive Ecosystem for Open-Domain Customized Video Generation
    Jingxu Zhang, Yuqian Hong, Daneul Kim, Kai Qiu, Qi Dai, Jianmin Bao, and 3 more authors
    In ICASSP, 2026
  18. LLM2CLIP: Powerful Language Model Unlocks Richer Cross-Modality Representation
    Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Usman Naseem, and 7 more authors
    In AAAI, 2026
  19. Ziqin Zhou, Yifan Yang, Yuqing Yang, Tianyu He, Houwen Peng, Kai Qiu, and 4 more authors
    In AAAI, 2026

2025

  1. PACR: Progressively Ascending Confidence Reward for LLM Reasoning
    Eunseop Yoon, Hee Suk Yoon, Jaehyun Jang, SooHwan Eom, Qi Dai, Chong Luo, and 2 more authors
    arXiv preprint arXiv:2510.22255, 2025
  2. InfoAgent: Advancing Autonomous Information-Seeking Agents
    Gongrui Zhang, Jialiang Zhu, Ruiqi Yang, Kai Qiu, Miaosen Zhang, Zhirong Wu, and 12 more authors
    arXiv preprint arXiv:2509.25189, 2025
  3. Phi-Ground Tech Report: Advancing Perception in GUI Grounding
    Miaosen Zhang, Ziqiang Xu, Jialiang Zhu, Qi Dai, Kai Qiu, Yifan Yang, and 5 more authors
    arXiv preprint arXiv:2507.23779, 2025
  4. StableAvatar: Infinite-Length Audio-Driven Avatar Video Generation
    Shuyuan Tu, Yueming Pan, Yinming Huang, Xintong Han, Zhen Xing, Qi Dai, and 3 more authors
    arXiv preprint arXiv:2508.08248, 2025
  5. StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation
    Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and 2 more authors
    arXiv preprint arXiv:2507.15064, 2025
  6. Unified World Models: Memory-Augmented Planning and Foresight for Visual Navigation
    Yifei Dong, Fengyi Wu, Guangyu Chen, Zhi-Qi Cheng, Qiyu Hu, Yuxuan Zhou, and 4 more authors
    arXiv preprint arXiv:2510.08713, 2025
  7. MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance
    Quanhao Li, Zhen Xing, Rui Wang, Hui Zhang, Qi Dai, and Zuxuan Wu
    In ICCV, 2025
  8. JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
    Byung-Ki Kwon, Qi Dai, Hyoseok Lee, Chong Luo, and Tae-Hyun Oh
    In ICCV, 2025
  9. AID: Adapting Image2Video Diffusion Models for Instruction-Guided Video Prediction
    Zhen Xing, Qi Dai, Zejia Weng, Zuxuan Wu, and Yu-Gang Jiang
    In ICCV, 2025
  10. MotionFollower: Editing Video Motion via Score-Guided Diffusion
    Shuyuan Tu, Qi Dai, Zihao Zhang, Sicheng Xie, Zhi-Qi Cheng, Chong Luo, and 3 more authors
    In ICCV, 2025
  11. REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents
    Rui Tian, Qi Dai, Jianmin Bao, Kai Qiu, Yifan Yang, Chong Luo, and 2 more authors
    In ICCV, 2025
  12. ViaRL: Adaptive Temporal Grounding via Visual Iterated Amplification Reinforcement Learning
    Ziqiang Xu, Qi Dai, Tian Xie, Yifan Yang, Kai Qiu, DongDong Chen, and 2 more authors
    arXiv preprint arXiv:2505.15447, 2025
  13. ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL
    Yu Zhang, Yunqi Li, Yifan Yang, Rui Wang, Yuqing Yang, Qi Dai, and 4 more authors
    arXiv preprint arXiv:2505.24875, 2025
  14. Phi-4-Mini Technical Report: Compact yet powerful multimodal language models via mixture-of-loras
    Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, and 68 more authors
    arXiv preprint arXiv:2503.01743, 2025
  15. StableAnimator: High-quality identity-preserving human image animation
    Shuyuan Tu, Zhen Xing, Xintong Han, Zhi-Qi Cheng, Qi Dai, Chong Luo, and 1 more author
    In CVPR, 2025
  16. FloVD: Optical flow meets video diffusion model for enhanced camera-controlled video synthesis
    Wonjoon Jin, Qi Dai, Chong Luo, Seung-Hwan Baek, and Sunghyun Cho
    In CVPR, 2025
  17. HomoGen: Enhanced Video Inpainting via Homography Propagation and Diffusion
    Ding Ding, Yueming Pan, Ruoyu Feng, Qi Dai, Kai Qiu, Jianmin Bao, and 2 more authors
    In CVPR, 2025
  18. Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
    Yifei Dong, Fengyi Wu, Sanjian Zhang, Guangyu Chen, Yuzhi Hu, Masumi Yano, and 5 more authors
    In CVPRW, 2025
  19. FaceA-Net: Facial Attribute-Driven ID Preserving Image Generation Network
    Jiayu Wang, Yue Yu, Jingjing Chen, Qi Dai, and Yu-Gang Jiang
    In AAAI, 2025
  20. UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
    Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, and 3 more authors
    In WACV, 2025

2024

  1. Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
    Miaosen Zhang, Yixuan Wei, Zhen Xing, Yifei Ma, Zuxuan Wu, Ji Li, and 5 more authors
    In NeurIPS, 2024
  2. Human-aware vision-and-language navigation: Bridging simulation to reality with dynamic human interactions
    Heng Li, Minghan Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, and 3 more authors
    In NeurIPS, 2024
  3. MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
    Yanhui Wang, Jianmin Bao, Wenming Weng, Ruoyu Feng, Dacheng Yin, Tao Yang, and 9 more authors
    In CVPR, 2024
  4. MotionEditor: Editing Video Motion via Content-Aware Diffusion
    Shuyuan Tu, Qi Dai, Zhi-Qi Cheng, Han Hu, Xintong Han, Zuxuan Wu, and 1 more author
    In CVPR, 2024
  5. SimDA: Simple Diffusion Adapter for Efficient Video Generation
    Zhen Xing, Qi Dai, Han Hu, Zuxuan Wu, and Yu-Gang Jiang
    In CVPR, 2024
  6. BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition
    Yuxuan Zhou, Xudong Yan, Zhi-Qi Cheng, Yan Yan, Qi Dai, and Xian-Sheng Hua
    In CVPR, 2024
  7. ARTV: Auto-Regressive Text-to-Video Generation with Diffusion Models
    Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Wang Chunyu, Dacheng Yin, and 7 more authors
    In CVPRW, 2024
  8. A survey on video diffusion models
    Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, and 2 more authors
    ACM Computing Surveys, 2024
  9. The Role of ViT Design and Training in Robustness Towards Common Corruptions
    Rui Tian, Zuxuan Wu, Qi Dai, Micah Goldblum, Han Hu, and Yu-Gang Jiang
    IEEE Transactions on Multimedia, 2024

2023

  1. SVFormer: Semi-supervised Video Transformer for Action Recognition
    Zhen Xing, Qi Dai, Han Hu, Jingjing Chen, Zuxuan Wu, and Yu-Gang Jiang
    In CVPR, 2023
  2. ResFormer: Scaling ViTs with Multi-Resolution Training
    Rui Tian, Zuxuan Wu, Qi Dai, Han Hu, Yu Qiao, and Yu-Gang Jiang
    In CVPR, 2023
  3. On Data Scaling in Masked Image Modeling
    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Yixuan Wei, Qi Dai, and 1 more author
    In CVPR, 2023
  4. HiViT: A simpler and more efficient design of hierarchical vision transformer
    Xiaosong Zhang, Yunjie Tian, Lingxi Xie, Wei Huang, Qi Dai, Qixiang Ye, and 1 more author
    In ICLR, 2023
  5. Implicit Temporal Modeling with Learnable Alignment for Video Recognition
    Shuyuan Tu, Qi Dai, Zuxuan Wu, Zhi-Qi Cheng, Han Hu, and Yu-Gang Jiang
    In ICCV, 2023
  6. All in Tokens: Unifying Output Space of Visual Tasks via Soft Token
    Jia Ning, Chen Li, Zheng Zhang, Chunyu Wang, Zigang Geng, Qi Dai, and 2 more authors
    In ICCV, 2023
  7. ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules
    Zhi-Qi Cheng, Qi Dai, and Alexander G Hauptmann
    In ICCV, 2023
  8. VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
    Zhen Xing, Qi Dai, Zihao Zhang, Hui Zhang, Han Hu, Zuxuan Wu, and 1 more author
    arXiv preprint arXiv:2311.18837, 2023
  9. Parallel sentence-level explanation generation for real-world low-resource scenarios
    Yan Liu, Xiaokang Chen, and Qi Dai
    In ICASSP, 2023
  10. Deep Uncoupled Discrete Hashing via Similarity Matrix Decomposition
    Dayan Wu, Qi Dai, Bo Li, and Weiping Wang
    ACM TOMM, 2023

2022

  1. SimMIM: A Simple Framework for Masked Image Modeling
    Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, and 2 more authors
    In CVPR, 2022
  2. Rethinking Spatial Invariance of Convolutional Networks for Object counting
    Zhi-Qi Cheng, Qi Dai, Hong Li, Jingkuan Song, Xiao Wu, and Alexander G Hauptmann
    In CVPR, 2022
  3. On the Connection between Local Attention and Dynamic Depth-Wise Convolution
    Qi Han, Zejia Fan, Qi Dai, Lei Sun, Ming-Ming Cheng, Jiaying Liu, and 1 more author
    ICLR, 2022
  4. GSRFormer: Grounded Situation Recognition Transformer with Alternate Semantic Attention Refinement
    Zhi-Qi Cheng, Qi Dai, Siyao Li, Teruko Mitamura, and Alexander Hauptmann
    In ACM Multimedia, 2022
  5. MPII: Multi-level Mutual Promotion for Inference and Interpretation
    Yan Liu, Sanyuan Chen, Yazheng Yang, and Qi Dai
    In ACL, 2022

2021

  1. Temporal Action Detection with Multi-Level Supervision
    Baifeng Shi, Qi Dai, Judy Hoffman, Kate Saenko, Trevor Darrell, and Huijuan Xu
    In ICCV, 2021
  2. Self-Supervised Learning with Swin Transformers
    Zhenda Xie, Yutong Lin, Zhuliang Yao, Zheng Zhang, Qi Dai, Yue Cao, and 1 more author
    arXiv preprint arXiv:2105.04553, 2021
  3. Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning
    Shaobo Min, Qi Dai, Hongtao Xie, Chuang Gan, Yongdong Zhang, and Jingdong Wang
    arXiv preprint arXiv:2106.06939, 2021
  4. A Novel Class Restriction Loss for Unsupervised Domain Adaptation
    Qi He, Qi Dai, Xiao Wu, and Jun-Yan He
    Neurocomputing, 2021

2020

  1. Informative Dropout for Robust Representation Learning: A Shape-bias Perspective
    Baifeng Shi, Dinghuai Zhang, Qi Dai, Zhanxing Zhu, Yadong Mu, and Jingdong Wang
    In ICML, 2020
  2. Weakly-Supervised Action Localization by Generative Attention Modeling
    Baifeng Shi, Qi Dai, Yadong Mu, and Jingdong Wang
    In CVPR, 2020
  3. Reinforced Short-length Hashing
    Xingbo Liu, Xiushan Nie, Qi Dai, Yupan Huang, Li Lian, and Yilong Yin
    IEEE TCSVT, 2020

2019

  1. Deep Incremental Hashing Network for Efficient Image Retrieval
    Dayan Wu, Qi Dai, Jing Liu, Bo Li, and Weiping Wang
    In CVPR, 2019
  2. Learning Spatial Awareness to Improve Crowd Counting
    Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, and Alexander G Hauptmann
    In ICCV, 2019
  3. Improving the Learning of Multi-Column Convolutional Neural Network for Crowd Counting
    Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Jun-Yan He, and Alexander G Hauptmann
    In ACM Multimedia, 2019
  4. Decoupling Localization and Classification in Single Shot Temporal Action Detection
    Yupan Huang, Qi Dai, and Yutong Lu
    In ICME, 2019

2018

  1. Recurrent Tubelet Proposal and Recognition Networks for Action Detection
    Dong Li, Zhaofan Qiu, Qi Dai, Ting Yao, and Tao Mei
    In ECCV, 2018
  2. Deep Domain Adaptation Hashing with Adversarial Learning
    Fuchen Long, Ting Yao, Qi Dai, Xinmei Tian, Jiebo Luo, and Tao Mei
    In SIGIR, 2018

2016

  1. Binary Optimized Hashing
    In ACM Multimedia, 2016
  2. A Bayesian Hashing Approach and its Application to Face Recognition
    Qi Dai, Jianguo Li, Jun Wang, Yurong Chen, and Yu-Gang Jiang
    Neurocomputing, 2016

2015

  1. Optimal Bayesian Hashing for Efficient Face Recognition
    Qi Dai, Jianguo Li, Jun Wang, Yurong Chen, and Yu-Gang Jiang
    In IJCAI, 2015
  2. Human Action Recognition in Unconstrained Videos by Explicit Motion Modeling
    Yu-Gang Jiang, Qi Dai, Wei Liu, Xiangyang Xue, and Chong-Wah Ngo
    IEEE TIP, 2015
  3. Super Fast Event Recognition in Internet Videos
    Yu-Gang Jiang, Qi Dai, Tao Mei, Yong Rui, and Shih-Fu Chang
    IEEE TMM, 2015
  4. Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning
    Qi Dai, Rui-Wei Zhao, Zuxuan Wu, Xi Wang, Zichen Gu, Wenhai Wu, and 1 more author
    In MediaEval, 2015

2014

  1. Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks
    Qi Dai, Zuxuan Wu, Yu-Gang Jiang, Xiangyang Xue, and Jinhui Tang
    In MediaEval, 2014
  2. Challenge Huawei challenge: Fusing Multimodal Features with Deep Neural Networks for Mobile Video Annotation
    Jian Tu, Zuxuan Wu, Qi Dai, Yu-Gang Jiang, and Xiangyang Xue
    In ICMEW, 2014

2013

  1. Beauty is here: Evaluating Aesthetics in Videos using Multimodal Features and Free Training Data
    Yanran Wang, Qi Dai, Rui Feng, and Yu-Gang Jiang
    In ACM Multimedia, 2013
  2. Fudan at MediaEval 2013: Violent Scenes Detection Using Motion Features and Part-Level Attributes
    Qi Dai, Jian Tu, Ziqiang Shi, Yu-Gang Jiang, and Xiangyang Xue
    In MediaEval, 2013

2012

  1. Trajectory-based Modeling of Human Actions with Motion Reference Points
    Yu-Gang Jiang, Qi Dai, Xiangyang Xue, Wei Liu, and Chong-Wah Ngo
    In ECCV, 2012
  2. Fast Semantic Diffusion for Large-scale Context-based Image and Video Annotation
    Yu-Gang Jiang, Qi Dai, Jun Wang, Chong-Wah Ngo, Xiangyang Xue, and Shih-Fu Chang
    IEEE TIP, 2012
  3. A Fast Video Event Recognition System and its Application to Video Search
    Yu-Gang Jiang, Qi Dai, Yingbin Zheng, Xiangyang Xue, Jie Liu, and Dong Wang
    In ACM Multimedia (Demo), 2012
  4. The Shanghai-Hongkong team at MediaEval2012: Violent Scene Detection using Trajectory-based Features
    Yu-Gang Jiang, Qi Dai, Chun Chet Tan, Xiangyang Xue, and Chong-Wah Ngo
    In MediaEval, 2012