publications
Sorted by year.
2026
- Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding. In ICML, 2026
- High-Fidelity Text-to-Image Generation from Pre-Trained Vision-Language Models via Distribution-Conditioned Diffusion Decoding. arXiv preprint arXiv:2603.13389, 2026
- A Comprehensive Ecosystem for Open-Domain Customized Video Generation. In ICASSP, 2026
2025
- PACR: Progressively Ascending Confidence Reward for LLM Reasoning. arXiv preprint arXiv:2510.22255, 2025
- InfoAgent: Advancing Autonomous Information-Seeking Agents. arXiv preprint arXiv:2509.25189, 2025
- StableAnimator++: Overcoming Pose Misalignment and Face Distortion for Human Image Animation. arXiv preprint arXiv:2507.15064, 2025
- MagicMotion: Controllable Video Generation with Dense-to-Sparse Trajectory Guidance. In ICCV, 2025
- AID: Adapting Image2Video Diffusion Models for Instruction-Guided Video Prediction. In ICCV, 2025
- ReasonGen-R1: CoT for Autoregressive Image Generation Models through SFT and RL. arXiv preprint arXiv:2505.24875, 2025
- UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval. In WACV, 2025
2024
- Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms. In NeurIPS, 2024
- The Role of ViT Design and Training in Robustness Towards Common Corruptions. IEEE Transactions on Multimedia, 2024
2023
- Parallel sentence-level explanation generation for real-world low-resource scenarios. In ICASSP, 2023
2021
- Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning. arXiv preprint arXiv:2106.06939, 2021
2019
- Improving the Learning of Multi-Column Convolutional Neural Network for Crowd Counting. In ACM Multimedia, 2019
2015
- Fudan-Huawei at MediaEval 2015: Detecting Violent Scenes and Affective Impact in Movies with Deep Learning. In MediaEval, 2015
2014
- Fudan-NJUST at MediaEval 2014: Violent Scenes Detection Using Deep Neural Networks. In MediaEval, 2014
- Challenge Huawei challenge: Fusing Multimodal Features with Deep Neural Networks for Mobile Video Annotation. In ICMEW, 2014
2013
- Beauty is here: Evaluating Aesthetics in Videos using Multimodal Features and Free Training Data. In ACM Multimedia, 2013
- Fudan at MediaEval 2013: Violent Scenes Detection Using Motion Features and Part-Level Attributes. In MediaEval, 2013
2012
- A Fast Video Event Recognition System and its Application to Video Search. In ACM Multimedia (Demo), 2012
- The Shanghai-Hongkong team at MediaEval2012: Violent Scene Detection using Trajectory-based Features. In MediaEval, 2012