Yonggang Qi 齐勇刚


Beijing University of Posts and Telecommunications (BUPT), Beijing, China

I am currently an associate professor at BUPT. Previously, I was a PhD student in the Pattern Recognition and Intelligent Systems (PRIS) laboratory at BUPT, and I received my PhD in Signal Processing from BUPT in 2015 (supervisor: Professor Jun Guo). From 2019 to 2020, I was a visiting scholar at the SketchX lab headed by Dr. Yi-Zhe Song at the Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey. I was also a guest PhD student at Aalborg University, Denmark, in 2013 and a visiting researcher at Sun Yat-sen University, China, in 2014.

My research focuses on computer vision and multimodal learning, with particular interests in human sketch related tasks, image/video generation, and diffusion models. My recent work emphasizes multimodal representation learning, cross-modal alignment, and generative modeling for complex, real-world scenarios.

I plan to recruit one PhD student (via the application-assessment track or the combined master's-PhD track). Interested candidates are welcome to contact me by email with a CV.

Each year I recruit 2-4 master's students (via recommendation or the national entrance exam) and several research interns (3-6 months or longer). Students with a passion for research are welcome to contact me by email with a CV. I am currently recruiting master's students entering in 2026 via the entrance exam; please feel free to get in touch.


News


Selected Publications

*: equal contribution
#: corresponding author

FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
Jing Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi#.
Computer Vision and Pattern Recognition (CVPR 2026).
Addresses text-only reasoning lacking spatial grounding in VLN; encodes imagined visual tokens into compact latent representations, enabling efficient reasoning-aware navigation without explicit token generation at inference.
[ arXiv ]--[ Project Page ]--[ GitHub Code ]

3DrawAgent: Teaching LLM to Draw in 3D with Early Contrastive Experience
Hongcan Xiao, Xinyue Xiao, Yilin Wang, Yue Zhang, Yonggang Qi#.
Computer Vision and Pattern Recognition (CVPR 2026).
Addresses the lack of 3D spatial drawing ability in LLMs; introduces early contrastive experience to teach LLMs to generate structured, geometry-aware 3D drawings.
[ arXiv ]--[ Project Page ]--[ GitHub Code ]

FantasyWorld: Geometry-Consistent World Modeling via Unified Video and 3D Prediction
Yixiang Dai, Fan Jiang, Chiyu Wang, Mu Xu, Yonggang Qi#.
International Conference on Learning Representations (ICLR 2026).
Addresses geometry inconsistency in video generation; augments frozen video foundation models with a trainable geometric branch to jointly model video latents and implicit 3D fields in a single forward pass.
[ arXiv ]--[ Project Page ]--[ GitHub Code ]

Precise Diffusion Inversion: Towards Novel Samples and Few-Step Models
Jing Zuo, Luoping Cui, Chuang Zhu and Yonggang Qi#.
The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).
Addresses accumulated inversion errors in DDIM; proposes a precise inversion framework that enables novel sample generation and accelerates few-step diffusion models without retraining.
[ Paper ]--[ GitHub Code ]

Fuse2Match: Training-Free Fusion of Flow, Diffusion, and Contrastive Models for Zero-Shot Semantic Matching
Jing Zuo, Jiaqi Wang, Yonggang Qi# and Yi-Zhe Song.
The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025).
Addresses zero-shot semantic matching across large appearance gaps; training-free fusion of optical flow, diffusion, and contrastive models yields complementary cross-image correspondences without any fine-tuning.
[ Paper ]--[ GitHub Code ]

Autoregressive Video Generation without Vector Quantization
Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi# and Xinlong Wang#.
International Conference on Learning Representations (ICLR 2025).
Addresses the quality bottleneck of discrete VQ tokenization in autoregressive video generation; replaces VQ with continuous token modeling for higher-fidelity, temporally coherent video synthesis.
[ arXiv ]--[ GitHub Code ]

VersaGen: Unleashing Versatile Visual Control for Text-to-Image Synthesis
Zhipeng Chen, Lan Yang, Yonggang Qi, Honggang Zhang, Kaiyue Pang, Ke Li and Yi-Zhe Song.
AAAI Conference on Artificial Intelligence (AAAI 2025). Oral
Addresses limited and rigid visual control in text-to-image synthesis; proposes a unified framework supporting versatile spatial and semantic control signals (sketch, depth, pose, etc.) with a single model.
[ arXiv ]--[ GitHub Code ]

SAUGE: Taming SAM for Uncertainty-Aligned Multi-Granularity Edge Detection
Xing Liufu, Chaolei Tan, Xiaotong Lin, Yonggang Qi, Jinxuan Li and Jian-Fang Hu.
AAAI Conference on Artificial Intelligence (AAAI 2025).
Addresses the lack of uncertainty awareness in edge detection; tames SAM's multi-granularity segmentation priors to produce uncertainty-aligned, scale-consistent edge maps.

Scale-Adaptive Diffusion Model for Complex Sketch Synthesis
Jijin Hu, Ke Li, Yonggang Qi and Yi-Zhe Song.
International Conference on Learning Representations (ICLR 2024).
Addresses difficulty in synthesizing sketches with complex multi-scale structures; introduces a scale-adaptive diffusion model that dynamically adjusts generation resolution to capture fine-grained spatial details.

Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style
Fengyin Lin*, Mingkang Li*, Da Li, Timothy Hospedales, Yi-Zhe Song and Yonggang Qi.
Computer Vision and Pattern Recognition (CVPR 2023). Highlight
Addresses the black-box nature of cross-modal sketch-based retrieval; proposes a zero-shot SBIR framework that simultaneously retrieves images and generates visual explanations of the matching rationale.

SketchKnitter: Vectorized Sketch Generation with Diffusion Models
Qiang Wang, Haoge Deng, Yonggang Qi#, Da Li and Yi-Zhe Song.
International Conference on Learning Representations (ICLR 2023). Spotlight
Addresses the lack of temporal coherence in vectorized sketch generation; first to apply diffusion models to sequential vector stroke generation, producing human-like sketches with natural drawing order.

A Diffusion-ReFinement Model for Sketch-to-Point Modeling
Di Kong, Qiang Wang, Yonggang Qi#.
The 16th Asian Conference on Computer Vision (ACCV 2022). Oral
Addresses the domain gap between abstract sketches and 3D point clouds; introduces a diffusion-based refinement model that progressively denoises coarse sketch-driven point predictions into detailed 3D shapes.

DiffSketching: Sketch Control Image Synthesis with Diffusion Models
Qiang Wang, Di Kong, Yonggang Qi#.
The 33rd British Machine Vision Conference (BMVC 2022).
Addresses the difficulty of using sparse sketch inputs to guide realistic image generation; leverages diffusion models' strong generative priors to synthesize photorealistic images conditioned on free-hand sketches.

Generative Sketch Healing
Yonggang Qi, Guoyao Su, Qiang Wang, Jie Yang, Kaiyue Pang and Yi-Zhe Song.
International Journal of Computer Vision (IJCV), Springer.
Addresses the problem of incomplete or corrupted sketches; proposes a generative model that infers and restores missing strokes while preserving the original drawing style and semantic structure.
[ Paper ]--[ Project Page ]

SketchLattice: Latticed Representation for Sketch Manipulation
Yonggang Qi*, Guoyao Su*, Pinaki Nath Chowdhury, Mingkang Li and Yi-Zhe Song.
IEEE International Conference on Computer Vision (ICCV), 2021.
Addresses the rigidity of existing sketch representations; introduces a latticed graph structure over strokes that enables flexible, controllable sketch editing and semantic manipulation.

PQA: Perceptual Question Answering
Yonggang Qi*, Kai Zhang*, Aneeshan Sain and Yi-Zhe Song.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
Addresses the absence of perceptual grouping reasoning in VQA; introduces a new task and dataset requiring models to answer questions based on Gestalt perceptual principles applied to visual scenes.

Towards Fine-Grained Sketch-Based 3D Shape Retrieval
Anran Qi, Yulia Gryaditskaya, Jifei Song, Yongxin Yang, Yonggang Qi, Timothy M. Hospedales, Tao Xiang, Yi-Zhe Song.
IEEE Transactions on Image Processing (TIP), 2021.
Addresses coarse-grained retrieval in sketch-based 3D shape search; proposes a fine-grained cross-modal embedding with a new benchmark to retrieve instance-specific 3D shapes from hand-drawn sketches.

Towards Practical Sketch-based 3D Shape Generation: The Role of Professional Sketches
Yue Zhong, Yonggang Qi, Yulia Gryaditskaya, Honggang Zhang, and Yi-Zhe Song.
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

SketchHealer: A Graph-to-Sequence Network for Recreating Partial Human Sketches
Guoyao Su, Yonggang Qi, Kaiyue Pang, Jie Yang and Yi-Zhe Song.
The 31st British Machine Vision Virtual Conference (BMVC 2020).
Oral Presentation, 5% acceptance rate

S3NET: Graph Representational Network For Sketch Recognition
Lan Yang, Aneeshan Sain, Linpeng Li, Yonggang Qi, Honggang Zhang and Yi-Zhe Song.
2020 IEEE International Conference on Multimedia and Expo (ICME 2020).

Improved Traffic Sign Detection In Videos Through Reasoning Effective RoI Proposals
Yanting Zhang, Yonggang Qi, Jie Yang and Jenq-Neng Hwang.
2020 IEEE International Conference on Multimedia and Expo (ICME 2020).

Sketch Fewer to Recognize More by Learning A Co-regularized Sparse Representation
Yonggang Qi and Yi-Zhe Song.
IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

Unpaired Image-to-Sketch Translation Network for Sketch Synthesis
Yue Zhang, Guoyao Su, Yonggang Qi and Jie Yang.
IEEE Visual Communications and Image Processing (VCIP), 2019.

SketchSegNet+: An End-to-End Learning of RNN for Multi-Class Sketch Semantic Segmentation
Yonggang Qi and Zheng-Hua Tan.
IEEE ACCESS.

Image Retrieval by Dense Caption Reasoning
Xinru Wei, Yonggang Qi, Jun Liu and Fang Liu.
IEEE Visual Communications and Image Processing (VCIP), 2017. Oral

Instance-level Coupled Subspace Learning for Fine-grained Sketch-based Image Retrieval
Peng Xu, Qiyue Yin, Yonggang Qi, Yi-Zhe Song, Zhanyu Ma, Liang Wang and Jun Guo.
European Conference on Computer Vision (ECCV), Workshop on Visual Analysis of Sketches, 2016. Oral

Sketch-based Image Retrieval via Siamese Convolutional Neural Network
Yonggang Qi, Yi-Zhe Song, Honggang Zhang and Jun Liu.
IEEE International Conference on Image Processing (ICIP), 2016.

Making Better Use of Edges via Perceptual Grouping
Yonggang Qi, Yi-Zhe Song, Tao Xiang, Honggang Zhang, Timothy Hospedales, Yi Li and Jun Guo.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

Im2Sketch: Sketch generation by unconflicted perceptual grouping
Yonggang Qi, Jun Guo, Yi-Zhe Song, Tao Xiang, Honggang Zhang and Zheng-Hua Tan.
Neurocomputing

Sketching by Perceptual Grouping
Yonggang Qi, Jun Guo, Yi Li, Honggang Zhang, Tao Xiang and Yi-Zhe Song.
IEEE International Conference on Image Processing (ICIP), 2013.

Team Members

PhD Students
Jing Zuo (左京)
Multimodal Reasoning
Yilin Wang (王怡琳)
Image Generation and Editing
Liyun Peng (彭立云)
Multimodal Understanding and Generation
Yanjie Guo (郭琰杰)
3D Sketch Vision
Master Students
Donglin Nie (倪东霖)
OOD Generation
Yunpeng Zhang (张云鹏)
Avatar Generation
Jiaqi Wang (王家琪)
Image Generation
Junxiao Tang (唐浚潇)
LLM for Motion
Hongcan Xiao (肖红灿)
LLM for 3D
Zundi Ke (柯尊迪)
Visual Abstraction
Xiaobin Zhang (张晓斌)
3D Vision
Xinyue Xiao (肖欣悦)
Image Generation & Editing
Xinyue Zhang (张欣悦)
VLM Post Training
Junwei Liu (刘俊玮)
Image Generation

Alumni

Master Students

Updated Mar. 2026, page created using Bootstrap