I plan to graduate in Oct. 2026 and am open to discussing industry opportunities in multimodal AI, agentic systems, and LLM/MLLM training and inference.
I am a Ph.D. candidate in the Dept. of Computer Science & Engineering at HKUST, advised by Prof. Nevin L. Zhang, and a recipient of the Huawei PhD Fellowship (HKUST).
My research focuses on the real-world application of deep vision and vision-language models, with an emphasis on explainability, generalization, MLLM-based agentic visual perception, and controllability in image editing.
More broadly, I aim to develop diagnostic tools that reveal what models currently depend on, and targeted mechanisms that guide them toward causally relevant, trustworthy, and efficient behavior.
Multimodal LLMs may rely on language priors rather than the relevant visual evidence, especially when processing long documents. I explore agentic perception frameworks that gather evidence iteratively to improve both accuracy and efficiency.
Controllable editing requires precise spatial and semantic guidance without costly retraining. I develop training-free methods that combine structural control with flexible prompt guidance.
Foundation models can lose robustness during fine-tuning and fail under distribution shift. I design training objectives that anchor decisions to invariant, generalizable features.
Deep classifiers often rely on spurious correlations rather than causally relevant visual evidence. My work develops explanation methods that diagnose misaligned dependencies and surface discriminative rationales.