MS · Computer Vision & Image Processing Lab · Sogang University
Graduate researcher advised by Prof. Unsang Park. I work on vision–language multimodal models and diffusion-based generation — from training-free control of diffusion attention to how we evaluate multimodal systems.
Vision–Language Multimodal / Diffusion Models / Multimodal Evaluation
I'm a master's student in the Computer Vision and Image Processing Lab (CVIP) at Sogang University, advised by Prof. Unsang Park. Before joining CVIP, I was an undergraduate researcher at the AI Accelerator Lab at Hallym University and a research intern at ETRI.
My current research treats cross-attention in diffusion models as a signal that can be analyzed and edited at inference time, and asks how vision–language models should be measured — designing diagnostics that separate what a model knows from how a benchmark is built.
Earlier work spanned sleep-stage classification, multimodal physiological signals, and applied competitions in detection, segmentation, and OCR.
Under review (BMVC 2026). A memory-and-training framework for long-horizon, fixed-camera nature video generation that balances spatial persistence with motion continuity, sustaining plausible dynamics for fluid phenomena such as water, fire, and smoke over multi-minute autoregressive rollouts.
Under review (CIKM 2026). A matched original / permutation / hard diagnostic protocol probing how vision–language models respond to same-type, visually plausible near-miss distractors — showing that evidence-first prompting is a model-dependent intervention rather than a universal fix.
Under review (ECCV 2026). An inference-time method that modulates cross-attention logits in the Fourier domain to control generation without retraining, while largely preserving semantic alignment.
Expert Systems with Applications (ESWA), 2026
Int'l Conf. on ICT Convergence (ICTC), 2024 · ETRI Human Understanding AI Paper Challenge
arXiv preprint · SOTA on SleepEDF-78 and SHHS
Annual Symposium of KIPS (ASK), 2024 · in Korean
Joint Conf. on Communications and Information (JCCI), 2023 · in Korean
Transformer-based, PSG-free prediction of obstructive sleep apnea severity from multi-view facial images. Basis for the ESWA publication.
Class-level forgetting by shifting the decision boundary of a trained network — motivated by deepfake and data-removal settings.
Object detection + OCR pipeline that reads nutrition labels and logs personalized intake. Four-person team capstone project.
Route-adjustment system that recommends safer ship paths from ocean-state conditions to help prevent maritime accidents.