Mr. En Yu | Artificial Intelligence in Statistics | Excellence in Research Award
Huazhong University of Science and Technology | China
Mr. En Yu is a PhD student in the Department of Intelligence Science and Technology at Huazhong University of Science and Technology (HUST), where he also completed his M.S. and B.Eng. degrees, building a strong foundation in automation, intelligent systems, and multimodal learning. His research focuses on visual perception, spatial intelligence, and multimodal large language models (MLLMs), with pioneering contributions in image and video understanding, multimodal reasoning, and reinforcement learning–based alignment for foundation models.

He has produced influential work, including research on anti-scaling laws and temporal hacking in video MLLMs, the development of future-prediction multimodal reasoning models, perception policy learning with RL for visual tasks, and breakthroughs in fully end-to-end multi-object tracking. His earlier work established new directions in cross-domain tracking with natural-language representations, contrastive multi-object tracking, and decoupled representation learning for relation-aware MOT. Yu has further advanced spatial intelligence through contributions to 3D multi-object tracking and open-vocabulary tracking, bridging perception, reasoning, and robust scene understanding.

He has interned at MEGVII Technology in the Foundation Model Group under Xiangyu Zhang, at StepFun AI in the Multimodal LLM Group under Zheng Ge, and at the UCSB NLP Group under William Wang as a visiting PhD researcher, contributing to cutting-edge multimodal systems and video-language modeling. His work is rapidly shaping next-generation MLLMs and visual reasoning systems. He actively serves as a reviewer for top AI conferences, including NeurIPS, CVPR, ICCV, ECCV, ICML, and ICLR, and for leading journals such as TMM and TCSVT. His current interests span synthetic multimodal data generation, supervised and reinforcement post-training for MLLMs, real-world navigation agents, game agents, and spatial perception in visual and multimodal foundation models.
Outside research, he enjoys movies, singing, reading, ball games, swimming, and skiing.
Profile: Google Scholar
Featured Publications
Yu, E., Zhao, L., Wei, Y., Yang, J., Wu, D., Kong, L., Wei, H., Wang, T., Ge, Z., et al. (2024). Merlin: Empowering multimodal LLMs with foresight minds. ECCV, 425–443.
Wei, H., Kong, L., Chen, J., Zhao, L., Ge, Z., Yu, E., Sun, J., Han, C., & Zhang, X. (2024). Small language model meets with reinforced vision vocabulary. arXiv Preprint, arXiv:2401.12503.
Yu, E., Lin, K., Zhao, L., Yin, J., Wei, Y., Peng, Y., Wei, H., Sun, J., Han, C., Ge, Z., et al. (2025). Perception-R1: Pioneering perception policy with reinforcement learning. NeurIPS.
Chen, S., Yu, E., Li, J., & Tao, W. (2024). Delving into the trajectory long-tail distribution for multi-object tracking. CVPR, 19341–19351.
Li, Z., Han, C., Ge, Z., Yang, J., Yu, E., Wang, H., Zhang, X., & Zhao, H. (2024). GroupLane: End-to-end 3D lane detection with channel-wise grouping. IEEE Robotics and Automation Letters, 24.