Chenghao Yang | USTC - LLM Safety & Evaluation

Profile

Overview

I am a Master's student in Cyberspace Security at USTC, advised by Prof. Qi Chu and Prof. Nenghai Yu.

My research focuses on LLM evaluation, safety, alignment, and realistic benchmark design, with recent work on dialogue, search, reasoning, and multimodal systems. From Nov. 2024 to Dec. 2025, I was an algorithm intern at ByteDance Seed, where I worked on evaluation-centric systems for large-model applications.

Position: M.Eng. student at USTC; former algorithm intern at ByteDance Seed (2024.11–2025.12).
Research: LLM evaluation, safety, alignment, reasoning benchmarks, and AI security.
Approach: Build realistic testbeds and evaluation frameworks that reveal hidden model weaknesses.
Links: Email · Google Scholar · GitHub · Citation trends

Highlights

News

2026-02

WorldTravel introduced a realistic multimodal travel-planning benchmark spanning 150 real-world scenarios and 2,000+ rendered webpages, revealing a sharp drop in feasibility from text-only to multimodal settings.

arXiv Paper
2025-11

DiscoX introduced an expert-domain discourse-level translation benchmark focused on document coherence, terminology consistency, and cross-sentence faithfulness.

arXiv Paper
2025-11

MME-CC introduced a multimodal benchmark for cognitive-capacity evaluation that stresses reasoning-intensive visual-language understanding rather than shallow perception.

arXiv Paper
2025-09

FinSearchComp introduced a financial search-and-reasoning benchmark with open data, simulating analyst-style workflows such as time-sensitive retrieval, evidence synthesis, and multi-step investigation.

arXiv Paper Data
2025-09

MARS-Bench was accepted to EMNLP 2025 Findings as a benchmark for long interactive sports-commentary dialogue, emphasizing motivation transfer, cross-turn dependency, and multi-turn robustness.

Accepted Paper Project

2025-02

CryptoX introduced a compositional reasoning benchmark for LLMs, using cryptography-inspired structure to isolate reasoning gaps that broad QA evaluations often obscure.

arXiv Paper Code
2025-01

Hello Again! was accepted to NAACL 2025 for its study of long-term personalized dialogue with memory retrieval and dynamic persona modeling across sessions.

Accepted Paper
2024-06

The preprint version of Hello Again! introduced a model-agnostic personalized dialogue agent for long-term memory and persona-aware interaction.
2023-12

Joined LDS Lab at USTC and began research on LLM evaluation, dialogue systems, and safety-oriented questions.

Research

Publications

2026 ICLR

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang

ICLR 2026 Poster · ByteDance Seed

Paper Code Project Data
2026 ICLR

DiscoX: Benchmarking Discourse-Level Translation in Expert Domains

Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang

ICLR 2026 Poster

Paper Project Code Data
2025 NAACL

Lead authorship. Hello Again! LLM-powered Personalized Agent for Long-term Dialogue

Hao Li^*, Chenghao Yang^*, An Zhang^†, Yang Deng, Xiang Wang, Tat-Seng Chua

NAACL 2025 (Long Paper) · * equal contribution · † corresponding author

Paper Code
2025 EMNLP

Lead authorship. MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Chenghao Yang^*, Yinbo Luo^*, Zhoufutu Wen^†, Qi Chu^†, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

EMNLP 2025 Findings · * equal contribution · † corresponding author

Paper Code Project

Preprints & Manuscripts

2026 arXiv

Recent preprint

Lead authorship. WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

Zexuan Wang^*, Chenghao Yang^*, Yingqi Que, Zhenzhu Yang, Huaqing Yuan, Yiwen Wang, Zhengxuan Jiang, Shengjie Fang, Zhenhe Wu, Zhaohui Wang, Zhixin Yao, Jiashuo Liu, Jincheng Ren, Yuzhen Li, Yang Yang, Jiaheng Liu, Jian Yang, Zaiyuan Wang, Ge Zhang, Zhoufutu Wen^†, Wenhao Huang

arXiv preprint, 2026 · * equal contribution · † corresponding author

Paper
2026 Manuscript

Lead authorship. Agent4Weakness: An Agentic Framework for In-Depth Model Weakness Discovery

Xuanliang Zhang^*, Chenghao Yang^*, Zhoufutu Wen^†, Dingzirui Wang, Ge Zhang, Xiying Zhao, Tianren Feng, Jianpeng Jiao, Jingkai Liu, Zaiyuan Wang, Zuo Wang, Wenya Wu, Zhou Huan, Jin Chen, Wenhao Huang, Qingfu Zhu, Wanxiang Che

under review · * equal contribution · † corresponding author
2026 Manuscript

Lead authorship. When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

Chenghao Yang, Yuning Zhang, Zhoufutu Wen^†, Qi Chu^†, Jiaheng Liu, Tao Gong, Nenghai Yu

under review · † corresponding author
2025 arXiv

Lead authorship. MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang^*, Chenghao Yang^*, Zhoufutu Wen^†, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang

arXiv 2025 · * equal contribution · † corresponding author

Paper
2025 arXiv

CryptoX: Compositional Reasoning Evaluation of Large Language Models

Jiajun Shi^*, Chaoren Wei^*, Liqun Yang^*, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang^†, Zhoufutu Wen^†

arXiv 2025 · * equal contribution · † corresponding author

Paper Code Leaderboard

Background

Experience

2025-

M.Eng., Cyberspace Security

University of Science and Technology of China

Advisors: Prof. Qi Chu & Prof. Nenghai Yu

Research on LLM safety, evaluation, alignment, and benchmark design.
2024.11–2025.12

Algorithm Intern

ByteDance Seed, Seed-Evaluation Team

Beijing, China

Developed realistic evaluation pipelines and benchmark suites for large-model applications.
2023-2024

Research Intern

NExT++ Lab, National University of Singapore

Remote · Supervised by Research Fellow An Zhang

Conducted research on LLM-based dialogue agents for long-term personalization and memory-aware response generation.
2021-2025

B.Eng., Information Security

University of Science and Technology of China

Coursework and early research in information security, machine learning, and AI security.

Honors

Awards

Outstanding Graduate Award · USTC · 2025
Outstanding Student Bronze Award · USTC · 2024
Wang Xiaomo Talent Program Scholarship ×4 · USTC · 2021-2024
