Chenghao Yang | USTC - LLM & Agent Evaluation

Profile

Overview

My research focuses on the evaluation of large language models and LLM-based agents. I believe that good evaluation should go beyond ranking: it should reflect how models are actually used and reveal where they fall short in realistic settings such as multi-turn dialogue, agentic tasks, and multimodal reasoning.

I am currently at USTC, advised by Prof. Qi Chu and Prof. Nenghai Yu. Previously, I was an algorithm intern at ByteDance Seed (Nov. 2024 – Dec. 2025), working on the Seed-Evaluation team.

I am actively looking for research internship opportunities. Feel free to reach out if you are interested in collaboration.

Research: Evaluation of LLMs and LLM-based agents.
Approach: Evaluations grounded in realistic scenarios that go beyond ranking.
Links: Email · Google Scholar · GitHub · Citation trends

Highlights

News

2026-05

WorldTravel was accepted to ICML 2026 as a poster, presenting a realistic multimodal travel-planning benchmark spanning 150 real-world scenarios and 2,000+ rendered webpages.

Accepted Paper

2026-04

ExcelBench was released by Humanlaya as a benchmark report for agentic spreadsheet work, evaluating formula construction, formatting control, cross-sheet dependencies, and safe editing.

Benchmark Report

2026-04

When Agents Look the Same was accepted to ACL 2026 Main Conference, introducing RPS and AGS to quantify distillation-induced similarity in LLM agent tool-use behavior.

Accepted Paper

2025-11

DiscoX introduced an expert-domain discourse-level translation benchmark focused on document coherence, terminology consistency, and cross-sentence faithfulness.

arXiv Paper

2025-11

MME-CC introduced a multimodal benchmark for cognitive-capacity evaluation that stresses reasoning-intensive visual-language understanding rather than shallow perception.

arXiv Paper

2025-09

FinSearchComp introduced a financial search-and-reasoning benchmark with open data, simulating analyst-style workflows such as time-sensitive retrieval, evidence synthesis, and multi-step investigation.

arXiv Paper Data

2025-09

MARS-Bench was accepted to EMNLP 2025 Findings as a benchmark for long interactive sports-commentary dialogue, emphasizing motivation transfer, cross-turn dependency, and multi-turn robustness.

Accepted Paper Project

2025-02

CryptoX introduced a compositional reasoning benchmark for LLMs, using cryptography-inspired structure to isolate reasoning gaps that broad QA evaluations often obscure.

arXiv Paper Code

2025-01

Hello Again! was accepted to NAACL 2025 for its study of long-term personalized dialogue with memory retrieval and dynamic persona modeling across sessions.

Accepted Paper

2024-06

The preprint version of Hello Again! introduced a model-agnostic personalized dialogue agent for long-term memory and persona-aware interaction.

2023-12

Joined LDS Lab at USTC and began research on LLM evaluation, dialogue systems, and safety-oriented questions.

Research

Publications

"Lead authorship." = first author or co-first author.

2026 ICML

Lead authorship. WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

Zexuan Wang^*, Chenghao Yang^*, Yingqi Que, Zhenzhu Yang, Huaqing Yuan, Yiwen Wang, Zhengxuan Jiang, Shengjie Fang, Zhenhe Wu, Zhaohui Wang, Zhixin Yao, Jiashuo Liu, Jincheng Ren, Yuzhen Li, Yang Yang, Jiaheng Liu, Jian Yang, Zaiyuan Wang, Ge Zhang, Zhoufutu Wen^†, Wenhao Huang

ICML 2026 Poster · * equal contribution · † corresponding author

Paper

2026 ACL

Lead authorship. When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

Chenghao Yang, Yuning Zhang, Zhoufutu Wen^†, Tao Gong^†, Jiaheng Liu, Qi Chu, Nenghai Yu

ACL 2026 Main Conference · † corresponding author

Paper Code

2026 ICLR

FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang

ICLR 2026 Poster · ByteDance Seed

Paper Code Project Data

2026 ICLR

DiscoX: Benchmarking Discourse-Level Translation in Expert Domains

Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang

ICLR 2026 Poster · ByteDance Seed

Paper Project Code Data

2025 NAACL

Lead authorship. Hello Again! LLM-powered Personalized Agent for Long-term Dialogue

Hao Li^*, Chenghao Yang^*, An Zhang^†, Yang Deng, Xiang Wang, Tat-Seng Chua

NAACL 2025 (Long Paper) · * equal contribution · † corresponding author

Paper Code

2025 EMNLP

Lead authorship. MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

Chenghao Yang^*, Yinbo Luo^*, Zhoufutu Wen^†, Qi Chu^†, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

EMNLP 2025 Findings · * equal contribution · † corresponding author

Paper Code Project

Preprints & Manuscripts

2026 Benchmark

Lead authorship. ExcelBench: Evaluating how far models can go in agentic spreadsheet work

Chenghao Yang, Yanglihong Xiao, Zijie Wang, Zhendong Yu, Tao Peng, Zaiyuan Wang, Jinhu Feng, Chao Xia, Peng Chen, Baozhi Liu, Qianliang Huang, Jianpeng Jiao^†, Zhoufutu Wen^†

Humanlaya benchmark report, 2026 · core contribution / first author · † corresponding author

Report

2026 Manuscript

Lead authorship. Agent4Weakness: An Agentic Framework for In-Depth Model Weakness Discovery

Xuanliang Zhang^*, Chenghao Yang^*, Zhoufutu Wen^†, Dingzirui Wang, Ge Zhang, Xiying Zhao, Tianren Feng, Jianpeng Jiao, Jingkai Liu, Zaiyuan Wang, Zuo Wang, Wenya Wu, Zhou Huan, Jin Chen, Wenhao Huang, Qingfu Zhu, Wanxiang Che

Under review · * equal contribution · † corresponding author

2025 arXiv

Lead authorship. MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

Kaiyuan Zhang^*, Chenghao Yang^*, Zhoufutu Wen^†, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang

arXiv 2025 · * equal contribution · † corresponding author

Paper

2025 arXiv

CryptoX: Compositional Reasoning Evaluation of Large Language Models

Jiajun Shi^*, Chaoren Wei^*, Liqun Yang^*, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang^†, Zhoufutu Wen^†

arXiv 2025 · * equal contribution · † corresponding author

Paper Code Leaderboard

Background

Experience

2025–present

M.Eng., Cyberspace Security

University of Science and Technology of China

Advisors: Prof. Qi Chu & Prof. Nenghai Yu

Research on evaluation of LLMs and LLM-based agents.

2024.11–2025.12

Algorithm Intern

ByteDance Seed, Seed-Evaluation Team

Beijing, China

Developed realistic evaluation pipelines and benchmark suites for large-model applications.

2023–2024

Research Intern

NExT++ Lab, National University of Singapore

Remote · Supervised by Research Fellow An Zhang

2021–2025

B.Eng., Information Security

University of Science and Technology of China

Honors

Awards

Outstanding Graduate Award · USTC · 2025
Wang Xiaomo Talent Program Scholarship ×4 · USTC · 2021–2024
