Profile

Overview

My research focuses on the evaluation of large language models and LLM-based agents. I believe good evaluation should go beyond ranking: it should reflect how models are actually used and reveal where they fall short in realistic settings such as multi-turn dialogue, agentic tasks, and multimodal reasoning.

I am currently at USTC, advised by Prof. Qi Chu and Prof. Nenghai Yu. Previously, I was an algorithm intern at ByteDance Seed (Nov. 2024 – Dec. 2025), working on the Seed-Evaluation team.

I am actively looking for research internship opportunities. Feel free to reach out if you are interested in collaboration.

Research: Evaluation of LLMs and LLM-based agents.
Approach: Evaluations grounded in realistic scenarios that go beyond ranking.

Highlights

News

  1. 2026-05

    WorldTravel was accepted to ICML 2026 as a poster, presenting a realistic multimodal travel-planning benchmark spanning 150 real-world scenarios and 2,000+ rendered webpages.

  2. 2026-04

    ExcelBench was released by Humanlaya as a benchmark report for agentic spreadsheet work, evaluating formula construction, formatting control, cross-sheet dependencies, and safe editing.

  3. 2026-04

    When Agents Look the Same was accepted to ACL 2026 Main Conference, introducing RPS and AGS to quantify distillation-induced similarity in LLM agent tool-use behavior.

  4. 2025-11

    DiscoX introduced an expert-domain discourse-level translation benchmark focused on document coherence, terminology consistency, and cross-sentence faithfulness.

  5. 2025-11

    MME-CC introduced a multimodal benchmark for cognitive-capacity evaluation that stresses reasoning-intensive visual-language understanding rather than shallow perception.

  6. 2025-09

    FinSearchComp introduced a financial search-and-reasoning benchmark with open data, simulating analyst-style workflows such as time-sensitive retrieval, evidence synthesis, and multi-step investigation.

  7. 2025-09

    MARS-Bench was accepted to EMNLP 2025 Findings as a benchmark for long interactive sports-commentary dialogue, emphasizing motivation transfer, cross-turn dependency, and multi-turn robustness.

  1. 2025-02

    CryptoX introduced a compositional reasoning benchmark for LLMs, using cryptography-inspired structure to isolate reasoning gaps that broad QA evaluations often obscure.

  2. 2025-01

    Hello Again! was accepted to NAACL 2025 for its study of long-term personalized dialogue with memory retrieval and dynamic persona modeling across sessions.

  3. 2024-06

    The preprint version of Hello Again! introduced a model-agnostic personalized dialogue agent for long-term memory and persona-aware interaction.

  4. 2023-12

    Joined LDS Lab at USTC and began research on LLM evaluation, dialogue systems, and safety-oriented questions.

Research

Publications

Lead authorship = first author or co-first author.

  1. 2026 ICML

    Lead authorship. WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

    Zexuan Wang*, Chenghao Yang*, Yingqi Que, Zhenzhu Yang, Huaqing Yuan, Yiwen Wang, Zhengxuan Jiang, Shengjie Fang, Zhenhe Wu, Zhaohui Wang, Zhixin Yao, Jiashuo Liu, Jincheng Ren, Yuzhen Li, Yang Yang, Jiaheng Liu, Jian Yang, Zaiyuan Wang, Ge Zhang, Zhoufutu Wen, Wenhao Huang

    ICML 2026 Poster · * equal contribution · † corresponding author

  2. 2026 ACL

    Lead authorship. When Agents Look the Same: Quantifying Distillation-Induced Similarity in Tool-Use Behaviors

    Chenghao Yang, Yuning Zhang, Zhoufutu Wen, Tao Gong, Jiaheng Liu, Qi Chu, Nenghai Yu

    ACL 2026 Main Conference · † corresponding author

  3. 2026 ICLR

    FinSearchComp: Towards a Realistic, Expert-Level Evaluation of Financial Search and Reasoning

    Liang Hu, Jianpeng Jiao, Jiashuo Liu, Yanle Ren, Zhoufutu Wen, Kaiyuan Zhang, Xuanliang Zhang, Xiang Gao, Tianci He, Fei Hu, Yali Liao, Zaiyuan Wang, Chenghao Yang, Qianyu Yang, Mingren Yin, Zhiyuan Zeng, Ge Zhang, Xinyi Zhang, Xiying Zhao, Zhenwei Zhu, Hongseok Namkoong, Wenhao Huang, Yuwen Tang

    ICLR 2026 Poster · ByteDance Seed

  4. 2026 ICLR

    DiscoX: Benchmarking Discourse-Level Translation in Expert Domains

    Xiying Zhao, Zhoufutu Wen, Zhixuan Chen, Jingzhe Ding, Jianpeng Jiao, Shuai Li, Xi Li, Danni Liang, Shengda Long, Qianqian Liu, Xianbo Wu, Hongwan Gao, Xiang Gao, Liang Hu, Jiashuo Liu, Mengyun Liu, Weiran Shi, Chenghao Yang, Qianyu Yang, Xuanliang Zhang, Ge Zhang, Wenhao Huang

    ICLR 2026 Poster · ByteDance Seed

  5. 2025 NAACL

    Lead authorship. Hello Again! LLM-powered Personalized Agent for Long-term Dialogue

    Hao Li*, Chenghao Yang*, An Zhang, Yang Deng, Xiang Wang, Tat-Seng Chua

    NAACL 2025 (Long Paper) · * equal contribution · † corresponding author

  6. 2025 EMNLP

    Lead authorship. MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation

    Chenghao Yang*, Yinbo Luo*, Zhoufutu Wen, Qi Chu, Tao Gong, Longxiang Liu, Kaiyuan Zhang, Jianpeng Jiao, Ge Zhang, Wenhao Huang, Nenghai Yu

    EMNLP 2025 Findings · * equal contribution · † corresponding author

Preprints & Manuscripts

  1. 2026 Benchmark

    Lead authorship. ExcelBench: Evaluating how far models can go in agentic spreadsheet work

    Chenghao Yang, Yanglihong Xiao, Zijie Wang, Zhendong Yu, Tao Peng, Zaiyuan Wang, Jinhu Feng, Chao Xia, Peng Chen, Baozhi Liu, Qianliang Huang, Jianpeng Jiao, Zhoufutu Wen

    Humanlaya benchmark report, 2026 · core contribution / first author · † corresponding author

  2. 2026 Manuscript

    Lead authorship. Agent4Weakness: An Agentic Framework for In-Depth Model Weakness Discovery

    Xuanliang Zhang*, Chenghao Yang*, Zhoufutu Wen, Dingzirui Wang, Ge Zhang, Xiying Zhao, Tianren Feng, Jianpeng Jiao, Jingkai Liu, Zaiyuan Wang, Zuo Wang, Wenya Wu, Zhou Huan, Jin Chen, Wenhao Huang, Qingfu Zhu, Wanxiang Che

    under review · * equal contribution · † corresponding author

  3. 2025 arXiv

    Lead authorship. MME-CC: A Challenging Multi-Modal Evaluation Benchmark of Cognitive Capacity

    Kaiyuan Zhang*, Chenghao Yang*, Zhoufutu Wen, Sihang Yuan, Qiuyue Wang, Chaoyi Huang, Guosheng Zhu, He Wang, Huawenyu Lu, Jianing Wen, Jianpeng Jiao, Lishu Luo, Longxiang Liu, Sijin Wu, Xiaolei Zhu, Xuanliang Zhang, Ge Zhang, Yi Lin, Guang Shi, Chaoyou Fu, Wenhao Huang

    arXiv 2025 · * equal contribution · † corresponding author

  4. 2025 arXiv

    CryptoX: Compositional Reasoning Evaluation of Large Language Models

    Jiajun Shi*, Chaoren Wei*, Liqun Yang*, Zekun Moore Wang, Chenghao Yang, Ge Zhang, Stephen Huang, Tao Peng, Jian Yang, Zhoufutu Wen

    arXiv 2025 · * equal contribution · † corresponding author

Background

Experience

  1. 2025–present

    M.Eng., Cyberspace Security

    University of Science and Technology of China

    Advisors: Prof. Qi Chu & Prof. Nenghai Yu

    Research on evaluation of LLMs and LLM-based agents.

  2. 2024.11–2025.12

    Algorithm Intern

    ByteDance Seed, Seed-Evaluation Team

    Beijing, China

    Developed realistic evaluation pipelines and benchmark suites for large-model applications.

  3. 2023-2024

    Research Intern

    NExT++ Lab, National University of Singapore

    Remote · Supervised by Research Fellow An Zhang

  4. 2021-2025

    B.Eng., Information Security

    University of Science and Technology of China

Honors

Awards

  1. Outstanding Graduate Award · USTC · 2025
  2. Wang Xiaomo Talent Program Scholarship ×4 · USTC · 2021–2024