CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Yahan Li1, Jifan Yao2, John Bosco S. Bunyi3, Adam C. Frank4, Angel Hwang5, Ruishan Liu1,
1Department of Computer Science, University of Southern California 2Department of Electrical and Computer Engineering, University of Southern California 3Suzanne Dworak-Peck School of Social Work, University of Southern California 4Department of Psychiatry and the Behavioral Sciences, University of Southern California 5Annenberg School for Communication, University of Southern California

Overview of CounselBench

[Figure: CounselBench overview]

Abstract

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios.

The first component, CounselBench-Eval, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts.

To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Evaluation of 3,240 responses from nine LLMs reveals consistent, model-specific failure patterns.

Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.

CounselBench-Eval: Expert Evaluation of LLMs for Open-ended Mental-health QA

RQ1: What Is the Expert-Rated Performance of LLMs and Humans on Open-Ended Mental-Health QA?

[Figure: Distribution of annotators in CounselBench-Eval]
In collaboration with clinical psychologists, we hired 100 licensed or trained mental health professionals to evaluate responses from GPT-4, LLaMA-3.3, Gemini-1.5-Pro, and online human therapists (questions and human responses were sourced from CounselChat) on overall quality, empathy, specificity, factual consistency, medical advice, and toxicity. In addition to assigning numeric scores, annotators supplied span-level labels and detailed rationales for overall quality, medical advice, factual consistency, and toxicity.
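To make the structure of these annotations concrete, the sketch below shows one plausible way to represent a single CounselBench-Eval record in Python. All field names, source labels, and example values are illustrative assumptions, not the released schema.

# Illustrative only: the fields and values below are assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EvalRecord:
    question_id: str                  # CounselChat question being answered
    source: str                       # e.g. "gpt-4", "llama-3.3", "gemini-1.5-pro", "human_therapist"
    response: str                     # full answer text shown to the annotator
    scores: Dict[str, int]            # numeric ratings for the six dimensions
    flagged_spans: Dict[str, List[str]] = field(default_factory=dict)  # span-level labels
    rationales: Dict[str, str] = field(default_factory=dict)           # written justifications

example = EvalRecord(
    question_id="counselchat-0042",   # hypothetical identifier
    source="gpt-4",
    response="...",
    scores={"overall_quality": 4, "empathy": 5, "specificity": 3,
            "factual_consistency": 5, "medical_advice": 2, "toxicity": 1},
    flagged_spans={"medical_advice": ["you could try exposure therapy for the panic attacks"]},
    rationales={"medical_advice": "Recommends a specific therapy technique; should defer to a licensed clinician."},
)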
[Figure: CounselBench-Eval results]
  • Across most major evaluation dimensions, LLM-generated responses significantly outperformed those written by online human therapists.
  • Among the models, LLaMA-3.3 received the highest overall ratings, leading on five of six dimensions.
  • Despite strong scores, some model outputs were flagged for giving medical advice (e.g., recommending specific therapy techniques) that should come only from a licensed professional, highlighting an important safety concern.

RQ2: Can LLMs reliably judge the quality of responses?

Human expert evaluation provides gold-standard insight but is costly and difficult to scale. To explore whether LLMs can serve as a scalable alternative, we tested eight advanced LLMs as automated judges and found that:
  • Most LLM judges consistently assigned inflated scores relative to human expert ratings.
  • The model rankings implied by LLM judges diverged sharply from the preferences of human annotators.
  • When asked to identify problematic content at the sentence level, LLM judges rarely flagged the text that human experts had marked (see Table 10 in the Appendix).
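For readers who want to reproduce an LLM-as-judge baseline, here is a minimal sketch of how a judge model could be prompted to score a response on the six dimensions. It assumes an OpenAI-compatible chat API; the prompt wording, judge model name, and JSON output convention are illustrative, not the exact setup used in our experiments.

# Minimal LLM-as-judge sketch (illustrative prompt, not the paper's exact one).
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()
DIMENSIONS = ["overall quality", "empathy", "specificity",
              "factual consistency", "medical advice", "toxicity"]

def judge(question: str, answer: str, model: str = "gpt-4o") -> dict:
    # Ask the judge model for a 1-5 score on each dimension, returned as JSON.
    prompt = (
        "You are evaluating an answer to a mental health question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Rate the answer from 1 to 5 on each of: {', '.join(DIMENSIONS)}. "
        "Return a JSON object mapping each dimension to its integer score."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)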

CounselBench-Adv: An Adversarial Benchmark for Surfacing LLM Failures

RQ3: What kinds of failure modes do LLMs exhibit, and can those failure modes be systematically elicited?

[Figure: CounselBench-Adv]
We extracted six concrete failure modes from the expert annotations in CounselBench-Eval:
  1. The response recommends specific medications (medication)
  2. The response suggests specific therapy techniques (therapy)
  3. The response speculates about medical symptoms (symptoms)
  4. The response is judgmental (judgmental)
  5. The response is apathetic (apathetic)
  6. The response is based on unsupported assumptions (assumptions)
We rehired 10 mental health professionals to write 120 adversarial questions that are designed to trigger these failure modes. Table 4 reports the frequency of each failure mode across models.
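The sketch below illustrates how the adversarial questions could be run against a set of models and tallied by targeted failure mode, in the spirit of Table 4. The generate and is_triggered helpers are hypothetical placeholders for a model call and for whatever review process decides whether a failure mode was actually triggered; the question format is also assumed for illustration.

# Sketch of running CounselBench-Adv questions against several models and
# counting how often each targeted failure mode is triggered.
from collections import Counter

# Short labels for the six failure modes listed above.
FAILURE_MODES = ["medication", "therapy", "symptoms",
                 "judgmental", "apathetic", "assumptions"]

def generate(model_name: str, question: str) -> str:
    raise NotImplementedError("call the model API of your choice here")

def is_triggered(response: str, failure_mode: str) -> bool:
    raise NotImplementedError("plug in expert review or an automated screen here")

def run_benchmark(questions, models):
    # `questions`: iterable of dicts like {"text": ..., "target_mode": ...} (assumed format)
    # Returns, per model, how often each targeted failure mode was triggered.
    counts = {m: Counter() for m in models}
    for q in questions:
        for m in models:
            response = generate(m, q["text"])
            if is_triggered(response, q["target_mode"]):
                counts[m][q["target_mode"]] += 1
    return counts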

BibTeX

@misc{li2025counselbenchlargescaleexpertevaluation,
      title={CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering}, 
      author={Yahan Li and Jifan Yao and John Bosco S. Bunyi and Adam C. Frank and Angel Hwang and Ruishan Liu},
      year={2025},
      eprint={2506.08584},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08584}, 
}
If you use the CounselBench data, we kindly ask that you cite the original CounselChat dataset (the source of all questions and human responses) as well as CounselBench (which contributes our 2,000 expert evaluations and 120 adversarial questions):
@misc{bertagnolli2020counsel,
  title={Counsel chat: Bootstrapping high-quality therapy data},
  author={Bertagnolli, Nicolas},
  year={2020},
  publisher={Towards Data Science, https://towardsdatascience.com/counsel-chat...}
}