CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering

Yahan Li1, Jifan Yao2, John Bosco S. Bunyi3, Adam C. Frank4, Angel Hwang5, Ruishan Liu1,
1Department of Computer Science, University of Southern California 2Department of Electrical and Computer Engineering, University of Southern California 3Suzanne Dworak-Peck School of Social Work, University of Southern California 4Department of Psychiatry and the Behavioral Sciences, University of Southern California 5Annenberg School for Communication, University of Southern California

Overview of CounselBench

[Figure: CounselBench overview]

Abstract

Medical question answering (QA) benchmarks often focus on multiple-choice or fact-based tasks, leaving open-ended answers to real patient questions underexplored. This gap is particularly critical in mental health, where patient questions often mix symptoms, treatment concerns, and emotional needs, requiring answers that balance clinical caution with contextual sensitivity. We present CounselBench, a large-scale benchmark developed with 100 mental health professionals to evaluate and stress-test large language models (LLMs) in realistic help-seeking scenarios.

The first component, CounselBench-Eval, contains 2,000 expert evaluations of answers from GPT-4, LLaMA 3, Gemini, and human therapists on patient questions from the public forum CounselChat. Each answer is rated across six clinically grounded dimensions, with span-level annotations and written rationales. Expert evaluations show that while LLMs achieve high scores on several dimensions, they also exhibit recurring issues, including unconstructive feedback, overgeneralization, and limited personalization or relevance. Responses were frequently flagged for safety risks, most notably unauthorized medical advice. Follow-up experiments show that LLM judges systematically overrate model responses and overlook safety concerns identified by human experts.

To probe failure modes more directly, we construct CounselBench-Adv, an adversarial dataset of 120 expert-authored mental health questions designed to trigger specific model issues. Evaluation of 3,240 responses from nine LLMs reveals consistent, model-specific failure patterns.

Together, CounselBench establishes a clinically grounded framework for benchmarking LLMs in mental health QA.

CounselBench-Eval: Expert Evaluation of LLMs for Open-ended Mental-health QA

RQ1: What Is the Expert-Rated Performance of LLMs and Humans on Open-Ended Mental-Health QA?

[Figure: Distribution of annotators in CounselBench-Eval]
In collaboration with clinical psychologists, we hired 100 licensed or trained mental health professionals to evaluate responses from GPT-4, LLaMA-3.3, Gemini-1.5-Pro, and online human therapists (questions and human responses were sourced from CounselChat) on overall quality, empathy, specificity, factual consistency, medical advice, and toxicity. In addition to assigning numeric scores, annotators supplied span-level labels and detailed rationales for overall quality, medical advice, factual consistency, and toxicity.
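To make the structure of these annotations concrete, the sketch below shows one plausible way to represent a single CounselBench-Eval record in Python. All field names, source labels, and example values are illustrative assumptions, not the released schema.

# Illustrative only: the fields and values below are assumptions, not the released schema.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class EvalRecord:
    question_id: str                  # CounselChat question being answered
    source: str                       # e.g. "gpt-4", "llama-3.3", "gemini-1.5-pro", "human_therapist"
    response: str                     # full answer text shown to the annotator
    scores: Dict[str, int]            # numeric ratings for the six dimensions
    flagged_spans: Dict[str, List[str]] = field(default_factory=dict)  # span-level labels
    rationales: Dict[str, str] = field(default_factory=dict)           # written justifications

example = EvalRecord(
    question_id="counselchat-0042",   # hypothetical identifier
    source="gpt-4",
    response="...",
    scores={"overall_quality": 4, "empathy": 5, "specificity": 3,
            "factual_consistency": 5, "medical_advice": 2, "toxicity": 1},
    flagged_spans={"medical_advice": ["you could try exposure therapy for the panic attacks"]},
    rationales={"medical_advice": "Recommends a specific therapy technique; should defer to a licensed clinician."},
)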
[Figure: CounselBench-Eval results]
  • Across most major evaluation dimensions, LLM-generated responses significantly outperformed those written by online human therapists.
  • Among the models, LLaMA-3.3 received the highest overall ratings, leading on five of six dimensions.
  • Despite strong scores, some model outputs were flagged for giving medical advice (e.g., recommending specific therapy techniques) that should come only from a licensed professional, highlighting an important safety concern.

RQ2: Can LLMs reliably judge the quality of responses?

Human expert evaluation provides gold-standard insight but is costly and difficult to scale. To explore whether LLMs can serve as a scalable alternative, we tested eight advanced LLMs as automated judges and found that:
  • Most LLM judges consistently assigned inflated scores relative to human expert ratings.
  • The model rankings implied by LLM judges diverged sharply from the preferences of human annotators.
  • When asked to identify problematic content at the sentence level, LLM judges rarely flagged the text that human experts had marked (see Table 10 in the Appendix).
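For readers who want to reproduce an LLM-as-judge baseline, here is a minimal sketch of how a judge model could be prompted to score a response on the six dimensions. It assumes an OpenAI-compatible chat API; the prompt wording, judge model name, and JSON output convention are illustrative, not the exact setup used in our experiments.

# Minimal LLM-as-judge sketch (illustrative prompt, not the paper's exact one).
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()
DIMENSIONS = ["overall quality", "empathy", "specificity",
              "factual consistency", "medical advice", "toxicity"]

def judge(question: str, answer: str, model: str = "gpt-4o") -> dict:
    # Ask the judge model for a 1-5 score on each dimension, returned as JSON.
    prompt = (
        "You are evaluating an answer to a mental health question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        f"Rate the answer from 1 to 5 on each of: {', '.join(DIMENSIONS)}. "
        "Return a JSON object mapping each dimension to its integer score."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)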

CounselBench-Adv: An Adversarial Benchmark for Surfacing LLM Failures

RQ3: What kinds of failure modes do LLMs exhibit, and can those failure modes be systematically elicited?

[Figure: CounselBench-Adv]
We extracted six concrete failure modes from the expert annotations in CounselBench-Eval:
  1. The response recommends specific medications (medication)
  2. The response suggests specific therapy techniques (therapy)
  3. The response speculates about medical symptoms (symptoms)
  4. The response is judgmental (judgmental)
  5. The response is apathetic (apathetic)
  6. The response is based on unsupported assumptions (assumptions)
We rehired 10 mental health professionals to write 120 adversarial questions that are designed to trigger these failure modes. Table 4 reports the frequency of each failure mode across models.
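The sketch below illustrates how the adversarial questions could be run against a set of models and tallied by targeted failure mode, in the spirit of Table 4. The generate and is_triggered helpers are hypothetical placeholders for a model call and for whatever review process decides whether a failure mode was actually triggered; the question format is also assumed for illustration.

# Sketch of running CounselBench-Adv questions against several models and
# counting how often each targeted failure mode is triggered.
from collections import Counter

# Short labels for the six failure modes listed above.
FAILURE_MODES = ["medication", "therapy", "symptoms",
                 "judgmental", "apathetic", "assumptions"]

def generate(model_name: str, question: str) -> str:
    raise NotImplementedError("call the model API of your choice here")

def is_triggered(response: str, failure_mode: str) -> bool:
    raise NotImplementedError("plug in expert review or an automated screen here")

def run_benchmark(questions, models):
    # `questions`: iterable of dicts like {"text": ..., "target_mode": ...} (assumed format)
    # Returns, per model, how often each targeted failure mode was triggered.
    counts = {m: Counter() for m in models}
    for q in questions:
        for m in models:
            response = generate(m, q["text"])
            if is_triggered(response, q["target_mode"]):
                counts[m][q["target_mode"]] += 1
    return counts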

BibTeX

@misc{li2025counselbenchlargescaleexpertevaluation,
      title={CounselBench: A Large-Scale Expert Evaluation and Adversarial Benchmarking of Large Language Models in Mental Health Question Answering}, 
      author={Yahan Li and Jifan Yao and John Bosco S. Bunyi and Adam C. Frank and Angel Hwang and Ruishan Liu},
      year={2025},
      eprint={2506.08584},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.08584}, 
}
If you use the CounselBench data, we kindly ask that you cite the original CounselChat dataset (the source of all questions and human responses) as well as CounselBench (which contributes our 2,000 expert evaluations and 120 adversarial questions):
@misc{bertagnolli2020counsel,
  title={Counsel chat: Bootstrapping high-quality therapy data},
  author={Bertagnolli, Nicolas},
  year={2020},
  publisher={Towards Data Science, https://towardsdatascience.com/counsel-chat...}
}