LLM Judge

The LLM Judge evaluator uses large language models to assess therapy sessions across multiple dimensions.

Overview

Property   Value
Key        llm_judge
Type       LLM-based
Focus      Multi-dimensional Assessment

Description

The LLM Judge is an AI-powered evaluation agent that analyzes therapy conversations and provides scores, feedback, and suggestions across multiple therapeutic dimensions. It leverages the reasoning capabilities of large language models to assess nuanced aspects of therapeutic interactions.
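
Internally, an LLM-based judge of this kind typically renders the session transcript into a rubric-style prompt, asks the model for structured scores, and aggregates them into an overall result. The sketch below illustrates that flow under assumed helper names (build_judge_prompt, judge_session, call_llm); it is not the PatientHub implementation.

import json

def build_judge_prompt(conversation, criteria):
    # Render the transcript and ask the model to score each criterion from 0 to 1.
    transcript = "\n".join(f"{turn['role']}: {turn['content']}" for turn in conversation)
    return (
        "You are evaluating a therapy session.\n"
        f"Score each of these criteria from 0 to 1: {', '.join(criteria)}.\n"
        'Reply with JSON: {"criteria_scores": {...}, "feedback": "...", "suggestions": [...]}\n\n'
        f"Transcript:\n{transcript}"
    )

def judge_session(conversation, criteria, call_llm):
    # call_llm is any callable that sends a prompt to an LLM and returns its text reply.
    parsed = json.loads(call_llm(build_judge_prompt(conversation, criteria)))
    scores = parsed["criteria_scores"]
    parsed["overall_score"] = sum(scores.values()) / len(scores)
    return parsed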

Evaluation Criteria

The LLM Judge can evaluate sessions on the following criteria; a sketch mapping them to configuration keys follows the list:

  • Empathy - How well the therapist demonstrates understanding and compassion
  • Adherence - Whether therapeutic techniques are properly applied
  • Effectiveness - The potential therapeutic value of the session
  • Safety - Appropriate handling of crisis situations
  • Rapport - Quality of therapeutic alliance
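
Each bullet corresponds to a lowercase key accepted in the criteria configuration list. The mapping below is an illustrative rubric sketch, not code from the evaluator:

# Hypothetical rubric (not part of the evaluator's source): maps the lowercase
# configuration keys to the guidance a judging prompt could include.
RUBRIC = {
    "empathy": "Does the therapist acknowledge and validate the client's feelings?",
    "adherence": "Are the stated therapeutic techniques applied as intended?",
    "effectiveness": "Is the session likely to provide therapeutic value?",
    "safety": "Are risk indicators and crisis situations handled appropriately?",
    "rapport": "Is a collaborative, trusting therapeutic alliance maintained?",
}

# Selecting a subset mirrors the criteria list passed in configuration.
selected = ["empathy", "adherence", "safety"]
rubric_lines = [f"- {key}: {RUBRIC[key]}" for key in selected]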

Configuration

YAML Configuration

evaluator:
  key: llm_judge
  config:
    model: gpt-4o
    temperature: 0.3
    criteria:
      - empathy
      - adherence
      - effectiveness

Python Usage

from patienthub.evaluators import EvaluatorRegistry

evaluator = EvaluatorRegistry.create("llm_judge", config={
    "model": "gpt-4o",
    "temperature": 0.3,
    "criteria": ["empathy", "adherence", "effectiveness"]
})

results = evaluator.evaluate(conversation_history)

Parameters

Parameter     Type     Default   Description
model         string   gpt-4o    The LLM model to use for evaluation
temperature   float    0.3       Controls response randomness (lower = more consistent)
criteria      list     all       Which criteria to evaluate
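
If the defaults behave as listed, only the values you want to change need to appear in config. A minimal sketch, assuming partial configurations are accepted and fall back to the documented defaults:

from patienthub.evaluators import EvaluatorRegistry

# Sketch: assumes an omitted value falls back to its default
# (model gpt-4o, temperature 0.3, all criteria); verify against your installed version.
evaluator = EvaluatorRegistry.create("llm_judge", config={"temperature": 0.0})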

Output Format

{
  "overall_score": 0.85,
  "criteria_scores": {
    "empathy": 0.9,
    "adherence": 0.8,
    "effectiveness": 0.85
  },
  "feedback": "The therapist demonstrated strong empathy...",
  "suggestions": ["Consider exploring...", "Could improve..."]
}
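
Assuming evaluate returns a dictionary shaped like the JSON above, downstream code can gate on individual criterion scores. A small sketch (flag_low_scores is a hypothetical helper, not part of PatientHub):

# Sample results shaped like the output above; in practice this comes from evaluator.evaluate(...).
results = {
    "overall_score": 0.85,
    "criteria_scores": {"empathy": 0.9, "adherence": 0.8, "effectiveness": 0.85},
    "feedback": "The therapist demonstrated strong empathy...",
    "suggestions": ["Consider exploring...", "Could improve..."],
}

def flag_low_scores(results, threshold=0.7):
    # Return every criterion whose score falls below the threshold.
    return {
        criterion: score
        for criterion, score in results["criteria_scores"].items()
        if score < threshold
    }

print(flag_low_scores(results))  # {} for this sample; non-empty when a criterion underperforms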

Use Cases

  • Therapist Training - Automated feedback for trainee therapists
  • Research - Consistent evaluation across large datasets (see the batch sketch after this list)
  • Quality Assurance - Monitoring AI therapist performance
  • Benchmarking - Comparing different therapeutic approaches
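
For the research and benchmarking cases, the natural pattern is a batch loop that scores every session with the same configuration and aggregates per-criterion averages. A sketch, assuming each session is a list of turn dictionaries as in the example below (evaluate_dataset is a hypothetical helper):

from collections import defaultdict

from patienthub.evaluators import EvaluatorRegistry

def evaluate_dataset(sessions, criteria=("empathy", "adherence", "effectiveness")):
    # Score every session with the same judge and average each criterion across the dataset.
    evaluator = EvaluatorRegistry.create("llm_judge", config={"criteria": list(criteria)})
    totals = defaultdict(float)
    for session in sessions:
        results = evaluator.evaluate(session)
        for criterion, score in results["criteria_scores"].items():
            totals[criterion] += score
    return {criterion: total / len(sessions) for criterion, total in totals.items()}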

Example

from patienthub.evaluators import EvaluatorRegistry

# Create evaluator
evaluator = EvaluatorRegistry.create("llm_judge", config={
    "model": "gpt-4o",
    "criteria": ["empathy", "adherence", "safety"]
})

# Load conversation
conversation = [
    {"role": "therapist", "content": "How are you feeling today?"},
    {"role": "client", "content": "Not great, I've been really anxious..."},
    # ... more turns
]

# Evaluate
results = evaluator.evaluate(conversation)

print(f"Overall Score: {results['overall_score']}")
print(f"Empathy: {results['criteria_scores']['empathy']}")
print(f"Feedback: {results['feedback']}")