Conv Judge

conv_judge is the conversation evaluator in PatientHub. It uses an LLM plus a prompt-defined schema to score or inspect a therapy session and return structured results.

Use Cases

Evaluate standardized-patient consistency across a full session
Run prompt-defined inspection tasks such as contradiction extraction
Benchmark different client simulators with the same rubric

Notes

prompt_path must be set in the configuration. ConvJudge itself does not hard-code client_conv.yaml.
Use a model that supports structured response schemas. The evaluator passes a schema as response_format and relies on the model returning output that matches it.

Overview

Property	Value
Key	`conv_judge`
Class	`patienthub.evaluators.conv.ConvJudge`
Base Class	`patienthub.evaluators.base.LLMJudge`
Primary Input	A session-like payload containing `messages`
Prompt Source	`configs.prompt_path`
Modes	`session`, `turn`, `turn_by_turn`
Output Style	Prompt-driven structured JSON

How It Works

conv_judge is split across two files:

patienthub/evaluators/conv.py prepares conversation data.
patienthub/evaluators/base.py loads the prompt, builds the schema, calls the model, and returns structured output.

At runtime, the evaluator:

Loads the YAML prompt from prompt_path
Reads dimensions from that YAML
Dynamically builds Pydantic response models for each dimension
Formats the conversation according to granularity
Renders the prompt with Jinja2
Calls the chat model with response_format=<dimension model>
Returns one structured result per dimension
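
The schema-building steps above can be sketched with the standard library alone. This is a minimal illustration, not the ConvJudge implementation: it uses dataclasses.make_dataclass as a stand-in for Pydantic's dynamic models, the helper names leaf_field and build_dimension_model are hypothetical, and the "labels" key for categorical dimensions is an assumption. The paradigm-to-field mapping follows the Supported Paradigms table.

```python
from dataclasses import asdict, fields, make_dataclass
from typing import List, Literal


def leaf_field(dimension):
    """Map a dimension's paradigm to its (field name, field type) pair."""
    paradigm = dimension["paradigm"]
    if paradigm == "binary":
        return ("label", bool)
    if paradigm == "scalar":
        return ("score", int)
    if paradigm == "categorical":
        # "labels" is a hypothetical key holding the allowed category names.
        return ("label", Literal[tuple(dimension["labels"])])
    if paradigm == "extraction":
        return ("snippets", List[str])
    raise ValueError(f"unsupported paradigm: {paradigm!r}")


def build_dimension_model(dimension, use_reasoning=False):
    """Build one response class per dimension, with one leaf per aspect."""
    leaf_spec = [leaf_field(dimension)]
    if use_reasoning:
        leaf_spec.append(("reasoning", str))
    leaf = make_dataclass(dimension["name"].title() + "Leaf", leaf_spec)
    aspect_spec = [(aspect["name"], leaf) for aspect in dimension["aspects"]]
    return make_dataclass(dimension["name"].title(), aspect_spec)


# The dimension from the minimal prompt, as a plain dict.
dimension = {
    "name": "consistency",
    "paradigm": "scalar",
    "range": [1, 5],
    "aspects": [{"name": "conv_factual"}],
}
Consistency = build_dimension_model(dimension, use_reasoning=True)
Leaf = fields(Consistency)[0].type
result = Consistency(conv_factual=Leaf(score=4, reasoning="No contradictions."))
print(asdict(result))
```

Building one class per dimension keeps each model call focused on a single schema, which matches the "one structured result per dimension" behavior described above.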

Configuration

Hydra Configuration

defaults:
  - _self_
  - evaluator: conv_judge

evaluator:
  prompt_path: data/prompts/evaluator/client_conv.yaml
  granularity: session
  use_reasoning: false
  model_type: OPENAI
  model_name: gpt-4o

Python Usage

from omegaconf import OmegaConf
from patienthub.evaluators import get_evaluator
from patienthub.utils import load_json

session = load_json("data/sessions/default/badtherapist.json")

configs = OmegaConf.create({
    "agent_name": "conv_judge",
    "prompt_path": "data/prompts/evaluator/client_conv.yaml",
    "granularity": "session",
    "use_reasoning": False,
    "model_type": "OPENAI",
    "model_name": "gpt-4o",
})

evaluator = get_evaluator(agent_name="conv_judge", configs=configs, lang="en")
results = evaluator.evaluate(session)
print(results)

Command Line

patienthub evaluate \
  evaluator=conv_judge \
  evaluator.prompt_path=data/prompts/evaluator/client_conv.yaml \
  evaluator.granularity=session \
  evaluator.model_type=OPENAI \
  evaluator.model_name=gpt-4o \
  input_dir=data/sessions/default/badtherapist.json \
  output_dir=data/evaluations/default/conv_judge_demo.json

Prompt Structure

conv_judge does not hard-code a specific prompt file. Instead, it reads whatever YAML file you provide in prompt_path.

A minimal prompt looks like this:

sys_prompt: |
  You are evaluating a client in a therapy conversation.

  ## Conversation History:
  {{data.conv_history}}

  {% if data.last_response %}
  ## Latest Client Response:
  {{data.last_response}}
  {% endif %}

  Return the judgment in the required structured format.

dimensions:
  - name: consistency
    description: Whether the client is consistent across the conversation.
    paradigm: scalar
    range: [1, 5]
    aspects:
      - name: conv_factual
        description: Whether the client contradicts themselves across turns

The bundled example prompt is located at:

data/prompts/evaluator/client_conv.yaml
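
Before pointing prompt_path at a new file, it can help to sanity-check the parsed prompt. The sketch below works on a plain dict shaped like the minimal prompt above (as it would look after YAML parsing). The function check_dimensions and the specific rules it enforces, such as requiring a two-element range for scalar dimensions, are assumptions for illustration and are not checks that LLMJudge is known to perform.

```python
SUPPORTED_PARADIGMS = {"binary", "scalar", "categorical", "extraction"}


def check_dimensions(prompt):
    """Return a list of human-readable problems found in a parsed prompt."""
    problems = []
    if "sys_prompt" not in prompt:
        problems.append("missing sys_prompt")
    for index, dim in enumerate(prompt.get("dimensions", [])):
        name = dim.get("name", f"<dimension {index}>")
        paradigm = dim.get("paradigm")
        if paradigm not in SUPPORTED_PARADIGMS:
            problems.append(f"{name}: unsupported paradigm {paradigm!r}")
        if paradigm == "scalar":
            # Assumption: scalar dimensions carry a [low, high] range.
            bounds = dim.get("range")
            if not (isinstance(bounds, list) and len(bounds) == 2 and bounds[0] < bounds[1]):
                problems.append(f"{name}: scalar needs range [low, high]")
        for aspect in dim.get("aspects", []):
            if "name" not in aspect:
                problems.append(f"{name}: aspect without a name")
    return problems


# The minimal prompt from this section, as a parsed dict.
prompt = {
    "sys_prompt": "You are evaluating a client in a therapy conversation. ...",
    "dimensions": [
        {
            "name": "consistency",
            "description": "Whether the client is consistent across the conversation.",
            "paradigm": "scalar",
            "range": [1, 5],
            "aspects": [{"name": "conv_factual"}],
        }
    ],
}
print(check_dimensions(prompt))  # []
```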

Supported Paradigms

Each dimension in the prompt must use one of the paradigms supported by LLMJudge:

Paradigm	Returned Field	Example
`binary`	`label: bool`	`{"label": true}`
`scalar`	`score: int`	`{"score": 4}`
`categorical`	`label: Literal[...]`	`{"label": "very consistent"}`
`extraction`	`snippets: List[str]`	`{"snippets": ["Client: ..."]}`

If use_reasoning=true, each leaf result also includes a reasoning field.
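
As a rough illustration of the table, the following sketch checks whether a returned leaf matches its paradigm. The function leaf_is_valid and its parameters are hypothetical, and treating the scalar range as inclusive is an assumption.

```python
def leaf_is_valid(paradigm, leaf, score_range=None, labels=None, use_reasoning=False):
    """Check one leaf result dict against the field its paradigm should return."""
    if use_reasoning and not isinstance(leaf.get("reasoning"), str):
        return False
    if paradigm == "binary":
        return isinstance(leaf.get("label"), bool)
    if paradigm == "scalar":
        score = leaf.get("score")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(score, int) or isinstance(score, bool):
            return False
        # Assumption: the range bounds are inclusive.
        return score_range is None or score_range[0] <= score <= score_range[1]
    if paradigm == "categorical":
        label = leaf.get("label")
        return isinstance(label, str) and (labels is None or label in labels)
    if paradigm == "extraction":
        snippets = leaf.get("snippets")
        return isinstance(snippets, list) and all(isinstance(s, str) for s in snippets)
    raise ValueError(f"unsupported paradigm: {paradigm!r}")


print(leaf_is_valid("scalar", {"score": 4}, score_range=[1, 5]))       # True
print(leaf_is_valid("scalar", {"score": 9}, score_range=[1, 5]))       # False
print(leaf_is_valid("extraction", {"snippets": ["Client: ..."]}))      # True
```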

Input Format

conv_judge expects a session-like object with a messages field. The profile field is optional:

{
  "profile": {},
  "messages": [
    {"role": "therapist", "content": "How are you feeling today?"},
    {"role": "client", "content": "I have been anxious lately."},
    {"role": "therapist", "content": "When did it start?"},
    {"role": "client", "content": "About two months ago."}
  ]
}

Behavior by granularity:

session: evaluates the entire flattened conversation
turn: evaluates only the last response, while still providing prior context
turn_by_turn: evaluates each target turn individually (the target role defaults to client), returning one result per turn
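
The three modes can be pictured as different ways of slicing messages into the conv_history and last_response values that the prompt template reads. The sketch below is an assumption-laden illustration, not the code in conv.py: the function format_conversation is hypothetical, and the "Role: content" line format is a guess.

```python
def format_conversation(messages, granularity="session", target_role="client"):
    """Return one template payload per evaluation the chosen mode would run."""

    def flatten(msgs):
        # Assumed transcript format: one "Role: content" line per message.
        return "\n".join(f"{m['role'].capitalize()}: {m['content']}" for m in msgs)

    if granularity == "session":
        # Whole conversation, no separate target response.
        return [{"conv_history": flatten(messages), "last_response": None}]
    if granularity == "turn":
        # Only the final message is judged; earlier messages are context.
        return [{
            "conv_history": flatten(messages[:-1]),
            "last_response": messages[-1]["content"],
        }]
    if granularity == "turn_by_turn":
        # One payload per message from the target role, each with its own prefix.
        return [
            {"conv_history": flatten(messages[:i]), "last_response": m["content"]}
            for i, m in enumerate(messages)
            if m["role"] == target_role
        ]
    raise ValueError(f"unknown granularity: {granularity!r}")


messages = [
    {"role": "therapist", "content": "How are you feeling today?"},
    {"role": "client", "content": "I have been anxious lately."},
    {"role": "therapist", "content": "When did it start?"},
    {"role": "client", "content": "About two months ago."},
]
for mode in ("session", "turn", "turn_by_turn"):
    print(mode, len(format_conversation(messages, mode)))
```

For the four-message example, session and turn each produce a single payload, while turn_by_turn produces one payload per client message.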

Output Format

The output shape is determined by your prompt's dimensions.

For the minimal prompt above, the result would look like:

{
  "consistency": {
    "conv_factual": {
      "score": 4
    }
  }
}

If use_reasoning=true, the leaf result includes a short explanation:

{
  "consistency": {
    "conv_factual": {
      "score": 4,
      "reasoning": "The client is mostly consistent and does not show major contradictions."
    }
  }
}

With a richer prompt such as data/prompts/evaluator/client_conv.yaml, you can define multiple dimensions and paradigms in the same evaluation run.
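
When comparing several sessions or client simulators under the same rubric, the nested result is often easier to work with as flat key/value pairs. The helper below is a hypothetical sketch that assumes the dimension, aspect, leaf nesting shown in the examples above and drops the reasoning text.

```python
def flatten_results(results):
    """Flatten {dimension: {aspect: leaf}} into {"dimension.aspect.field": value}."""
    flat = {}
    for dimension, aspects in results.items():
        for aspect, leaf in aspects.items():
            for field, value in leaf.items():
                if field != "reasoning":  # keep only the judged value
                    flat[f"{dimension}.{aspect}.{field}"] = value
    return flat


results = {
    "consistency": {
        "conv_factual": {
            "score": 4,
            "reasoning": "The client is mostly consistent and does not show major contradictions.",
        }
    }
}
print(flatten_results(results))  # {'consistency.conv_factual.score': 4}
```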
