PatientZero

Scaling Synthetic Patient Agents to Real-World Distributions without Real Patient Data

Paper: arXiv

Generates clinically grounded static patient records from disease knowledge and attribute priors. The current implementation follows the Patient-Zero hierarchical synthesis idea: it starts from an abstract disease concept, standardizes disease knowledge, samples patient attributes, generates symptoms, creates examination results, validates the generated content, and appends the final record to a character JSON file.

Overview

Property	Value
Key	`patientZero`
Type	LLM-based + sampled priors
Output	PatientZero character records

Key Features

Disease-grounded generation (Stage I): Uses source-grounded raw_outlines.json and standardized disease_outlines.json as the clinical scaffold.
Attribute sampling (Stage II): Samples demographics, socioeconomic factors, lifestyle factors, communication style, and severity from global and disease-specific priors.
Stage-wise synthesis (Stages III & IV): Generates patient background, symptom trajectory, and examination results in separate pipeline stages.
Verify-and-regenerate loop (Stage V): Validates patient record content and examination results before returning the final profile.
Config-parameterized: Runs from disease_key (and optional random_seed); the generate CLI appends returned cases into one JSON list.

How It Works

generate_character(data=None) runs the full PatientZero pipeline for the target disease. When a seed record is supplied, the disease is read from data["disease_key"]; otherwise the configured disease_key is used:

Stage I: Disease outline standardization - if the disease is not already present in disease_outlines.json, the generator converts the matching raw outline into a standardized DiseaseOutline.
Stage II: Attribute permutation - samples a valid patient attribute vector from demo_priors.json plus disease-specific overrides in disease_priors.json.
Stage III: Patient record and symptoms - generates patient_profile and symptom_trajectory from the sampled attributes and disease outline.
Stage IV: Examination results - generates mental status exam, scale assessments, risk assessment, exclusionary findings, and a clinical summary.
Stage V: Validation - checks Stage III and Stage IV outputs. Failed generations are retried with revision guidance.
Return - the validated record is returned; the generate CLI appends it to the output bank.
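The Stage V retry behavior in the list above can be pictured as a small loop. This is only a sketch under assumptions: generate_with_validation, the generate and validate callables, and max_attempts are hypothetical names chosen for illustration, not PatientZero's actual API, and the toy functions stand in for the LLM-backed stages.

```python
from typing import Callable, List, Optional


def generate_with_validation(
    generate: Callable[[Optional[str]], dict],
    validate: Callable[[dict], List[str]],
    max_attempts: int = 3,  # hypothetical knob, unrelated to the max_retries config
) -> dict:
    """Regenerate until validation passes, feeding issues back as guidance."""
    guidance: Optional[str] = None
    for _ in range(max_attempts):
        record = generate(guidance)
        issues = validate(record)
        if not issues:
            return record
        # Turn the validator's findings into revision guidance for the next try.
        guidance = "; ".join(issues)
    raise RuntimeError(f"validation failed after {max_attempts} attempts: {guidance}")


# Toy stand-ins for the generation and validation stages (assumptions).
def toy_generate(guidance: Optional[str]) -> dict:
    record = {"patient_profile": {"age": 34}, "symptom_trajectory": {}}
    if guidance:  # pretend the revision guidance fixed the omission
        record["symptom_trajectory"]["chief_complaint"] = "low mood"
    return record


def toy_validate(record: dict) -> List[str]:
    if "chief_complaint" not in record["symptom_trajectory"]:
        return ["symptom_trajectory is missing chief_complaint"]
    return []


result = generate_with_validation(toy_generate, toy_validate)
```

In this toy run the first attempt fails validation and the second succeeds once the guidance is passed back in. How the real generator phrases revision guidance is not specified here.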

The final observable profile follows the paper's static record definition P = {B, S, E}:

B: background patient profile
S: symptom trajectory
E: examination results

The disease outline O is used as a latent scaffold and is not part of the role-play schema requirement.

Supported Diseases

The default PatientZero resources currently include the following diseases. These entries have the required raw outline, standardized disease outline, disease-specific attribute priors, and examination references, so they can be used directly as disease_key values.

Disease Key	Disease Name
`adhd`	ADHD
`anxiety_disorder`	Anxiety Disorder
`bipolar_disorder`	Bipolar Disorder
`depression`	Depression
`insomnia`	Insomnia
`ocd`	OCD
`ptsd`	PTSD
`schizophrenia`	Schizophrenia

Usage

PatientZero is config-parameterized and needs no seed list. Choose the disease with generator.disease_key and run the CLI:

# One case for a disease
patienthub generate generator=patientZero generator.disease_key=depression

# Several cases appended to the output bank
patienthub generate generator=patientZero generator.disease_key=insomnia num_samples=10

Cases are appended to data/characters/patientZero.json (override with output_path).

Alternatively, drive multiple diseases from a seed list at data/seeds/patientZero.json, where each record selects a disease (and optional seed):

patienthub generate generator=patientZero input_path=data/seeds/patientZero.json
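A seed list for several diseases can also be produced programmatically. The following is a minimal sketch, assuming each record carries only disease_key and random_seed; build_seed_list and seeds_per_disease are hypothetical helper names, and the file is written to a temporary directory rather than data/seeds/ so the example has no side effects.

```python
import json
import tempfile
from pathlib import Path


def build_seed_list(disease_keys, seeds_per_disease):
    """Return one seed record per (disease, seed) pair."""
    return [
        {"disease_key": key, "random_seed": seed}
        for key in disease_keys
        for seed in range(seeds_per_disease)
    ]


seed_list = build_seed_list(["depression", "insomnia"], seeds_per_disease=2)

# Write the list as JSON and read it back to confirm it round-trips.
with tempfile.TemporaryDirectory() as tmp:
    seed_path = Path(tmp) / "patientZero.json"
    seed_path.write_text(json.dumps(seed_list, indent=2), encoding="utf-8")
    reloaded = json.loads(seed_path.read_text(encoding="utf-8"))
```

Copying the resulting file to data/seeds/patientZero.json would let the input_path command above consume it.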

Configuration

Parameter	Type	Default	Description
`agent_name`	string	`patientZero`	Generator identifier
`prompt_path`	string	`data/prompts/generator/patientZero.yaml`	Path to PatientZero prompts
`resource_dir`	string	`data/resources/PatientZero`	Folder for source data, priors, examination references, and reusable disease outlines
`disease_key`	string	`depression`	Target disease key, for example `depression` or `insomnia`
`random_seed`	int/null	`None`	Optional seed for reproducible attribute sampling
`model_type`	string	`"OPENAI"`	Model provider key
`model_name`	string	`"gpt-4o"`	Model identifier
`temperature`	float	`0.7`	Sampling temperature
`max_tokens`	int	`8192`	Max response tokens
`max_retries`	int	`3`	API retry attempts

Seed Record Format

PatientZero can run without a seed list, using the configured disease_key. To drive several diseases from one run, provide data/seeds/patientZero.json, a JSON list whose records are validated against PatientZeroSeed:

[
  { "disease_key": "depression", "random_seed": 0 },
  { "disease_key": "insomnia" }
]

Field	Type	Description
`disease_key`	string	Target disease key
`random_seed`	int/null	Optional seed for reproducible attribute sampling
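The two fields in the table above can be checked before a run with a few lines of Python. This sketch uses a hypothetical SeedRecord dataclass as a stand-in for PatientZeroSeed; the real model may enforce different or additional rules.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SeedRecord:
    """Hypothetical stand-in for PatientZeroSeed (assumption)."""
    disease_key: str
    random_seed: Optional[int] = None


def parse_seed_list(text: str) -> List[SeedRecord]:
    """Parse a JSON seed list and check each record's field types."""
    raw = json.loads(text)
    if not isinstance(raw, list):
        raise ValueError("seed file must contain a JSON list")
    records = []
    for index, item in enumerate(raw):
        if not isinstance(item, dict):
            raise ValueError(f"record {index}: expected a JSON object")
        key = item.get("disease_key")
        if not isinstance(key, str) or not key:
            raise ValueError(f"record {index}: disease_key must be a non-empty string")
        seed = item.get("random_seed")
        # bool is a subclass of int in Python, so reject it explicitly.
        if seed is not None and (isinstance(seed, bool) or not isinstance(seed, int)):
            raise ValueError(f"record {index}: random_seed must be an integer or null")
        records.append(SeedRecord(disease_key=key, random_seed=seed))
    return records


records = parse_seed_list(
    '[{"disease_key": "depression", "random_seed": 0}, {"disease_key": "insomnia"}]'
)
```

A record without random_seed parses with random_seed set to None, matching the optional field in the table.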

Output Format

The final output file is a JSON array. Each call appends one validated case:

[
  {
    "metadata": {
      "disease_name": "Major Depressive Disorder",
      "disease_key": "depression",
      "severity_level": "severe",
      "seed": 0,
      "case_index": 0,
      "case_id": "depression_0"
    },
    "patient_profile": {
      "age": 34,
      "age_strata": "Adult",
      "biological_sex": "Female",
      "physiological_status": "Non-pregnant",
      "ethnicity": "Caucasian",
      "geography": "Urban (Metropolitan)",
      "communication_style": "Reserved",
      "name": "Example Patient",
      "lifestyle_summary": "...",
      "past_medical_history": "...",
      "family_history": "..."
    },
    "symptom_trajectory": {
      "current_severity": "severe",
      "chief_complaint": "...",
      "onset": "...",
      "duration": "...",
      "selected_symptoms": [
        {
          "name": "Persistent low mood",
          "duration": "8 weeks",
          "severity": "severe",
          "impact": "..."
        }
      ]
    },
    "examination_results": {
      "mental_status_exam": {
        "appearance_behavior": "...",
        "speech": "...",
        "mood_affect": "..."
      },
      "scale_assessments": [
        {
          "scale_name": "PHQ-9",
          "score": 22,
          "severity_interpretation": "Severe",
          "rationale": "..."
        }
      ],
      "risk_assessment": {
        "suicide_risk": "moderate",
        "violence_risk": "none",
        "self_neglect_risk": "low",
        "rationale": "..."
      },
      "clinical_summary": "..."
    }
  }
]

The role-play schema requires patient_profile, symptom_trajectory, and examination_results. Metadata and validation logs may be present in generator output, but they are not required for adapter-level profile conversion.
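The requirement above can be expressed as a short check. This is an illustrative sketch, not the adapter's actual code: missing_sections and to_role_play_profile are hypothetical helpers, and the check only verifies that the three sections exist as objects, not their inner fields.

```python
REQUIRED_SECTIONS = ("patient_profile", "symptom_trajectory", "examination_results")


def missing_sections(record: dict) -> list:
    """List required sections that are absent or not JSON objects."""
    return [name for name in REQUIRED_SECTIONS if not isinstance(record.get(name), dict)]


def to_role_play_profile(record: dict) -> dict:
    """Keep only the three required sections, dropping metadata and logs."""
    missing = missing_sections(record)
    if missing:
        raise ValueError(f"record is missing sections: {', '.join(missing)}")
    return {name: record[name] for name in REQUIRED_SECTIONS}


case = {
    "metadata": {"disease_key": "depression", "case_id": "depression_0"},
    "patient_profile": {"age": 34},
    "symptom_trajectory": {"current_severity": "severe"},
    "examination_results": {"clinical_summary": "..."},
}
profile = to_role_play_profile(case)
```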

Required Data Files

PatientZero depends on both disease knowledge and attribute priors. With the default paths, the generator expects these files:

File	Location	Purpose
`raw_outlines.json`	`resource_dir`	Human-provided disease knowledge gathered from authoritative sources
`demo_priors.json`	`resource_dir`	Global demographic, socioeconomic, lifestyle, and communication distributions
`disease_priors.json`	`resource_dir`	Disease-specific overrides for age, sex, severity, or other sampled attributes
`exam_references.json`	`resource_dir`	Disease-specific clinical scales and exclusionary findings
`disease_outlines.json`	`resource_dir`	Reusable standardized disease outlines used by later generation stages

resource_dir contains both human-provided priors and reusable standardized outlines. disease_outlines.json may be generated by Stage I, but it is saved beside the source files because it can be reused across future runs.
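A quick pre-flight check of resource_dir might look like the sketch below. It assumes the four human-provided files must exist up front, while disease_outlines.json is treated as optional because Stage I can create it; that split is an interpretation of the text above, and missing_source_files is a hypothetical helper.

```python
import tempfile
from pathlib import Path

# Human-provided inputs (assumed to be required before any run).
SOURCE_FILES = (
    "raw_outlines.json",
    "demo_priors.json",
    "disease_priors.json",
    "exam_references.json",
)
# May be created by Stage I, so its absence is not treated as an error here.
GENERATED_FILES = ("disease_outlines.json",)


def missing_source_files(resource_dir) -> list:
    """Return the names of source files that are not present in resource_dir."""
    root = Path(resource_dir)
    return [name for name in SOURCE_FILES if not (root / name).is_file()]


# Demonstrate on a temporary directory containing only two of the files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("raw_outlines.json", "demo_priors.json"):
        (Path(tmp) / name).write_text("{}", encoding="utf-8")
    missing = missing_source_files(tmp)
```

Pointing missing_source_files at data/resources/PatientZero would run the same check against the default location.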

Adding a New Disease

To add a disease, choose a stable normalized key such as social_anxiety_disorder and add matching entries to the source files.

1. Gather Authoritative Disease Knowledge

Prefer official or professional medical sources:

Government or public-health agencies, for example NIMH, CDC, WHO, NIH, NHLBI
Professional clinical organizations or guideline bodies
Peer-reviewed reviews or diagnostic resources when official pages are insufficient

Collect content that can support:

Disease summary and core clinical characteristics
Common symptoms and symptom timing
Severity-specific presentation differences
Special populations, such as children, older adults, and pregnant patients
Red flags, contraindications, and differential diagnoses
Recommended scales or examinations

2. Add the Raw Outline

Add one entry to data/resources/PatientZero/raw_outlines.json:

{
  "social_anxiety_disorder": {
    "disease_name": "Social Anxiety Disorder",
    "stage": "disease_outline_standardization",
    "retrieval_date": "2026-05-07",
    "evidence_note": "Summarized from official clinical source material for synthetic case generation.",
    "sources": [
      {
        "title": "Source title",
        "publisher": "Source publisher",
        "url": "https://example.org/source-page"
      }
    ],
    "raw_outline": {
      "summary": "...",
      "core_features": ["..."],
      "common_symptoms": ["..."],
      "course_and_timing": ["..."],
      "severity_notes": {
        "mild": "...",
        "moderate": "...",
        "severe": "..."
      },
      "special_populations": ["..."],
      "red_flags": ["..."],
      "differential_considerations": ["..."]
    }
  }
}

The generator can standardize this into disease_outlines.json automatically when disease_key is used and the standardized outline does not yet exist.
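The "standardize only when missing" rule can be sketched as follows. The helper names are hypothetical, and two assumptions are made: both JSON files are top-level mappings keyed by disease key, and every raw_outline field shown in the example entry is expected to be present.

```python
RAW_OUTLINE_FIELDS = (
    "summary",
    "core_features",
    "common_symptoms",
    "course_and_timing",
    "severity_notes",
    "special_populations",
    "red_flags",
    "differential_considerations",
)


def missing_outline_fields(entry: dict) -> list:
    """List raw_outline fields from the example entry that are absent."""
    outline = entry.get("raw_outline", {})
    return [field for field in RAW_OUTLINE_FIELDS if field not in outline]


def needs_standardization(disease_key: str, raw_outlines: dict, disease_outlines: dict) -> bool:
    """True when a raw outline exists but no standardized outline does yet."""
    if disease_key in disease_outlines:
        return False  # reuse the stored standardized outline
    if disease_key not in raw_outlines:
        raise KeyError(f"no raw outline found for {disease_key!r}")
    return True


raw_outlines = {
    "social_anxiety_disorder": {"raw_outline": {"summary": "...", "core_features": ["..."]}}
}
disease_outlines = {"depression": {}}
todo = needs_standardization("social_anxiety_disorder", raw_outlines, disease_outlines)
gaps = missing_outline_fields(raw_outlines["social_anxiety_disorder"])
```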

3. Add Disease-Specific Attribute Priors

Add one entry to data/resources/PatientZero/disease_priors.json. Each override should include source, method, and labels:

{
  "social_anxiety_disorder": {
    "age_strata": {
      "source": [
        {
          "title": "Source title",
          "url": "https://example.org/prevalence-page",
          "evidence": "Age-specific prevalence or onset information."
        }
      ],
      "method": "labels are normalized PatientZero sampling weights mapped from source age bands.",
      "labels": {
        "Child": 0.25,
        "Adult": 0.7,
        "Elderly": 0.05
      }
    },
    "biological_sex": {
      "source": [
        {
          "title": "Source title",
          "url": "https://example.org/prevalence-page",
          "evidence": "Sex-specific prevalence information."
        }
      ],
      "method": "labels are normalized PatientZero sampling weights mapped from source male/female prevalence.",
      "labels": {
        "Male": 0.45,
        "Female": 0.55
      }
    },
    "severity_level": {
      "source": [
        {
          "title": "Source title",
          "url": "https://example.org/severity-page",
          "evidence": "Severity, impairment, or clinical-presentation distribution information."
        }
      ],
      "method": "labels map source severity or impairment categories to mild/moderate/severe.",
      "labels": {
        "mild": 0.4,
        "moderate": 0.35,
        "severe": 0.25
      }
    }
  }
}

Only fields that differ from the global priors need to be added. Fields not specified here fall back to demo_priors.json.
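The fallback rule and seeded sampling can be illustrated with the standard library. This is a sketch under assumptions: the layout of demo_priors.json is not shown in this document, so global priors are modeled as a plain field-to-weights mapping, the numbers and the "Talkative" label are placeholders, and any validity constraints between attributes are ignored. effective_priors, is_normalized, and sample_attributes are hypothetical helpers.

```python
import math
import random


def effective_priors(global_priors: dict, disease_overrides: dict) -> dict:
    """Start from global priors, then replace any field the disease overrides."""
    merged = {field: dict(weights) for field, weights in global_priors.items()}
    for field, override in disease_overrides.items():
        merged[field] = dict(override["labels"])
    return merged


def is_normalized(weights: dict) -> bool:
    """Check that a label-weight mapping sums to 1."""
    return math.isclose(sum(weights.values()), 1.0, abs_tol=1e-9)


def sample_attributes(priors: dict, seed=None) -> dict:
    """Draw one label per field; a fixed seed makes the draw reproducible."""
    rng = random.Random(seed)
    sampled = {}
    for field in sorted(priors):  # fixed order keeps seeded draws stable
        labels = list(priors[field])
        weights = [priors[field][label] for label in labels]
        sampled[field] = rng.choices(labels, weights=weights, k=1)[0]
    return sampled


# Placeholder global priors (assumption, not the shipped values).
global_priors = {
    "biological_sex": {"Male": 0.5, "Female": 0.5},
    "communication_style": {"Reserved": 0.5, "Talkative": 0.5},
}
overrides = {"biological_sex": {"labels": {"Male": 0.45, "Female": 0.55}}}

priors = effective_priors(global_priors, overrides)
first = sample_attributes(priors, seed=0)
second = sample_attributes(priors, seed=0)
```

Here biological_sex takes the disease-specific weights while communication_style keeps the global ones, and both seeded draws return the same attribute vector.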

4. Add Examination References

Add one entry to data/resources/PatientZero/exam_references.json under diseases:

{
  "diseases": {
    "social_anxiety_disorder": {
      "selected_scales": ["GAD-7"],
      "exclusionary_findings": [
        "substance and medication review",
        "screen for panic symptoms when clinically relevant"
      ]
    }
  }
}

selected_scales must refer to scales already defined in scale_pool. If a new scale is needed, first add it to scale_pool with its score range and interpretation notes.
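A consistency check for this rule could look like the sketch below. It assumes scale_pool is a top-level mapping in exam_references.json next to diseases, and that each pool entry is keyed by scale name; the score_range field name, the example_disease entry, and the undefined_scales helper are hypothetical.

```python
def undefined_scales(exam_refs: dict) -> dict:
    """Map each disease key to the selected scales missing from scale_pool."""
    pool = set(exam_refs.get("scale_pool", {}))
    problems = {}
    for disease_key, entry in exam_refs.get("diseases", {}).items():
        unknown = [s for s in entry.get("selected_scales", []) if s not in pool]
        if unknown:
            problems[disease_key] = unknown
    return problems


exam_refs = {
    # GAD-7 totals range from 0 to 21; the field name is an assumption.
    "scale_pool": {"GAD-7": {"score_range": [0, 21]}},
    "diseases": {
        "social_anxiety_disorder": {"selected_scales": ["GAD-7"]},
        "example_disease": {"selected_scales": ["GAD-7", "Unlisted Scale"]},
    },
}
problems = undefined_scales(exam_refs)
```

An empty result means every selected scale is defined in the pool.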

5. Generate the Disease Case

Set generator.disease_key to the new key and run the CLI:

patienthub generate generator=patientZero \
    generator.disease_key=social_anxiety_disorder generator.random_seed=0

If the standardized outline is missing, Stage I writes it to resource_dir/disease_outlines.json. The final validated case is appended to the output bank (data/characters/patientZero.json by default).
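Appending to the output bank amounts to a read-modify-write of a JSON array. The sketch below illustrates that idea only; it is not the CLI's implementation, append_case is a hypothetical helper, and it does no file locking or atomic writes.

```python
import json
import tempfile
from pathlib import Path


def append_case(bank_path, case: dict) -> int:
    """Append one case to a JSON-array file and return the new case count."""
    path = Path(bank_path)
    cases = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    if not isinstance(cases, list):
        raise ValueError("output bank must be a JSON array")
    cases.append(case)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(cases, indent=2, ensure_ascii=False), encoding="utf-8")
    return len(cases)


# Use a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    bank = Path(tmp) / "characters" / "patientZero.json"
    count_first = append_case(bank, {"metadata": {"case_id": "social_anxiety_disorder_0"}})
    count_second = append_case(bank, {"metadata": {"case_id": "social_anxiety_disorder_1"}})
    stored = json.loads(bank.read_text(encoding="utf-8"))
```

The case_id values are placeholders patterned on the depression_0 example in the output format.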

Use Cases

Creating role-play-ready psychiatric patient profiles without using real patient records
Building disease-specific synthetic case banks with controlled demographic and severity distributions
Evaluating clinical interview agents with cases that include symptoms, mental status findings, scales, and risk assessment
