DD010: Validation Framework and Quantitative Benchmarks¶
- Status: Accepted
- Author: OpenWorm Core Team
- Date: 2026-02-14
- Supersedes: None
- Related: All other DDs (validation applies to all models), DD006 (Neuropeptides — Tier 2b unc-31 validation), DD021 (Movement Analysis Toolbox — Tier 3 validation tool), DD024 (Validation Data Acquisition Pipeline — data sourcing for all tiers)
Phase: Phase 1 | Layer: Validation
TL;DR¶
Every pull request is validated quantitatively at four levels: single cells, neural network dynamics, whole-animal locomotion, and causal perturbations. Tier 2 and Tier 3 failures block merging; Tier 1 and Tier 4 results are advisory. This DD defines the thresholds, data sources, and CI pipeline for each tier.
Quick Action Reference¶
| Question | Answer |
|---|---|
| Phase | Phase 1 |
| Layer | Validation — see Phase Roadmap |
| What does this produce? | Three-tier validation reports: Tier 1 (single-cell electrophysiology), Tier 2 (functional connectivity correlation), Tier 3 (behavioral kinematics via open-worm-analysis-toolbox — see DD021) |
| Success metric | Tier 2a: correlation-of-correlations r > 0.5 vs. Randi 2023; Tier 2b: neuropeptide contribution r > 0.3 (wt-vs-unc-31); Tier 3: 5 kinematic metrics within ±15% of Yemini et al. 2013 Schafer lab data |
| Repository | Validation scripts in openworm/OpenWorm meta-repo; Tier 3 tool: openworm/open-worm-analysis-toolbox (DD021) |
| Config toggle | validation.run_after_simulation: true, validation.tier2_functional_connectivity: true, validation.tier2_neuropeptide_unc31: true, validation.tier3_behavioral: true in openworm.yml |
| Build & test | docker compose run validate — runs all enabled tiers, produces output/validation_report.json |
| Visualize | Validation overlay in DD014 viewer: validation/overlay/ OME-Zarr group shows per-metric pass/fail |
| CI gate | Tier 2 blocks PR merge (r < 0.5 = fail); Tier 3 blocks merge to main (>15% deviation = fail) |
Context & Background¶
OpenWorm's core philosophy, articulated in Sarma et al. 2016 "Unit Testing, Model Validation, and Biological Simulation" (F1000Research), is that model validation is a form of testing. Just as software has unit tests, integration tests, and system tests, biological models must be validated at multiple levels:
- Single-cell level: Electrophysiology (voltage, conductance, kinetics)
- Circuit level: Functional connectivity (calcium correlations)
- Behavioral level: Movement kinematics, pumping, defecation
A simulation that produces movement but fails electrophysiology validation has passed the behavioral test but failed the mechanistic test. Both matter.
Goal & Success Criteria¶
Goal: Automated, multi-tier quantitative validation that runs on every PR and blocks merges when model quality degrades.
Success criteria:
- Tier 2a functional connectivity correlation r > 0.5 vs. Randi 2023 experimental data
- Tier 2b neuropeptide contribution correlation r > 0.3 (wild-type vs. unc-31)
- Tier 3 behavioral kinematics within ±15% of Yemini et al. 2013 Schafer lab data for all 5 metrics
- Tier 4 causal perturbation direction-of-effect matches ≥70% of published interventions
- All tiers run without manual intervention via docker compose run validate
Deliverables¶
| Artifact | Path | Status |
|---|---|---|
| Tier 1 expression-consistency checker | validation/tier1_expression_consistency.py | [TO BE CREATED] |
| Tier 1 single-cell electrophysiology comparator | validation/tier1_cell_electrophys.py | [TO BE CREATED] |
| Tier 2a functional connectivity validator | validation/tier2a_functional_connectivity.py | [TO BE CREATED] |
| Tier 2b unc-31 neuropeptide validator | validation/tier2b_neuropeptide_unc31.py | [TO BE CREATED] |
| Tier 3 behavioral kinematics validator | validation/tier3_behavioral.py | [TO BE CREATED] |
| Tier 4 perturbation battery runner | validation/tier4_perturbation_battery.py | [TO BE CREATED] |
| Acceptance criteria checker | validation/check_acceptance.py | [TO BE CREATED] |
| Reference datasets (baked into Docker) | /opt/openworm/validation/data/ | Partial |
| CI workflow | .github/workflows/validation.yml | [TO BE CREATED] |
| Validation config section | openworm.yml → validation: | Defined below |
Repository & Issues¶
- Primary repository: openworm/OpenWorm (meta-repo — validation scripts live here)
- Tier 3 toolbox: openworm/open-worm-analysis-toolbox (see DD021)
- Tier 2 data API: openworm/wormneuroatlas (Randi 2023 functional connectivity)
- Issue label: dd010, validation
- Milestone: Phase 1 — Validation CI Pipeline
- ClickUp task: 868hjdzqy (Validation L4 Maintainer)
How to Build & Test¶
Getting Started (Environment Setup)¶
The validation framework is cross-cutting — it spans multiple repositories depending on which validation tier you are working on.
All tiers — clone the meta-repo:
git clone https://github.com/openworm/OpenWorm.git # meta-repo with docker compose
cd OpenWorm
Per-tier additional repositories:
- Tier 1 (single-cell electrophysiology): Also needs c302 — see DD001 Getting Started for setup
- Tier 2 (circuit-level functional connectivity): Also needs c302 — see DD001 Getting Started for setup
- Tier 3 (behavioral kinematics): Also needs open-worm-analysis-toolbox — see DD021 Getting Started for setup
Path A — Docker (recommended):
# From the OpenWorm meta-repo
docker compose run validate # runs all tiers automatically
Path B — Native:
# Install dependencies in each relevant repo
pip install -e . # in each repo needed for the tier(s) you are working on
Step-by-step¶
# Run the full validation suite (all enabled tiers)
docker compose run validate
# Run a specific tier
docker compose run validate --tier 2a
# Check results
cat output/validation_report.json | python -m json.tool
# Quick smoke test (verify tools install and data is present)
docker compose run shell python -c "from open_worm_analysis_toolbox import NormalizedWorm; print('OK')"
docker compose run shell ls /opt/openworm/validation/data/
See the Integration Test subsection below for a full step-by-step verification procedure.
How to Visualize¶
Validation results are displayed through the DD014 viewer and standalone reports:
- Correlation matrix heatmaps: Tier 2a produces a simulated 302x302 functional connectivity matrix displayed alongside the Randi 2023 experimental matrix. Difference heatmap highlights neuron pairs with largest discrepancies.
- Locomotion parameter dashboards: Tier 3 produces a bar chart of the 5 kinematic metrics (speed, wavelength, frequency, amplitude, gait) with experimental mean ±15% tolerance bands.
- Per-neuron expression-consistency overlay: Tier 1 expression-consistency results are rendered as a pass/fail color map over the 3D worm body in the DD014 viewer (validation/overlay/ OME-Zarr group).
- Perturbation effect comparison: Tier 4 results are displayed as a table of predicted vs. observed effect directions with magnitude bars.
- CI summary badge: GitHub Actions produces a pass/fail badge linked to the full validation report JSON.
Technical Approach¶
Four-Tier Validation Hierarchy¶
| Tier | What Is Validated | Validation Data | Acceptance Criteria | Blocking? |
|---|---|---|---|---|
| Tier 1: Unit (Single Cell) | Membrane voltage, conductances, calcium dynamics | Goodman et al. 2002 patch-clamp, Randi et al. 2023 single-neuron Ca imaging | Quantitative match within 20% | No (warning) |
| Tier 2a: Integration (Circuit) | Functional connectivity, network dynamics | Randi et al. 2023 whole-brain pairwise correlations (wild-type) | Correlation coefficient > 0.5 vs. experimental | Yes (blocks merge) |
| Tier 2b: Integration (Neuropeptides) | Neuropeptide modulation effect on functional connectivity | Randi et al. 2023 wild-type vs. unc-31 mutant | Neuropeptide contribution correlation r > 0.3 | Yes (after DD006) |
| Tier 3: System (Behavior) | Movement kinematics, pumping, defecation | Yemini et al. 2013 (Schafer lab kinematics), Raizen & Avery 1994 (pharyngeal EPG), Thomas 1990 (defecation) | Statistical match via open-worm-analysis-toolbox | Yes (blocks merge) |
| Tier 4: Causal (Intervention) | Perturbation response: ablation, silencing, mutation | Published laser ablation, optogenetics, mutant phenotype data | Direction of effect matches ≥70%; magnitude within ±30% | No (advisory → blocking Phase 3+) |
Blocking: A PR that degrades Tier 2 or Tier 3 validation scores cannot be merged without explicit founder approval + justification.
Tier 1: Single-Cell Validation (Unit Tests)¶
For each neuron class with published electrophysiology:
Run the cell model in isolation (no synaptic inputs, no network effects) with standard voltage-clamp or current-clamp protocols. Compare:
| Property | Measurement | Typical Acceptance Range |
|---|---|---|
| Resting potential (V_rest) | No current injection | ± 5 mV |
| Input resistance (R_in) | Small current step | ± 30% |
| I-V curve | Voltage ramp | Pearson r > 0.8 |
| Spike threshold | Depolarizing current | ± 10 mV (if applicable) |
| Calcium influx | Depolarization-evoked | ± 40% (noisy measurement) |
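The acceptance ranges above mix absolute tolerances (mV) with relative ones (%). A minimal sketch of how a comparator might apply them is shown here. The names `TOLERANCES`, `check_property`, and `failed_properties` and all numeric values are hypothetical illustrations, not the project's implementation or measured data. The I-V curve criterion (Pearson r > 0.8) compares whole curves and is left out.

```python
# Hypothetical tolerance table mirroring the acceptance ranges above.
# ("abs", x): pass if |sim - exp| <= x, in the property's own units.
# ("rel", x): pass if |sim - exp| <= x * |exp|.
TOLERANCES = {
    "v_rest_mV": ("abs", 5.0),
    "r_in_MOhm": ("rel", 0.30),
    "spike_threshold_mV": ("abs", 10.0),
    "ca_influx": ("rel", 0.40),
}

def check_property(name, simulated, experimental):
    """Return True if one simulated property is within its acceptance range."""
    kind, tol = TOLERANCES[name]
    limit = tol if kind == "abs" else tol * abs(experimental)
    return abs(simulated - experimental) <= limit

def failed_properties(simulated, experimental):
    """List the properties present in both dicts that fall outside tolerance."""
    return [
        name for name in simulated
        if name in experimental and name in TOLERANCES
        and not check_property(name, simulated[name], experimental[name])
    ]

# Illustrative numbers only, not recordings from any neuron.
sim = {"v_rest_mV": -28.0, "r_in_MOhm": 900.0, "ca_influx": 1.6}
exp = {"v_rest_mV": -25.0, "r_in_MOhm": 1200.0, "ca_influx": 1.0}

failures = failed_properties(sim, exp)
needs_review = len(failures) > 2  # review rule: more than 2 failed parameters
print(failures, needs_review)
```

Keeping the tolerance kind next to each property avoids applying a percentage to a quantity such as resting potential, where a value near 0 mV would make a relative error meaningless.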
Primary Tier 1 datasets:
| Neuron(s) | Dataset | What It Provides | Validation Use |
|---|---|---|---|
| ALM, AVM, PLM (touch receptors) | Goodman et al. 2002, O'Hagan et al. 2005 | Whole-cell patch-clamp: resting potential, I-V curves, MEC-4/DEG-ENaC channel kinetics | Validate touch neuron resting potential, input resistance, mechanoreceptor current amplitude |
| ALM, AVM, PLM (touch receptors) | Suzuki et al. 2003 | In vivo calcium imaging during mechanical stimulation | Validate calcium transient amplitude and kinetics in response to touch |
| AWC (olfactory) | Chalasani et al. 2007 | Calcium imaging with odor presentation, TAX-2/TAX-4 channel characterization | Validate sensory transduction dynamics, OFF-response calcium kinetics |
| ASH (nociceptor) | Hilliard et al. 2005, WormsenseLab_ASH repo | Calcium imaging, OSM-9/TRPV channel characterization | Validate polymodal nociceptor response profile |
| AVA (command interneuron) | Lockery lab (Lindsay et al. 2011) | Whole-cell recordings, graded potential dynamics | Validate command interneuron I-V curve, graded (non-spiking) response |
| RIM (motor/modulatory) | Liu et al. 2018 | Calcium imaging + electrophysiology, EGL-19/UNC-2 channels | Validate motor neuron calcium dynamics, channel conductance ratios |
| Pharyngeal neurons (MC, M3) | Raizen & Avery 1994 | Electropharyngeogram (EPG): extracellular field potentials from pharyngeal muscles and neurons | Validate pharyngeal neuron firing patterns (Phase 3, DD007) |
Coverage: ~7 neuron classes have direct patch-clamp or detailed calcium imaging data suitable for Tier 1 spot-checks. An additional ~13 classes have partial recordings (single-channel data, calcium responses to specific stimuli) curated in the openworm/ChannelWorm ion channel database. See DD005 Calibration Dataset for the full training set.
For the ~121 neuron classes without direct electrophysiology: Tier 1 cannot compare to patch-clamp recordings that don't exist. Instead, we run expression-consistency checks — systematic tests that the model's electrical behavior is consistent with its CeNGEN (Taylor et al. 2021) ion channel expression profile. This catches gross errors (e.g., a model with large calcium currents in a neuron that doesn't express calcium channels) without requiring experimental recordings.
Expression-consistency check: gene → expected electrical property
For each neuron class, DD005 maps CeNGEN expression to conductance densities. The following table defines what each major channel gene predicts about the model's electrical behavior:
| CeNGEN Gene | Channel Type | If Highly Expressed (top quartile) | If Not Expressed (<1 TPM) | Model Check |
|---|---|---|---|---|
| egl-19 | Cav1 (L-type Ca²⁺) | Large sustained calcium current during depolarization; high resting [Ca²⁺] | No L-type calcium current | Inject +20mV step → measure I_Ca amplitude |
| unc-2 | Cav2 (P/Q-type Ca²⁺) | Large transient calcium current; fast synaptic release | No P/Q-type current | Voltage ramp → I-V curve shows Ca²⁺ peak |
| cca-1 | Cav3 (T-type Ca²⁺) | Low-threshold calcium spikes; rebound bursting after hyperpolarization | No rebound activity | Hyperpolarize → release → check for rebound depolarization |
| shl-1 | Kv4 (A-type K⁺) | Fast transient outward current; delays depolarization onset | No A-type current | Depolarize from -80mV → measure transient K⁺ peak |
| shk-1 | Kv1 (delayed rectifier K⁺) | Sustained outward current; limits depolarization duration | No sustained K⁺ current | Sustained depolarization → measure steady-state K⁺ current |
| unc-103 | ERG (Kv11; inwardly rectifying K⁺) | Inward current at hyperpolarized potentials; stabilizes resting potential | No inward rectification | I-V curve shows inward current below -80mV |
| twk-18 | TWIK (two-pore leak K⁺) | Low input resistance; hyperpolarized resting potential | High input resistance | Measure R_in and V_rest |
| osm-9 | TRPV (mechanosensory) | Mechanically-gated inward current (sensory neurons only) | No mechanosensory response | Only in ASH, AWA, etc. — check for presence/absence |
Systematic validation procedure:
- Rank channels by expression. For each neuron class, sort its ion channel genes by CeNGEN TPM (transcripts per million). The top 3 expressed channels define the neuron's expected "electrical fingerprint."
- Run the model. Simulate each neuron class in isolation with a standard voltage-clamp protocol (ramp from -100mV to +40mV, 200ms).
- Extract current contributions. Measure the peak current carried by each channel type in the model.
- Check rank-order consistency. The model's current ranking should match the expression ranking:
  - If CeNGEN says egl-19 >> shl-1 >> unc-2 for neuron X, then the model's L-type Ca²⁺ current should be larger than its A-type K⁺ current, which should be larger than its P/Q-type Ca²⁺ current.
  - Rank-order correlation (Spearman) between expression and model current magnitudes should be positive.
- Check qualitative predictions. Verify the binary checks from the table above:
  - Gene not expressed (<1 TPM) → corresponding current is absent (<1% of total)
  - Gene highly expressed (top quartile) → corresponding current is present and substantial (>10% of total)
Acceptance criteria (expression-consistency):
- Rank-order correlation: Spearman ρ > 0.5 between CeNGEN expression rank and model current rank, averaged across all 128 neuron classes
- Absence check: For genes with <1 TPM expression, the corresponding model current must be <1% of total current in ≥95% of cases
- Presence check: For genes in the top quartile of expression, the corresponding model current must be >10% of total current in ≥80% of cases
- Zero known violations: No neuron class should have its dominant current type contradicted by CeNGEN (e.g., a neuron dominated by L-type Ca²⁺ current that doesn't express egl-19)
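The rank-order, absence, and presence checks above can be sketched for a single neuron class as follows. This is an illustration under stated assumptions: `check_neuron`, `top_quartile_tpm`, and the example values are hypothetical, and Spearman's ρ is computed by hand (Pearson correlation of average ranks) so the block needs only the standard library.

```python
def average_ranks(values):
    """Rank values ascending (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rho is the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

def check_neuron(tpm, peak_current, top_quartile_tpm):
    """tpm and peak_current map channel gene -> value for one neuron class."""
    genes = sorted(tpm)
    total = sum(abs(peak_current[g]) for g in genes)
    frac = {g: abs(peak_current[g]) / total for g in genes}
    return {
        "rho": spearman([tpm[g] for g in genes], [frac[g] for g in genes]),
        # <1 TPM but the current is not negligible (>= 1% of total)
        "absence_violations": [g for g in genes if tpm[g] < 1.0 and frac[g] >= 0.01],
        # top-quartile expression but the current is not substantial (<= 10%)
        "presence_violations": [
            g for g in genes if tpm[g] >= top_quartile_tpm and frac[g] <= 0.10
        ],
    }

# Illustrative values only (not CeNGEN data or model output).
tpm = {"egl-19": 250.0, "shl-1": 80.0, "unc-2": 12.0, "cca-1": 0.2}
current_nA = {"egl-19": 0.60, "shl-1": 0.25, "unc-2": 0.10, "cca-1": 0.05}
result = check_neuron(tpm, current_nA, top_quartile_tpm=200.0)
print(result)
```

In this toy case the ranking agrees perfectly, yet the absence check still flags cca-1: a gene below 1 TPM carries about 5% of the total current. That is why the rank correlation and the binary checks are evaluated separately.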
Testing command:
# Run expression-consistency validation across all 128 neuron classes
python scripts/validate_expression_consistency.py \
--cell_models cells/*.cell.nml \
--cengen_expression data/CeNGEN_L4_expression.csv \
--gene_channel_map data/gene_to_channel_map.csv \
--output validation_report_tier1_consistency.json
# Output: per-neuron rank correlation, absence/presence checks, violations
The expression-consistency check is non-blocking because (a) the expression→conductance calibration (DD005) is approximate, (b) post-transcriptional regulation means mRNA ≠ protein ≠ membrane conductance, and (c) some channel genes have poorly characterized kinetics. It still catches the most common failure mode: a calibration error that gives a neuron the wrong dominant current type.
Example (AVA neuron validation — direct electrophysiology):
# Run isolated AVA model (one of ~7 neurons with patch-clamp data)
python c302/test_single_cell.py --cell AVACell --protocol voltage_clamp
# Compare to Lockery lab data (Lindsay et al. 2011)
python scripts/validate_single_cell_electrophys.py \
--simulated AVA_voltage_clamp.csv \
--experimental data/electrophysiology/AVA_lockery_vclamp.csv \
--output validation_report_AVA.html
Outcome: For neurons with electrophysiology: parameter-by-parameter comparison report. If >2 parameters fail (exceed acceptance range), flag for review. For all 128 neurons: expression-consistency report with rank correlations and violation flags.
Tier 2: Circuit-Level Validation (Integration Tests)¶
Primary target: Randi et al. 2023 functional connectivity matrix (pairwise calcium signal correlations for all 302 neurons during spontaneous activity). Available via wormneuroatlas API and also integrated into the ConnectomeToolbox (cect package) as one of five connectivity modalities (anatomical, contactome, neurotransmitter, extrasynaptic, functional).
Tier 2a: Whole-Network Functional Connectivity¶
Validation procedure:
- Run c302 simulation for 60 seconds (spontaneous activity, no stimulus)
- Extract calcium time series for all neurons
- Compute pairwise Pearson correlations → 302×302 matrix
- Compare to Randi et al. experimental 302×302 matrix
- Metric: Correlation of correlations (Pearson r between simulated and experimental matrices, flattened to vectors)
Acceptance criterion:
- r > 0.5 between simulated and experimental functional connectivity
- At least 70% of neuron pairs have correlation sign agreement (both positive, both negative, or both near-zero)
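A minimal sketch of the correlation-of-correlations metric and the sign-agreement check, using a 4-neuron toy matrix in place of the 302×302 data. Several choices here are assumptions rather than project decisions: only the strict upper triangle is compared (each pair once, no diagonal), pairs that are NaN in either matrix are skipped, and `near_zero=0.05` is a hypothetical cutoff for "near-zero".

```python
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def upper_triangle(matrix):
    """Flatten the strict upper triangle: each unordered pair exactly once."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def sign_class(value, near_zero):
    if abs(value) < near_zero:
        return 0
    return 1 if value > 0 else -1

def compare_functional_connectivity(sim, exp, near_zero=0.05):
    # Keep only pairs observed in both matrices.
    pairs = [
        (s, e) for s, e in zip(upper_triangle(sim), upper_triangle(exp))
        if not (math.isnan(s) or math.isnan(e))
    ]
    r = pearson([s for s, _ in pairs], [e for _, e in pairs])
    agreement = sum(
        sign_class(s, near_zero) == sign_class(e, near_zero) for s, e in pairs
    ) / len(pairs)
    return {"r": r, "sign_agreement": agreement,
            "passed": r > 0.5 and agreement >= 0.70}

nan = float("nan")
# Toy matrices, illustrative values only.
sim_fc = [[1.0, 0.8, -0.3, 0.0],
          [0.8, 1.0, 0.1, 0.4],
          [-0.3, 0.1, 1.0, -0.6],
          [0.0, 0.4, -0.6, 1.0]]
exp_fc = [[1.0, 0.7, -0.2, nan],
          [0.7, 1.0, 0.02, 0.5],
          [-0.2, 0.02, 1.0, -0.4],
          [nan, 0.5, -0.4, 1.0]]
result = compare_functional_connectivity(sim_fc, exp_fc)
print(result)
```

Excluding the diagonal matters: every self-correlation is 1.0 in both matrices, so including it would inflate r without saying anything about the model.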
Testing command:
# Run functional connectivity validation
python scripts/validate_functional_connectivity.py \
--model c302_C1_Differentiated \
--duration 60 \
--experimental_data data/randi2023_functional_connectivity.npy \
--output func_conn_validation.json
# Check if acceptance criteria pass
python scripts/check_validation_criteria.py func_conn_validation.json
Blocking: If this test fails (r < 0.5), the PR cannot merge to main.
PCA structure validation (additional Tier 2 metric): Beyond pairwise correlation matching, the low-dimensional dynamical structure of the neural network should be validated. Kato et al. (2015) showed that PCA of whole-brain calcium activity reveals a dominant mode (PC1) that separates forward-locomotion neurons (AVB, PVC, VB, DB classes) from backward-locomotion neurons (AVA, AVD, VA, DA classes). After synaptic weight optimization (DD001), simulated membrane potential time series should reproduce this PC1 separation. Zhao et al. (2024) demonstrated this validation approach on a 136-neuron circuit; OpenWorm will apply it to the full 302-neuron network.
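One way the PC1 separation could be checked is sketched below on synthetic traces. Everything here is an assumption for illustration: the function names, the use of power iteration on the covariance matrix (chosen only to stay within the standard library; a real pipeline would more likely use a numerical PCA routine), and the four class labels driven by a single sine wave.

```python
import math

def first_pc_loadings(traces, iterations=100):
    """PC1 loadings for {neuron: [samples]} via power iteration on the covariance."""
    names = sorted(traces)
    n_samples = len(traces[names[0]])
    centered = []
    for name in names:
        mean = sum(traces[name]) / n_samples
        centered.append([v - mean for v in traces[name]])
    cov = [[sum(a * b for a, b in zip(ri, rj)) / (n_samples - 1) for rj in centered]
           for ri in centered]
    vec = [1.0 / (i + 1) for i in range(len(names))]  # fixed, asymmetric start
    for _ in range(iterations):
        vec = [sum(c * v for c, v in zip(row, vec)) for row in cov]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return dict(zip(names, vec))

def pc1_separates(loadings, forward, backward):
    """True if each group has a single loading sign and the two signs differ."""
    f = [loadings[n] for n in forward]
    b = [loadings[n] for n in backward]

    def same_sign(vals):
        return all(v > 0 for v in vals) or all(v < 0 for v in vals)

    return same_sign(f) and same_sign(b) and (f[0] > 0) != (b[0] > 0)

# Synthetic membrane potentials (mV): forward classes follow the drive,
# backward classes follow its inverse. Not simulation output.
drive = [math.sin(2 * math.pi * t / 50) for t in range(100)]
gains = {"AVB": 10.0, "VB": 8.0, "AVA": -9.0, "DA": -7.0}
traces = {name: [-40.0 + g * d for d in drive] for name, g in gains.items()}

loadings = first_pc_loadings(traces)
print(pc1_separates(loadings, forward=["AVB", "VB"], backward=["AVA", "DA"]))
```

The overall sign of a principal component is arbitrary, so the check compares the two groups against each other rather than requiring forward neurons to load positively.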
Tier 2b: Neuropeptide Modulation Validation (unc-31 Natural Experiment)¶
Purpose: Validate that DD006 (neuropeptide modulation) produces the correct effect on functional connectivity.
The natural experiment: UNC-31 is the CAPS protein required for dense-core vesicle fusion — the mechanism by which neuropeptides are released. The unc-31 mutant has intact synaptic transmission but no neuropeptide signaling. Randi et al. 2023 measured functional connectivity for both wild-type and unc-31 mutant strains. The difference between the two matrices isolates the neuropeptide contribution to neural dynamics.
Validation procedure:
| | Experimental (Randi 2023) | Simulated |
|---|---|---|
| With neuropeptides | fc_wt (wild-type) | sim_fc_on (DD006 enabled) |
| Without neuropeptides | fc_unc31 (unc-31 mutant) | sim_fc_off (DD006 disabled) |
| Neuropeptide contribution | fc_diff_exp = fc_wt - fc_unc31 | sim_fc_diff = sim_fc_on - sim_fc_off |
Acceptance criterion:
- Correlation between fc_diff_exp and sim_fc_diff (flattened): r > 0.3
- This is a weaker threshold than Tier 2a (r > 0.5) because the difference signal is smaller and noisier than the absolute functional connectivity
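The difference-of-matrices comparison can be sketched with 3-neuron toy matrices. The variable names follow the table above; the helper functions, the choice to flatten only the strict upper triangle, and all values are assumptions for illustration.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def fc_difference(fc_on, fc_off):
    """Element-wise difference of two equally sized square matrices."""
    return [[a - b for a, b in zip(row_on, row_off)]
            for row_on, row_off in zip(fc_on, fc_off)]

def flatten_pairs(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def neuropeptide_contribution_r(fc_wt, fc_unc31, sim_fc_on, sim_fc_off):
    fc_diff_exp = fc_difference(fc_wt, fc_unc31)      # experimental contribution
    sim_fc_diff = fc_difference(sim_fc_on, sim_fc_off)  # simulated contribution
    return pearson(flatten_pairs(fc_diff_exp), flatten_pairs(sim_fc_diff))

# Toy matrices, illustrative values only.
fc_wt = [[1.0, 0.6, 0.2], [0.6, 1.0, -0.3], [0.2, -0.3, 1.0]]
fc_unc31 = [[1.0, 0.4, 0.2], [0.4, 1.0, -0.1], [0.2, -0.1, 1.0]]
sim_fc_on = [[1.0, 0.5, 0.1], [0.5, 1.0, -0.2], [0.1, -0.2, 1.0]]
sim_fc_off = [[1.0, 0.4, 0.05], [0.4, 1.0, -0.1], [0.05, -0.1, 1.0]]

r = neuropeptide_contribution_r(fc_wt, fc_unc31, sim_fc_on, sim_fc_off)
passed = r > 0.3
print(round(r, 3), passed)
```

Because the inputs are differences of correlations, measurement noise from both matrices accumulates in each entry, which is consistent with the lower threshold stated above.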
Testing command:
# Run unc-31 validation (requires two simulation runs)
python scripts/validate_neuropeptide_fc.py \
--model_with_neuropeptides c302_C1_DD006_enabled \
--model_without_neuropeptides c302_C1_DD006_disabled \
--duration 60 \
--output neuropeptide_fc_validation.json
# Check acceptance
python scripts/check_validation_criteria.py neuropeptide_fc_validation.json
Data access:
from wormneuroatlas import NeuroAtlas
atlas = NeuroAtlas()
fc_wt = atlas.get_signal_propagation_atlas(strain="wt")
fc_unc31 = atlas.get_signal_propagation_atlas(strain="unc31")
fc_diff_exp = fc_wt - fc_unc31 # Neuropeptide contribution (experimental)
Blocking: This sub-test becomes blocking after DD006 is implemented (Phase 2). Before DD006, it is informational only.
Cross-reference: See DD006 §Validation for the full neuropeptide validation methodology, which uses this same unc-31 comparison as its Tier 1 functional connectivity validation.
Tier 3: Behavioral Validation (System Tests)¶
Primary tool: open-worm-analysis-toolbox (see DD021 for toolbox revival plan, WCON format specification, API contract, and version pinning) — compares simulated movement trajectories to Yemini et al. 2013 Schafer lab experimental data in WCON format.
Validated metrics:
- Speed: Mean forward velocity (µm/s)
- Wavelength: Body bend wavelength (µm)
- Frequency: Undulation frequency (Hz)
- Amplitude: Body bend amplitude (degrees)
- Crawl/swim classification: Behavioral mode based on gait
Acceptance criteria:
- All 5 metrics within ±15% of experimental mean
- Movement trajectory visually resembles real worm (qualitative check)
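A sketch of the ±15% check for the quantitative metrics follows. The metric keys and numbers are hypothetical and are not taken from Yemini et al. 2013. Crawl/swim classification is categorical, so this sketch assumes it is compared by equality rather than by percentage deviation.

```python
def check_kinematics(simulated, experimental, tolerance=0.15):
    """Compare each metric with the experimental mean; return (report, all_passed)."""
    report = {}
    for metric, expected in experimental.items():
        observed = simulated[metric]
        if isinstance(expected, str):
            # Categorical metric (e.g. gait): must match exactly.
            report[metric] = {"deviation": None, "passed": observed == expected}
        else:
            deviation = abs(observed - expected) / abs(expected)
            report[metric] = {"deviation": deviation,
                              "passed": deviation <= tolerance}
    return report, all(entry["passed"] for entry in report.values())

# Illustrative values only.
experimental = {"speed_um_s": 200.0, "wavelength_um": 500.0,
                "frequency_hz": 0.50, "amplitude_deg": 40.0, "gait": "crawl"}
simulated = {"speed_um_s": 215.0, "wavelength_um": 520.0,
             "frequency_hz": 0.41, "amplitude_deg": 42.0, "gait": "crawl"}

report, all_passed = check_kinematics(simulated, experimental)
for metric, entry in report.items():
    print(metric, entry)
print("all passed:", all_passed)
```

Here four metrics pass but undulation frequency deviates by 18%, so the suite as a whole fails.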
Testing command:
# Run behavioral validation suite
cd open-worm-analysis-toolbox/
python validate_movement.py \
--simulated ../c302/output/worm_trajectory.wcon \
--experimental data/schafer_baseline_N2.wcon \
--output validation_report.json
# Check pass/fail
python check_acceptance.py validation_report.json --tolerance 0.15
Additional behavioral tests:
- Pharyngeal pumping: 3-4 Hz (DD007)
- Defecation cycle: 50 ± 10 seconds period (DD009)
- Reversal initiation: Response to aversive stimulus (<1 second latency)
Blocking: If any of the 5 validated metrics deviates from the experimental mean by more than 15%, the PR is blocked.
Tier 4: Causal / Interventional Validation (Non-Blocking, Advisory)¶
Rationale: Tiers 1-3 validate against observational data — recordings from intact, unperturbed animals. However, a model that reproduces normal behavior may do so for the wrong mechanistic reasons (parameter compensation, degenerate solutions). As Pearl & Mackenzie (2018) argue in their framework for causal inference, observational data alone cannot distinguish correlation from causation. To establish that the model captures true causal relationships between neurons, we need to validate against interventional data — experiments where specific neurons are ablated, silenced, or activated, and the resulting changes in neural activity and behavior are measured.
Validation data sources (published):
| Intervention | Organism Response | Data Source |
|---|---|---|
| Touch neuron ablation (ALM, AVM, PLM) | Loss of gentle touch response | Chalfie et al. (1985), J Neurosci 5:956-964 |
| Pharyngeal neuron laser killing | Pumping continues (semi-autonomous organ) | Avery & Horvitz (1989), Neuron 3:473-485 |
| Optogenetic activation of specific neurons | Stimulus-specific behavioral responses | Leifer et al. (2011), Nat Methods 8:147-152 |
| unc-2 (Cav2) loss of function | Reduced locomotion speed | Schafer lab WCON mutant data |
| egl-1, unc-103 loss of function | Egg-laying phenotypes | Trent et al. (1983), Collins & Koelle (2013) |
| flp peptide knockouts | Altered locomotion patterns | Rogers et al. (2003), Li et al. (1999) |
Validation procedure: Simulate the specific perturbation (zero out a neuron's output, remove a channel type, delete a peptide gene) and compare the resulting behavioral change to published experimental data. The model should predict the direction of the effect (faster/slower, more/fewer reversals) and ideally the magnitude within 30%.
Acceptance criteria:
- Direction of effect matches experimental observation for ≥70% of tested perturbations
- Magnitude within ±30% for well-characterized perturbations (e.g., touch neuron ablation latency, unc-2 speed reduction)
- Model does not predict catastrophic failure (NaN, divergence) for perturbations that produce viable animals in vivo
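The three criteria can be scored together, as in the sketch below. It assumes each perturbation is summarized as a signed fractional change in one behavioral measure. The record layout (`sim_change`, `exp_change`, `well_characterized`) and the numbers are hypothetical, not published effect sizes.

```python
import math

def evaluate_perturbations(results, min_direction_match=0.70, magnitude_tol=0.30):
    direction_hits, magnitude_failures, diverged = 0, [], []
    for r in results:
        sim, exp = r["sim_change"], r["exp_change"]
        if not math.isfinite(sim):
            diverged.append(r["name"])  # NaN or inf: the simulation blew up
            continue
        if (sim > 0) == (exp > 0):
            direction_hits += 1
        if r.get("well_characterized") and abs(sim - exp) > magnitude_tol * abs(exp):
            magnitude_failures.append(r["name"])
    match_fraction = direction_hits / len(results)
    return {
        "match_fraction": match_fraction,
        "direction_ok": match_fraction >= min_direction_match,
        "magnitude_failures": magnitude_failures,
        "diverged": diverged,
    }

# Illustrative effect sizes only.
battery = [
    {"name": "unc-2 null speed", "sim_change": -0.35, "exp_change": -0.40,
     "well_characterized": True},
    {"name": "touch neuron ablation", "sim_change": -0.50, "exp_change": -0.90,
     "well_characterized": True},
    {"name": "flp knockout", "sim_change": -0.10, "exp_change": 0.20},
    {"name": "AVA ablation", "sim_change": -0.70, "exp_change": -0.80},
]
summary = evaluate_perturbations(battery)
print(summary)
```

A diverged run counts against the direction score in this sketch, since it never registers a hit; whether divergence should instead fail the tier outright is a policy choice the criteria above leave open.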
Status: Non-blocking (advisory) in Phase 1-2. Becomes blocking in Phase 3+ as more subsystems come online and the model makes increasingly specific causal predictions.
Note: A growing body of whole-brain perturbation data is being collected by multiple labs using optogenetic stimulation paired with whole-brain imaging across thousands of animals (Randi et al. 2023; Haspel et al. 2023). As these datasets become publicly available, they will provide increasingly powerful Tier 4 validation targets. See DD024 (Validation Data Acquisition) for the data sourcing roadmap.
Standard In-Silico Perturbation Battery¶
Zhao et al. (2024) demonstrated several informative in-silico perturbation experiments that reveal how network structure shapes dynamics and behavior. OpenWorm should formalize these as a standard perturbation battery that every model version is tested against:
| Perturbation | Expected Effect | Experimental Basis |
|---|---|---|
| Remove all gap junctions | Greater disruption to correlation matrix than removing chemical synapses | Zhao et al. 2024 Fig. 10D-G; Randi et al. 2023 unc-7 data |
| Remove neurites (soma-only model) | Higher body twisting, degraded forward locomotion | Zhao et al. 2024 Fig. 10B |
| Shuffle synapse locations on neurites | Faster head/tail oscillation, slower forward speed | Zhao et al. 2024 Fig. 10C |
| Ablate AVA bilaterally | Loss of backward locomotion command | Chalfie et al. 1985 |
| Silence all B-class motor neurons | Loss of forward locomotion | Zheng et al. 1999 |
| Block Cav2 (unc-2 null) | Reduced locomotion speed | Schafer lab mutant data |
These perturbation experiments serve dual purposes: (a) validation that the model responds correctly to interventions, and (b) scientific discovery — any unexpected model response identifies a gap in understanding.
Statistical Grounding for Acceptance Thresholds¶
The ±15% tolerance used in Tier 3 is grounded in measured inter-animal variability. Yemini et al. (2013) compiled a database of C. elegans behavioral phenotypes from thousands of tracked animals and found that wild-type (N2) locomotion metrics typically exhibit coefficients of variation (CV) in the range of 15-25% for speed, body bend amplitude, and wavelength. A model that matches the experimental mean within one CV is performing within the biological noise floor — tighter matching would be overfitting to a specific animal rather than capturing the population behavior.
For Tier 2 (functional connectivity), the r > 0.5 threshold reflects the observation that calcium correlation matrices from independent recording sessions of the same genotype show inter-session correlations in the range of r = 0.6-0.8 (Randi et al. 2023). A model achieving r > 0.5 is thus approaching the reproducibility ceiling of the experimental data itself.
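To make the coefficient-of-variation argument for the Tier 3 tolerance concrete, the sketch below computes a CV from a handful of made-up speed values (not Yemini et al. data) and compares it with the ±15% tolerance.

```python
import statistics

def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical per-animal forward speeds in µm/s, for illustration only.
speeds = [150.0, 180.0, 200.0, 220.0, 250.0]

cv = coefficient_of_variation(speeds)
tolerance = 0.15
within_noise_floor = tolerance <= cv  # tolerance no wider than animal-to-animal spread

print(f"CV = {cv:.3f}, tolerance = {tolerance:.2f}, within noise floor: {within_noise_floor}")
```

With a mean of 200 µm/s and a sample standard deviation of about 38 µm/s, the CV is about 19%, inside the 15-25% range quoted above, so a ±15% band sits within one CV of the mean.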
Behavioral Quantification Methods¶
Tiers 3 and 4 require robust, unbiased behavioral quantification. The field of computational neuroethology has developed systematic approaches to this challenge:
- Unsupervised behavioral decomposition (Berman et al. 2014) identifies stereotyped behavioral motifs from continuous recordings without pre-defined categories, enabling discovery of behavioral states that the model should reproduce
- Deep learning-based pose estimation (Pereira et al. 2022, SLEAP) provides sub-pixel body posture tracking that can extract kinematic features more precisely than centroid-only approaches
- Computational neuroethology frameworks (Datta et al. 2019) advocate for treating behavior as a high-dimensional continuous signal rather than a set of discrete categories, which aligns with how our simulation outputs movement data
As the validation toolbox (DD021) is revived, it should incorporate or interface with these modern approaches rather than relying solely on the classic 5-metric kinematic comparison.
Alternatives Considered¶
1. No Quantitative Validation (Visual Inspection Only)¶
Rejected: "It looks right" is subjective. Quantitative metrics enable regression detection and objective comparison between models.
2. Single Validation Level (Behavior Only)¶
Rejected: A model can produce correct movement for the wrong reasons (parameter compensation). Multi-level validation (electrophysiology + connectivity + behavior) ensures mechanistic correctness.
3. Strict Thresholds (Must Match Exactly)¶
Rejected: Biological data have measurement noise and animal-to-animal variability. ±15-20% tolerance accounts for this. Exact matches are neither achievable nor necessary.
4. Single-Metric Validation (One Number to Rule Them All)¶
Rejected: A single aggregate score (e.g., overall behavioral similarity) allows degenerate solutions — a model can score well by excelling on one metric while failing others. Multi-metric validation across all four tiers prevents this.
5. Threshold-Free Qualitative Assessment¶
Rejected: Without explicit numeric thresholds, "good enough" becomes subjective and shifts over time. Quantitative acceptance criteria (r > 0.5, ±15%, etc.) make pass/fail decisions objective, reproducible, and enforceable in CI.
Quality Criteria¶
- Automated Test Suite: All validation tests must be runnable via CI/CD without manual intervention.
- Regression Detection: Every PR that modifies cell models, connectome, or physics must run the validation suite. Report before/after comparison.
- Versioned Experimental Data: Validation datasets must be versioned and archived (e.g., data/randi2023_v1.0/). Do not overwrite.
- Pass/Fail Criteria Documented: Each test must have explicit acceptance criteria (e.g., "r > 0.5," "period = 50 ± 10 s") in the test script, not tribal knowledge.
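The before/after comparison required for regression detection might look like the sketch below. The metric names and the `higher_is_better` map are hypothetical; the only point illustrated is that a PR can regress on a metric while still passing its absolute threshold.

```python
def regression_report(baseline, candidate, higher_is_better):
    """Compare metrics from the base branch (baseline) and the PR (candidate)."""
    report = {}
    for metric, before in baseline.items():
        after = candidate[metric]
        delta = after - before
        regressed = delta < 0 if higher_is_better[metric] else delta > 0
        report[metric] = {"before": before, "after": after,
                          "delta": delta, "regressed": regressed}
    return report

# Illustrative scores only.
baseline = {"tier2a_r": 0.58, "tier3_max_deviation": 0.09}
candidate = {"tier2a_r": 0.61, "tier3_max_deviation": 0.12}
direction = {"tier2a_r": True, "tier3_max_deviation": False}

diff = regression_report(baseline, candidate, direction)
for metric, entry in diff.items():
    print(metric, entry)
```

In this example the Tier 3 deviation worsens from 9% to 12%. It remains under the 15% gate, but the report surfaces the drift for reviewers.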
Integration Contract¶
Inputs (What This Subsystem Consumes)¶
| Input | Source DD | Variable | Format | Units |
|---|---|---|---|---|
| Neuron calcium time series | DD001 | Per-neuron [Ca²⁺] over time | Tab-separated *_calcium.dat | mol/cm³ |
| Single-cell electrophysiology | DD001 | V, I_Ca, I_K per cell | Tab-separated from NEURON | mV, nA |
| Movement trajectory | DD003 | Body centroid + posture over time | WCON file | µm, frames |
| Pharyngeal pumping state | DD007 | Per-section contraction time series | Tab-separated | binary or [0,1] |
| Defecation motor program | DD009 | pBoc/aBoc/Exp timestamps | Event log | ms |
| Experimental data (electrophysiology) | DD008 / published papers | Patch-clamp recordings | CSV | mV, nA |
| Experimental data (functional connectivity, wild-type) | wormneuroatlas API / Randi 2023 | 302×302 correlation matrix | NumPy .npy | dimensionless |
| Experimental data (functional connectivity, unc-31) | wormneuroatlas API / Randi 2023 | 302×302 correlation matrix (no neuropeptide release) | NumPy .npy | dimensionless |
| Neuropeptide-on/off simulation outputs | DD006 | Per-neuron [Ca²⁺] with DD006 enabled vs. disabled | Tab-separated *_calcium.dat | mol/cm³ |
| Experimental data (kinematics) | DD008 / Yemini et al. 2013 (Schafer lab) | Movement trajectories | WCON | µm |
| Experimental data (defecation) | DD008 / Thomas 1990 | Defecation cycle periods | CSV | seconds |
| Experimental data (pumping) | DD008 / Raizen 1994 | EPG recordings | CSV | mV |
Outputs (What This Subsystem Produces)¶
| Output | Consumer DD | Variable | Format | Units |
|---|---|---|---|---|
| Tier 1 validation report | DD012 (PR review) | Per-cell pass/fail + metrics | JSON | mixed |
| Tier 2 validation report | DD012 (PR review), DD013 (CI gate) | Correlation-of-correlations score | JSON | dimensionless (r value) |
| Tier 3 validation report | DD012 (PR review), DD013 (CI gate) | Per-metric pass/fail (speed, wavelength, frequency, amplitude, gait) | JSON | mixed |
| Regression alert | DD013 (CI pipeline) | Pass/fail + diff from baseline | JSON + exit code | boolean |
| Validation dashboard | Mad-Worm-Scientist (daily digest) | Summary metrics for all tiers | JSON | mixed |
| Validation overlay data (for viewer) | DD014 (visualization) | Per-metric pass/fail + experimental comparison traces | OME-Zarr: validation/overlay/ (tier results + reference data) | mixed |
CI/CD Ownership Split (DD010 vs. DD013)¶
DD010 defines WHAT to validate. DD013 defines HOW to run it in Docker and CI.
| Responsibility | Owned By |
|---|---|
| Validation metrics, acceptance criteria, test scripts | DD010 |
| Docker compose services (quick-test, validate) | DD013 |
| CI/CD pipeline (GitHub Actions workflow) | DD013 |
| Validation data packaging in Docker image | DD010 + DD013 (shared) |
| Pass/fail decision logic (blocking PRs) | DD010 (criteria) + DD013 (enforcement) |
Reconciliation: The docker compose run validate service (DD013) runs the validation scripts defined by DD010. The scripts produce JSON reports. DD013's CI pipeline reads those reports and applies DD010's acceptance criteria to determine pass/fail.
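A minimal sketch of that hand-off, under stated assumptions: this DD fixes the report format as JSON, but the layout used here (one top-level key per tier with a boolean `passed` field) and the `BLOCKING_TIERS` set are hypothetical and only illustrate the split between criteria and enforcement.

```python
import json

# Assumption: Tier 2 and Tier 3 are the blocking tiers; Tier 1 is reported
# but does not block.
BLOCKING_TIERS = {"tier2", "tier3"}


def gate_exit_code(report):
    """Return 0 if every blocking tier present in the report passed, else 1.

    Tiers absent from the report are treated as not run (skipped).
    """
    failed = [
        tier
        for tier in sorted(BLOCKING_TIERS)
        if tier in report and not report[tier].get("passed", False)
    ]
    for tier in failed:
        print(f"BLOCKING FAILURE: {tier}")
    return 1 if failed else 0


# Hypothetical contents of output/validation_report.json
report = json.loads(
    '{"tier1": {"passed": false},'
    ' "tier2": {"passed": true, "r": 0.57},'
    ' "tier3": {"passed": false}}'
)
print("exit code:", gate_exit_code(report))
```

In a CI step the return value would be passed to `sys.exit`, so a failing blocking tier fails the job while a failing Tier 1 result is only reported.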
Configuration (openworm.yml Section)¶
validation:
  run_after_simulation: false          # Set true for CI; false for interactive use
  tier1_electrophysiology: false       # Single-cell validation (requires specific cell models)
  tier2_functional_connectivity: false # Tier 2a: Circuit-level (requires 60s sim)
  tier2_neuropeptide_unc31: false      # Tier 2b: unc-31 comparison (requires 2×60s sim, DD006)
  tier3_behavioral: false              # Movement kinematics (requires ~5s sim)
  tier3_pumping: false                 # Pharyngeal pumping (requires pharynx.enabled + ~5s sim)
  tier3_defecation: false              # Defecation cycle (requires intestine.enabled + ~200s sim)
  acceptance_criteria:
    tier2_min_correlation: 0.5         # Minimum r for functional connectivity
    tier3_max_deviation: 0.15          # Maximum deviation from experimental mean (±15%)
    tier3_pumping_range: [3.0, 4.0]    # Hz
    tier3_defecation_range: [40, 60]   # seconds
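To make the thresholds concrete, here is a hedged sketch of how these values might be applied. The dictionary mirrors the acceptance_criteria keys above; the helper names and the example measurements are illustrative assumptions, not the project's validation scripts.

```python
CRITERIA = {
    "tier2_min_correlation": 0.5,
    "tier3_max_deviation": 0.15,
    "tier3_pumping_range": [3.0, 4.0],   # Hz
    "tier3_defecation_range": [40, 60],  # seconds
}


def within_deviation(simulated, experimental_mean, max_deviation):
    """True if simulated lies within a fractional deviation of the mean."""
    return abs(simulated - experimental_mean) <= max_deviation * abs(experimental_mean)


def within_range(value, bounds):
    """True if value lies inside the closed interval [low, high]."""
    low, high = bounds
    return low <= value <= high


# Hypothetical measurements from one simulation run
checks = {
    "tier2_correlation": 0.57 >= CRITERIA["tier2_min_correlation"],
    "tier3_speed": within_deviation(0.20, 0.22, CRITERIA["tier3_max_deviation"]),
    "tier3_pumping": within_range(3.4, CRITERIA["tier3_pumping_range"]),
    "tier3_defecation": within_range(65.0, CRITERIA["tier3_defecation_range"]),
}
for name, ok in checks.items():
    print(f"{name}: {'pass' if ok else 'fail'}")
```

Whether a value exactly at a threshold passes is a choice this sketch makes (inclusive bounds); the criteria above do not specify it.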
Default vs. CI configuration:
# configs/validation_full.yml (used by CI)
validation:
  run_after_simulation: true
  tier2_functional_connectivity: true
  tier3_behavioral: true
Docker Build¶
- Repository: openworm/open-worm-analysis-toolbox (movement validation, see DD021) + openworm/tracker-commons (WCON spec, see DD021) + validation scripts in openworm/OpenWorm meta-repo
- Docker stage: validation in multi-stage Dockerfile
- versions.lock keys: open_worm_analysis_toolbox, tracker_commons (both managed per DD021)
- Build dependencies: pip install open-worm-analysis-toolbox + validation data files
Validation Data Location¶
All validation datasets are baked into the Docker image at build time (not downloaded at runtime):
/opt/openworm/validation/data/
├── electrophysiology/
│ ├── goodman1998_touch_neurons.csv
│ ├── lockery_AVA_recordings.csv
│ └── README.md (sources, DOIs, licenses)
├── functional_connectivity/
│ ├── randi2023_wt_matrix.npy
│ ├── randi2023_unc31_matrix.npy
│ ├── randi2023_metadata.json
│ └── README.md
├── kinematics/
│ ├── schafer_N2_baseline.wcon
│ ├── schafer_unc2_mutant.wcon
│ └── README.md
└── behavioral/
  ├── thomas1990_defecation.csv
  ├── raizen1994_pumping_EPG.csv
  └── README.md
Licensing requirement: All validation data must be openly accessible (CC-BY or equivalent). Each directory includes a README with source DOIs and licenses.
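The layout and README requirement lend themselves to an automated check at image build time. The following is a sketch only: the function name `missing_validation_data` is hypothetical, and the demonstration runs against a temporary directory rather than the real /opt/openworm/validation/data/ path.

```python
from pathlib import Path
import tempfile

REQUIRED_DIRS = (
    "electrophysiology",
    "functional_connectivity",
    "kinematics",
    "behavioral",
)


def missing_validation_data(root):
    """List required directories, or their README.md files, that are absent."""
    root = Path(root)
    problems = []
    for name in REQUIRED_DIRS:
        directory = root / name
        if not directory.is_dir():
            problems.append(f"missing directory: {name}/")
        elif not (directory / "README.md").is_file():
            problems.append(f"missing README: {name}/README.md")
    return problems


# Demonstration on a throwaway tree that lacks the behavioral/ directory
with tempfile.TemporaryDirectory() as tmp:
    for name in REQUIRED_DIRS[:3]:
        (Path(tmp) / name).mkdir()
        (Path(tmp) / name / "README.md").write_text("sources, DOIs, licenses\n")
    print(missing_validation_data(tmp))
```

An empty list means the four directories and their READMEs are present; it does not verify the license text itself.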
Code Reuse: wormneuroatlas and ConnectomeToolbox for Tier 2 Validation¶
Two existing OpenWorm packages provide all the experimental data needed for Tier 2 validation — no manual data extraction required:
1. wormneuroatlas — Functional connectivity matrices from Randi 2023
- Repository: openworm/wormneuroatlas (pushed 2025-10-22, maintained)
- Installation: pip install wormneuroatlas
from wormneuroatlas import NeuroAtlas
atlas = NeuroAtlas()
# Tier 2a: Wild-type functional connectivity
fc_wt = atlas.get_signal_propagation_atlas(strain="wt")
# Returns: 302×302 correlation matrix (exactly what Tier 2a needs)
# Tier 2b: unc-31 mutant functional connectivity (no neuropeptide release)
fc_unc31 = atlas.get_signal_propagation_atlas(strain="unc31")
# Returns: 302×302 correlation matrix without neuropeptide modulation
# Tier 2b: Neuropeptide contribution = difference
fc_neuropeptide_contribution = fc_wt - fc_unc31
Both wild-type and unc-31 datasets are production-ready. No manual download from Nature supplement needed — the package handles data access, versioning, and neuron ID normalization.
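One common way to turn two such matrices into a single correlation-of-correlations score is the Pearson r between their off-diagonal entries. The sketch below uses pure Python on toy 3×3 matrices so it runs without extra dependencies; the function names are hypothetical, the numbers are made up, and the real validation scripts may treat missing entries or neuron ordering differently.

```python
from math import sqrt


def off_diagonal(matrix):
    """Flatten all entries with i != j from a square matrix (list of lists)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_of_correlations(simulated, experimental):
    """Pearson r between the off-diagonal entries of two matrices."""
    return pearson_r(off_diagonal(simulated), off_diagonal(experimental))


# Toy matrices for three neurons (illustrative values only)
experimental = [[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.3],
                [0.1, 0.3, 1.0]]
simulated = [[1.0, 0.6, 0.2],
             [0.6, 1.0, 0.2],
             [0.2, 0.2, 1.0]]

r = correlation_of_correlations(simulated, experimental)
print(f"r = {r:.3f}")
```

The diagonal is excluded because self-correlations are trivially 1 and would inflate the score. The same function could be applied to the difference matrices (neuropeptides on minus off versus wild-type minus unc-31) for a Tier 2b style comparison.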
2. ConnectomeToolbox (cect) — Unified connectivity data across five modalities
- Repository: openworm/ConnectomeToolbox (Gleeson et al., in preparation)
- Installation: pip install cect
The ConnectomeToolbox aggregates C. elegans connectivity data into a unified API with five modalities: anatomical, contactome, neurotransmitter atlases, extrasynaptic (neuropeptidergic — Ripoll-Sánchez 2023, Bentley 2016, Pereira 2015, Beets 2022), and functional (Randi 2023). For Tier 2 validation, the functional connectivity modality provides an alternative access path to the same Randi 2023 data:
from cect import ConnectomeDataset
# Access functional connectivity via cect
functional = ConnectomeDataset("Randi2023")
Recommendation: Use wormneuroatlas directly for Tier 2 validation (more mature API for functional connectivity matrices, strain-specific access). Use cect when you need structural + functional connectivity together (e.g., comparing structural predictions to functional observations).
Testing:
pip install wormneuroatlas
python -c "
from wormneuroatlas import NeuroAtlas
atlas = NeuroAtlas()
fc_wt = atlas.get_signal_propagation_atlas(strain='wt')
fc_unc31 = atlas.get_signal_propagation_atlas(strain='unc31')
print(f'Wild-type FC: {fc_wt.shape}')
print(f'unc-31 FC: {fc_unc31.shape}')
print(f'Neuropeptide contribution matrix: {(fc_wt - fc_unc31).shape}')
"
# Expected: (302, 302) for all three
Action Items:
- [ ] Add wormneuroatlas to DD013 Docker validation stage
- [ ] Add cect to DD013 Docker validation stage
- [ ] Pin versions for both in versions.lock
- [ ] Update Tier 2a validation scripts to use wormneuroatlas API
- [ ] Implement Tier 2b (unc-31 comparison) validation script after DD006
Estimated Time Savings: 15-20 hours (no manual data extraction, both APIs are production-ready)
open-worm-analysis-toolbox Revival (DD021)¶
This repo is dormant (last commit Jan 2020). DD021 (Movement Analysis Toolbox and WCON Policy) owns the full revival plan, including:
- 8-task revival roadmap with owners, effort estimates, and dependencies (~33 hours total)
- Python 3.12 compatibility, dependency updates, test suite fixes
- WCON 1.0 format pinning from tracker-commons
- Docker validation stage and versions.lock entries
- API contract for NormalizedWorm and WormFeatures classes
- Relationship to Tierpsy Tracker (modern successor)
See DD021 for the complete revival plan. This is a Phase A (DD013 roadmap) task. Without a working analysis toolbox, Tier 3 validation is impossible.
Note: The archived predecessor repo openworm/movement_validation should not be used — it was superseded by the analysis toolbox.
Integration Test¶
# Step 1: Verify validation tools install
docker compose run shell python -c "
from open_worm_analysis_toolbox import NormalizedWorm, WormFeatures
print('Analysis toolbox loaded successfully')
"
# Step 2: Verify validation data is present
docker compose run shell ls /opt/openworm/validation/data/
# Must show: electrophysiology/, functional_connectivity/, kinematics/, behavioral/
# Step 3: Run validation suite (after simulation)
docker compose run validate
# Verify: output/validation_report.json exists
# Verify: report contains tier2 and tier3 sections
# Verify: CI exit code = 0 if all tiers pass, non-zero if any blocking tier fails
# Step 4: Test regression detection
# Modify a parameter known to degrade locomotion
# Run validation
# Verify: Tier 3 fails with specific metric(s) identified
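Step 4 could be automated along the following lines. The sketch assumes each Tier 3 report stores one numeric value and one pass flag per metric, and that a regression means a metric that passed in a stored baseline now fails; the field names `value` and `passed` and all numbers are hypothetical.

```python
def find_regressions(baseline, current):
    """Return {metric: (baseline_value, current_value)} for metrics that
    passed in the baseline report but fail in the current one."""
    regressions = {}
    for metric, base in baseline.items():
        now = current.get(metric)
        if now is None:
            continue  # metric was not evaluated in this run
        if base["passed"] and not now["passed"]:
            regressions[metric] = (base["value"], now["value"])
    return regressions


# Made-up baseline and current Tier 3 results
baseline = {
    "speed": {"value": 0.21, "passed": True},
    "frequency": {"value": 0.48, "passed": True},
}
current = {
    "speed": {"value": 0.12, "passed": False},
    "frequency": {"value": 0.47, "passed": True},
}

regressed = find_regressions(baseline, current)
for metric, (before, after) in regressed.items():
    print(f"REGRESSION {metric}: {before} -> {after}")
```

Reporting the baseline and current values side by side is what lets the failure name the specific metric(s), as the last verification step requires.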
Coupling Dependencies¶
| I Depend On | DD | What Breaks If They Change |
|---|---|---|
| Neural output format | DD001 | If calcium time series format changes, Tier 1 and Tier 2 validators can't read data |
| Neuropeptide on/off toggle | DD006 | If DD006 enable/disable mechanism changes, Tier 2b (unc-31) validation can't run paired simulations |
| Movement output format | DD003 | If WCON format or particle output changes, Tier 3 movement validator breaks |
| Pharyngeal output format | DD007 | If pumping state format changes, pumping validation breaks |
| Intestinal output format | DD009 | If defecation event format changes, defecation validation breaks |
| Experimental data (OWMeta) | DD008 | If data provenance or versioning changes, validation baselines may shift |
| Docker compose services | DD013 | If validate service configuration changes, validation pipeline breaks |
| Depends On Me | DD | What Breaks If I Change |
|---|---|---|
| CI pipeline (blocking gates) | DD013 | If acceptance criteria change, CI may pass/fail differently |
| PR review (Mind-of-a-Worm) | DD012 | Mind-of-a-Worm references DD010 criteria when checking PR compliance |
| Founder digest (Mad-Worm-Scientist) | AI Agents | If validation report format changes, Mad-Worm-Scientist can't parse regression alerts |
| All subsystem DDs | DD001-DD009 | If a tier's acceptance criteria tighten, previously-passing subsystems may now fail |
Implementation References¶
Open-Worm-Analysis-Toolbox¶
Repository:
https://github.com/openworm/open-worm-analysis-toolbox
Note: The predecessor repo openworm/movement_validation is archived and should not be used. The analysis toolbox is the current, canonical implementation. See DD021 for full repository landscape and revival plan.
Key modules:
- movement_validation/: Statistical feature extraction from WCON files
- comparison/: Compare simulated vs. real
- wcon_parser/: WCON format handling
Usage:
from open_worm_analysis_toolbox import NormalizedWorm, WormFeatures
# Load experimental data
exp_worm = NormalizedWorm.from_schafer_file("schafer_N2_baseline.wcon")
exp_features = WormFeatures(exp_worm)
# Load simulated data
sim_worm = NormalizedWorm.from_simulation("c302_output.wcon")
sim_features = WormFeatures(sim_worm)
# Compare
comparison = exp_features.compare(sim_features)
print(comparison.summary()) # Pass/fail report
Validation Data Repository¶
openworm/validation_data/
├── electrophysiology/
│ ├── goodman1998_touch_neurons.csv
│ ├── lockery_AVA_recordings.csv
│ └── ...
├── functional_connectivity/
│ ├── randi2023_wt_matrix.npy
│ ├── randi2023_unc31_matrix.npy
│ ├── randi2023_metadata.json
│ └── ...
├── kinematics/
│ ├── schafer_N2_baseline.wcon
│ ├── schafer_unc2_mutant.wcon # Calcium channel mutant
│ └── ...
└── behavioral/
  ├── thomas1990_defecation.csv
  ├── raizen1994_pumping_EPG.csv
  └── ...
Licensing: All validation data must be openly accessible (CC-BY or equivalent). Cite original publications.
Migration Path¶
From Manual Validation to Automated CI¶
Current state: Validation is run manually before major releases.
Target state (Phase 1): GitHub Actions CI runs validation suite on every PR to main.
Implementation:
# .github/workflows/validation.yml
name: Model Validation
on: [pull_request]
jobs:
  tier2_functional_connectivity:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Generate c302 network
        run: python c302/CElegans.py C1Differentiated
      - name: Run simulation
        run: jnml LEMS_c302_C1_Differentiated.xml -nogui
      - name: Validate vs. Randi 2023
        run: python scripts/validate_functional_connectivity.py
      - name: Check acceptance
        run: python scripts/check_criteria.py --min_correlation 0.5
  tier3_behavioral:
    runs-on: ubuntu-latest
    steps:
      # Each job starts on a fresh runner, so check out the repository here too
      - uses: actions/checkout@v3
      - name: Run movement validation
        run: cd open-worm-analysis-toolbox && python validate_movement.py
      - name: Check acceptance
        run: python scripts/check_criteria.py --max_deviation 0.15
Boundaries (Explicitly Out of Scope)¶
- Developmental validation: Validating stage-specific models (L1, dauer, male) is Phase 6 work.
- Genetic variation: Validating against natural isolates (Ben-David eQTLs) is Phase 6+ work.
- Pharmacological validation: Drug effects (aldicarb, levamisole) are future work.
Existing Code Resources¶
wormneuroatlas (openworm/wormneuroatlas, PyPI: pip install wormneuroatlas, maintained 2025): Provides direct API access to Randi 2023 functional connectivity via NeuroAtlas.get_signal_propagation_atlas(strain="wt"), returning the 302×302 correlation matrix needed for Tier 2 validation. No manual data download required. Estimated time savings: 15-20 hours.
neuronal-analysis (openworm/neuronal-analysis, 2017, dormant): Tools to produce, analyse and compare simulated and recorded neuronal datasets — directly relevant to Tier 1 electrophysiology validation. May contain reusable single-cell comparison scripts.
owmeta-sciunit (openworm/owmeta-sciunit, 2021): OWMeta-integrated SciUnit types providing formalized Tier 1 single-cell validation test classes with Z-scores, pass/fail, and goodness-of-fit metrics. Recommended tooling for automating Tier 1 electrophysiology validation in CI (DD013).
worm-functional-connectivity (openworm/worm-functional-connectivity, 2023): Alternative/supplementary source for Tier 2 functional connectivity matrices. Check if it includes unc-31 neuropeptide-deficient mutant data alongside wild-type.
NicolettiEtAl2024_MN_IN + NicolettiEtAl2019_NeuronModels (openworm/NicolettiEtAl2024_MN_IN, openworm/NicolettiEtAl2019_NeuronModels): Published HH parameter fits for motor neurons, interneurons, AWCon, and RMD. Expand the Tier 1 calibration set beyond the current ~20 neurons.
References¶
- Sarma et al. 2016 — "Unit testing, model validation, and biological simulation." F1000Research 5:1946.
- Randi et al. 2023 — "Neural signal propagation atlas of Caenorhabditis elegans." Nature 623:406-414.
- Yemini et al. 2013 — "A database of Caenorhabditis elegans behavioral phenotypes." Nature Methods 10:877-879.
- Goodman et al. 1998. "Active currents regulate sensitivity and dynamic range in C. elegans neurons." Neuron 20:763-772.
- Raizen & Avery 1994 — "Electrical activity and behavior in the pharynx of Caenorhabditis elegans." Neuron 12:483-495.
- Thomas 1990 — "The defecation motor program of Caenorhabditis elegans." Genetics 124:855-872.
- Gleeson et al., in preparation. "ConnectomeToolbox: a unified software framework for C. elegans connectivity data." (Manuscript in preparation; cect Python package published.)
- Ripoll-Sánchez et al. 2023. "The neuropeptidergic connectome of C. elegans." Neuron 111:3570-3589. (Extrasynaptic connectivity data in ConnectomeToolbox.)
- Pereira et al. 2015 — "A cellular and regulatory map of the cholinergic nervous system of C. elegans." eLife 4:e12432. (Peptide co-expression data in ConnectomeToolbox.)
- Pearl J, Mackenzie D (2018). The Book of Why: The New Science of Cause and Effect. Basic Books (ISBN: 978-0465097609). Theoretical framework for causal inference — observational data is insufficient for validating causal models; interventional data (perturbations) is required.
- Berman GJ, Choi DM, Bialek W, Shaevitz JW (2014). "Mapping the stereotyped behaviour of freely moving fruit flies." J R Soc Interface 11:20140672. Unsupervised behavioral decomposition — systematic approach to identifying behavioral motifs from continuous recordings.
- Pereira TD et al. 2022 — "SLEAP: A deep learning system for multi-animal pose estimation." Nature Methods 19:486-495. Deep learning pose estimation for high-precision behavioral quantification.
- Datta SR, Anderson DJ, Branson K, Perona P, Leifer A (2019). "Computational neuroethology: a call to action." Neuron 104:11-24. Framework for treating behavior as a high-dimensional continuous signal — relevant to how we quantify simulated vs. real movement.
- Chalfie M, Sulston JE, White JG, Southgate E, Thomson JN, Brenner S (1985). "The neural circuit for touch sensitivity in Caenorhabditis elegans." J Neurosci 5:956-964. Foundational touch neuron ablation data — Tier 4 causal validation target.
- Haspel G et al. (2023). "To reverse engineer an entire nervous system." arXiv [q-bio.NC] 2308.06578. White paper on observational and perturbational completeness in C. elegans neuroscience — motivates Tier 4 validation.
- Kato S, Kaplan HS, Schrodel T, Skora S, Lindsay TH, Yemini E, Lockery S, Zimmer M (2015). "Global brain dynamics embed the motor command sequence of Caenorhabditis elegans." Cell 163:656-669. Whole-brain calcium imaging showing PCA structure of neural dynamics — PC1 separates forward vs. backward locomotion command neurons.
- Zhao M, Wang N, Jiang X, et al. (2024). "An integrative data-driven model simulating C. elegans brain, body and environment interactions." Nature Computational Science 4(12):978-990. MetaWorm model — 136-neuron circuit with neurite-level spatial detail, demonstrates PCA validation, gap junction perturbation, and closed-loop chemotaxis.
- Approved by: OpenWorm Steering
- Implementation Status: Partial
- Tier 1 (single-cell electrophysiology): Scripts exist but not automated (non-blocking currently)
- Tier 2a (functional connectivity): Randi 2023 data accessible via wormneuroatlas API; no manual ingestion needed (blocking)
- Tier 2b (neuropeptide unc-31 comparison): Randi 2023 unc-31 data also in wormneuroatlas; blocked on DD006 implementation (Phase 2)
- Tier 3 (behavioral kinematics): BLOCKED. open-worm-analysis-toolbox is dormant (last commit Jan 2020, broken on Python 3.12)
See DD021 (Movement Analysis Toolbox and WCON Policy) for the complete toolbox revival plan (8 tasks, ~33 hours). Tier 3 validation cannot run until the toolbox is revived and installable on Python 3.12.
Next Actions:
- URGENT: Prioritize DD021 toolbox revival as Phase A work (parallel with DD013)
- Appoint Validation L4 Maintainer to own revival (see ClickUp task 868hjdzqy)
- Add wormneuroatlas + cect to Docker validation stage; Randi 2023 data is already accessible via API (no manual ingestion needed)
- After DD006: Implement Tier 2b (unc-31 comparison) validation script
- After DD013: Implement Steps 4-5 in master_openworm.py (validation pipeline)
- Set up GitHub Actions CI with Tier 2a+2b+3 blocking gates