DD024: Validation Data Acquisition Pipeline

  • Status: Proposed (Phase A — Infrastructure)
  • Author: OpenWorm Core Team
  • Date: 2026-02-21
  • Supersedes: None
  • Related: DD010 (Validation Framework), DD008 (Data Integration Pipeline), DD020 (Connectome Data Access), DD013 (Simulation Stack), DD021 (Movement Analysis Toolbox)

Phase: Phase 1: Core Neural Platform | Layer: Data Infrastructure

Quick Action Reference

| Question | Answer |
|---|---|
| What does this produce? | Version-controlled repository of all experimental datasets needed for DD010 validation across tiers 1-4 and all subsystem DDs |
| Success metric | Every DD010 validation test can run against locally cached, versioned data without requiring external API calls at runtime |
| Repository | openworm/validation-data (new repo); issues labeled dd024 |
| Config toggle | validation.data_path: /opt/openworm/validation/data/ in openworm.yml |
| Build & test | docker compose run shell python scripts/verify_validation_data.py (checks all datasets present, checksums match) |
| Visualize | N/A (data infrastructure, not a model) |
| CI gate | Docker build fails if any required dataset is missing or has incorrect checksum |

TL;DR

Every subsystem DD (DD001-DD009, DD018-DD019) specifies validation targets that depend on published experimental data, but no DD owns the systematic acquisition, formatting, and version control of that data. This DD fills that gap. It catalogs every dataset referenced by DD010's four validation tiers, defines how each is acquired (API, supplement download, manual digitization), what format it is stored in, and where it lives in the openworm/validation-data repository. This is Phase A infrastructure — without clean, versioned validation data, no validation tier can function. This DD also serves as the canonical dataset inventory for all phases, consolidating validation, implementation, and projected datasets in one place (see Phase Roadmap for implementation timeline).


Goal & Success Criteria

| Criterion | Target |
|---|---|
| Primary: Dataset completeness | Every DD010 validation test has its required experimental data committed to openworm/validation-data |
| Secondary: Format standardization | All datasets in machine-readable formats (CSV, NumPy, WCON, NeuroML) with README metadata |
| Tertiary: Reproducibility | Checksums recorded; any contributor can verify data integrity with a single command |
| Quaternary: Licensing compliance | Every dataset has a LICENSE file documenting redistribution rights and original DOI |

Deliverables

| Artifact | Path (in openworm/validation-data) | Format |
|---|---|---|
| Repository with all datasets | openworm/validation-data | Git repo |
| Data manifest | manifest.json | JSON: dataset_id, source, DOI, license, checksum, format, DD_consumer |
| Verification script | scripts/verify_validation_data.py | Python |
| Per-dataset README | {category}/{dataset}/README.md | Markdown with provenance |
| Docker data volume | Baked into DD013 Docker validation stage | Directory tree at /opt/openworm/validation/data/ |
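
As a rough illustration of the manifest, one entry could be modeled as below. The field names come from the manifest.json row of the deliverables table. The example values, the top-level list layout, and the `validate_entry` helper are assumptions made for this sketch, not a schema this DD defines.

```python
import json

# Field names taken from the manifest.json deliverable; everything else is illustrative.
REQUIRED_FIELDS = (
    "dataset_id", "source", "DOI", "license", "checksum", "format", "DD_consumer",
)


def validate_entry(entry: dict) -> list[str]:
    """Return the required manifest fields that are missing or empty."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]


# Hypothetical entry; the checksum is a placeholder, not a real hash.
example_entry = {
    "dataset_id": "thomas1990_defecation",
    "source": "Thomas 1990, Genetics 124:855-872",
    "DOI": "10.1534/genetics.124.4.855",
    "license": "Fair use (digitized from published figures)",
    "checksum": "<sha256 hex digest>",
    "format": "CSV",
    "DD_consumer": ["DD009", "DD010"],
}

# Assumed layout: manifest.json is a JSON list of such entries.
manifest_json = json.dumps([example_entry], indent=2)
```

Calling `validate_entry(example_entry)` returns an empty list. An entry with missing fields returns their names, which a verification script could report.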

Repository & Issues

| Item | Value |
|---|---|
| Repository | openworm/validation-data (new, to be created) |
| Issue label | dd024 |
| Milestone | Phase A: Infrastructure Bootstrap |
| Branch convention | dd024/dataset-name (e.g., dd024/randi2023-functional-connectivity) |

Complete Dataset Inventory

Tier 1: Single-Cell Electrophysiology

| Dataset | Source Publication | Neurons Covered | Format Needed | Acquisition Method | Consumer DD | Priority | Phase |
|---|---|---|---|---|---|---|---|
| Touch neuron patch-clamp (V_rest, R_in, I-V) | Goodman et al. 1998, Neuron 20:763-772 | ALM, AVM, PLM (~6 neurons) | CSV: neuron, protocol, V/I traces | Digitize from paper figures or request from authors | DD001, DD005 | High | Phase 1 |
| AVA interneuron recordings | Lockery lab (unpublished / personal communication) | AVA | CSV: time, V, I | Request from Lockery lab | DD001, DD005 | High | Phase 1 |
| ASH nociceptor electrophysiology | Hilliard et al. 2002 | ASH | CSV: time, V, I, stimulus | Digitize from paper | DD005 | Medium | Phase 2+ |
| AWC olfactory neuron recordings | Chalasani et al. 2007, Nature 450:63-70 | AWC | CSV: time, V, I, odor | Digitize from paper | DD005 | Medium | Phase 2+ |
| RIA compartmentalized calcium | Hendricks et al. 2012, Nature 487:99-103 | RIA | CSV: time, Ca_proximal, Ca_distal | Supplement or digitize | DD001 (Level D) | Medium | Phase 2 |
| AWA calcium action potentials | Liu et al. 2018, Cell 175:57-70 | AWA | CSV: time, V, Ca | Supplement data | DD001 (Level D) | Medium | Phase 2+ |
| MEC-4 channel kinetics | O'Hagan et al. 2005, Nat Neurosci 8:43-50 | Touch receptor | CSV: strain, current, activation/inactivation curves | Digitize from paper | DD019 | High | Phase A |
| Pharyngeal muscle plateau potentials | Raizen & Avery 1994, Neuron 12:483-495 | pm3-pm8 | CSV: time, V (intracellular recording) | Digitize from paper figures | DD007 | Medium | Phase A |
| Electrophysiology training set (~20 neurons) | Goodman lab, Lockery lab, published papers | ~20 neuron classes | CSV: neuron_class, channel, measured_g, source_doi | Curate from multiple papers | DD005 | High | Phase 1 |

Tier 2: Functional Connectivity

| Dataset | Source Publication | Scale | Format Needed | Acquisition Method | Consumer DD | Priority | Phase |
|---|---|---|---|---|---|---|---|
| Whole-brain functional connectivity (wild-type) | Randi et al. 2023, Nature 623:406-414 | 302x302 correlation matrix | NumPy .npy | Already available via wormneuroatlas API; extract and cache locally | DD001, DD005, DD010 | Critical | Phase A |
| Whole-brain functional connectivity (unc-31 mutant) | Randi et al. 2023 (same paper, supplemental) | 302x302 | NumPy .npy | Via wormneuroatlas API (strain="unc31") | DD010 (Tier 4) | High | Phase 1 |
| Signal propagation atlas | Randi et al. 2023 | Directed functional connectivity | NumPy .npy | Via wormneuroatlas API | DD001, DD005 | High | Phase 1 |
| Whole-brain activity during behavioral states | Atanas et al. 2022, bioRxiv | Time series per neuron during dwelling/roaming | HDF5 or CSV | Download from supplement / request | DD006, DD010 | Medium | Phase 2+ |

Tier 3: Behavioral Kinematics

| Dataset | Source Publication | Content | Format Needed | Acquisition Method | Consumer DD | Priority | Phase |
|---|---|---|---|---|---|---|---|
| N2 wild-type locomotion baseline | Schafer lab / Yemini et al. 2013, Nat Methods 10:877-879 | Speed, wavelength, frequency, amplitude | WCON (Worm Tracker Commons) | Download from wormbase.org/tools/tracker or Open Worm Movement Database | DD001, DD002, DD003, DD010 | Critical | Phase A |
| unc-2 (Cav2) mutant locomotion | Schafer lab | Reduced speed, altered gait | WCON | Same source as above | DD010 (Tier 4) | High | Phase 1 |
| N2 behavioral phenotype statistics | Yemini et al. 2013 | Population means, CVs for ~700 features | CSV from supplement | Download supplementary data | DD010 (±15% threshold grounding) | High | Phase A |
| Defecation cycle periods | Thomas 1990, Genetics 124:855-872 | ~50s period, posterior-to-anterior wave | CSV: animal_id, cycle_start, cycle_end, period | Digitize from Table 1 | DD009 | High | Phase A |
| Pharyngeal pumping EPG | Raizen & Avery 1994, Neuron 12:483-495 | 3-4 Hz pumping frequency, EPG waveform | CSV: time, voltage | Digitize from figures | DD007 | Medium | Phase A |
| Egg-laying bout statistics | Collins et al. 2016, eLife 5:e21126 | Inactive/active bout durations, eggs per bout | CSV from supplement | Download supplement | DD018 | Medium | Phase 3 |
| Touch response latency | Chalfie et al. 1985, J Neurosci 5:956-964 | Reversal onset 300-800 ms | CSV: stimulus_type, latency | Digitize from paper | DD019 | High | Phase A |
| Foraging behavior decomposition | Flavell et al. 2020, Genetics 216:315-332 | Dwelling/roaming state durations, transition rates | CSV: state, duration, transition_probability | Digitize from paper or request | DD006 | Medium | Phase 2 |
| Chemotaxis behavioral data | Iino & Yoshida 2009, Bargmann & Horvitz 1991 | Chemotaxis assay | CSV: chemotaxis index, trajectory data | Digitize from papers | DD022 | Medium | Phase 2 |
| Thermotaxis behavioral data | Hedgecock & Russell 1975, Mori & Ohshima 1995 | Thermotaxis assay | CSV: isothermal tracking, cultivation temp preference | Digitize from papers | DD022 | Medium | Phase 2 |
| B-class motor neuron stretch response | Wen et al. 2012 | DB, VB neurons | Calcium imaging (DB, VB response to body bending) | Extract from paper | DD023 | Medium | Phase 2 |

Tier 4: Causal / Interventional

| Dataset | Source Publication | Intervention | Expected Phenotype | Format Needed | Consumer DD | Priority | Phase |
|---|---|---|---|---|---|---|---|
| Touch neuron ablation | Chalfie et al. 1985 | Laser ablation of ALM, AVM, PLM | Loss of gentle touch response | CSV: ablated_neurons, stimulus, response | DD019, DD010 | High | Phase 2 |
| Pharyngeal neuron ablation | Avery & Horvitz 1989, Neuron 3:473-485 | Laser killing of pharyngeal neurons | Pumping persists (semi-autonomous) | CSV: ablated_neurons, pumping_frequency | DD007, DD010 | Medium | Phase 3 |
| Neuropeptide knockouts (FLP, NLP) | Li et al. 1999; Rogers et al. 2003 | Gene deletion | Altered locomotion | CSV: genotype, speed, reversal_rate | DD006, DD010 | High | Phase 2 |
| unc-103 loss-of-function | Collins & Koelle 2013, J Neurosci 33:761-775 | ERG channel removal from vm2 | Constitutive egg-laying | CSV: genotype, egg_count, bout_pattern | DD018, DD010 | Medium | Phase 3 |
| egl-1 gain-of-function | Trent et al. 1983, Genetics 104:619-647 | HSN cell death | Egg-laying defective | CSV: genotype, phenotype_class | DD018, DD010 | Medium | Phase 3 |
| Optogenetic single-neuron activation | Leifer et al. 2011, Nat Methods 8:147-152 | Light activation of specific neurons | Stimulus-specific behavioral response | CSV: neuron, stimulus, behavior | DD010 | Low (Phase 3+) | Phase 3+ |

Connectome & Molecular (Supporting)

| Dataset | Source | Content | Format Needed | Acquisition Method | Consumer DD | Status | Phase |
|---|---|---|---|---|---|---|---|
| Synaptic + gap junction connectome | Cook et al. 2019, Nature 571:63-71 | Adjacency matrices | Already in cect | Via ConnectomeToolbox API | DD001, DD020 | Available | Phase 0 |
| Developmental connectomes | Witvliet et al. 2021, Nature 596:257-261 | 8 animals L1-adult | Already in cect | Via ConnectomeToolbox API | DD001, DD004, DD020 | Available | Phase 0 |
| Neuropeptidergic connectome | Ripoll-Sanchez et al. 2023, Neuron 111:3570-3589 | 31,479 interactions | CSV + in cect | Supplement Table S1 + ConnectomeToolbox | DD006, DD020 | Available | Phase 2 |
| CeNGEN L4 expression | Taylor et al. 2021, Cell 184:4329-4347 | 128 classes x 20,500 genes | CSV (TPM) | Via wormneuroatlas API or cengen.org download | DD005 | Available | Phase 1 |
| WBbt cell ontology | WormBase | 959 somatic cell IDs | OWL/OBO | WormBase download | DD004 | Available | Phase 4 |
| 3D nuclear positions | Long et al. 2009, Nat Methods 6:667-672 | 357 nuclei at L1 | CSV: cell_name, x, y, z | Supplement | DD004, DD006 | Needs acquisition | Phase 2 |
| NeuroPAL neuron ID atlas | Yemini et al. 2021, Cell 184:272-288 | Color atlas for all neurons | Reference images + ID mapping | Published with reagents | DD005 | Reference only | Reference |
| Ion channel gene list | WormBase, CeNGEN | ~100 ion channel genes | CSV: gene_symbol, channel_family, neuroml_model | Curate from WormBase | DD005 | High | Phase 1 |

Implementation & Reference Datasets

These datasets are inputs to model building (not validation). They are included here as the canonical inventory so there is a single source of truth for all datasets across all phases.

| Dataset | Source | Content | Format | Consumer DD | Status | Phase |
|---|---|---|---|---|---|---|
| CeNGEN pharyngeal/intestinal/reproductive expression | cengen.org | Cell-type-specific expression for non-neural cells | CSV (subset of L4 expression) | DD007, DD009, DD018 | Available (filter CeNGEN L4 by cell type) | Phase 3 |
| CE_locomotion stretch receptor model | openworm/CE_locomotion | C++ reference implementation (StretchReceptor.cpp) | C++ source | DD023 | Available (repo active 2026-02-18) | Phase 2 |
| BAAIWorm NMODL ion channel files and SWC morphology data | github.com/Jessie940611/BAAIWorm, Apache 2.0, Zenodo: 10.5281/zenodo.13951773 | NMODL (.mod) + SWC (.swc) for multicompartmental neurons | NMODL + SWC | DD001 Level D Stage 1 | Available (open-source) | Phase 2 |
| Virtual Worm Blender meshes | Blender2NeuroML repo (Grove & Sternberg 2012) | 688 anatomical meshes, ~1.6M vertices | .blend file | DD014.2 | Available (Virtual_Worm_February_2012.blend) | Phase 4 |
| Witvliet 2021 cell boundary meshes | Nature 596:257 supplement | 3D EM reconstructions per cell | OBJ or STL per cell | DD004 | Needs extraction/conversion from EM data | Phase 4 |
| Cell-type mechanical properties | Literature review (biomechanics) | Elasticity, adhesion per tissue type | CSV: cell_type, elasticity_mult, adhesion_strength | DD004 | Needs curation from biomechanics literature | Phase 4 |
| Ion channels with known kinetics | PDB + electrophysiology literature | ~50-100 channels with measured HH parameters | CSV: channel, structure, V_half, k, tau | DD025 | Needs curation | Phase A |
| C. elegans ion channel sequences | WormBase | All C. elegans ion channel protein sequences | FASTA | DD025 | Available | Phase A |
| SPH simulation training set | Generate from Sibernetic | 500+ runs: muscle activation to trajectory pairs | HDF5 | DD017 Component 2 (surrogate) | Generate in Phase 3 (~2,500 GPU-hours) | Phase 3 |
| Sensory neuron calcium imaging | Suzuki 2003/2008, Chalasani 2007 | Stimulus to Ca response curves | CSV | DD017 Component 4 (learned sensory) | Extract from papers | Phase 3 |

Projected Datasets (Phases 5-7)

These datasets are projected needs for future phases. They will be specified in detail when their parent DDs are written.

| Dataset | Phase | Consumer DD | Status |
|---|---|---|---|
| Biochemical rate constants (PLC activity, cAMP degradation, PKA/PKC kinetics) | Phase 5 | Intracellular Signaling (TBD) | May be predicted via BioEmu-1 + Boltz-2 |
| Protein abundance (proteomics for C. elegans neurons/muscles) | Phase 5 | Intracellular Signaling (TBD) | Needs proteomics data |
| Calcium imaging with subcellular resolution (ER, mitochondria, plasma membrane) | Phase 5 | Intracellular Signaling (TBD) | Future experimental technique |
| C. elegans GPCR-G protein coupling specificity | Phase 5 | Intracellular Signaling (TBD) | Partially available from WormBase |
| Witvliet series connectomes (8 stages) | Phase 6 | Developmental Modeling (TBD) | Available via cect WitvlietDataReader1-8 |
| CeNGEN L1 expression | Phase 6 | Developmental Modeling (TBD) | Available but less mature than L4 |
| Packer 2019 embryonic scRNA-seq | Phase 6 | Developmental Modeling (TBD) | Needs ingestion |
| Developmental behavioral data (L1-L4 locomotion, feeding) | Phase 6 | Developmental Modeling (TBD) | Needs curation |
| DevoWorm embryogenetic connectome | Phase 6 | Developmental Modeling (TBD) | Available (open-source) |
| DevoWorm differentiation trees | Phase 6 | Developmental Modeling (TBD) | Available |
| Cook 2019 male connectome | Phase 7 | Male-Specific Modeling (TBD) | Available via cect |
| Male behavioral data (mating assays) | Phase 7 | Male-Specific Modeling (TBD) | Needs curation |
| Male-specific anatomy (tail SPH model) | Phase 7 | Male-Specific Modeling (TBD) | Needs creation |

Data Format Standards

All datasets in openworm/validation-data must follow these conventions:

File Formats

| Data Type | Format | Justification |
|---|---|---|
| Time series (V, Ca, I) | CSV with header row: time_ms, value, unit | Universal readability, git-friendly |
| Correlation matrices | NumPy .npy with companion .json metadata | Efficient for large matrices, metadata preserves neuron ordering |
| Movement trajectories | WCON 1.0 (per DD021) | Standard format for C. elegans tracking data |
| Behavioral statistics | CSV with header: genotype, metric, mean, std, n, source_doi | Machine-readable, self-documenting |
| Intervention/ablation data | CSV with header: genotype_or_ablation, stimulus, metric, value, n, source_doi | Uniform causal validation format |
| Expression matrices | CSV (gene x cell class) or via wormneuroatlas API cache | Consistent with DD005 pipeline |
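
The CSV header conventions above lend themselves to a mechanical check. The sketch below compares a file's first row against the expected header for its data type. The header lists are copied from the table. The dictionary keys, the `header_matches` function, and the sample values are hypothetical names and data chosen for illustration.

```python
import csv
import io

# Expected header rows, copied from the File Formats table.
# The dictionary keys are hypothetical labels, not names defined by this DD.
EXPECTED_HEADERS = {
    "time_series": ["time_ms", "value", "unit"],
    "behavioral_statistics": ["genotype", "metric", "mean", "std", "n", "source_doi"],
    "intervention": [
        "genotype_or_ablation", "stimulus", "metric", "value", "n", "source_doi",
    ],
}


def header_matches(csv_text: str, data_type: str) -> bool:
    """Return True if the first CSV row equals the expected header for data_type."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [cell.strip() for cell in header] == EXPECTED_HEADERS[data_type]


# Illustrative time-series content; the values are made up, not experimental data.
sample = "time_ms,value,unit\n0.0,-65.0,mV\n0.1,-64.8,mV\n"
```

Here `header_matches(sample, "time_series")` is true, while the same text checked as `"behavioral_statistics"` is rejected.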

Metadata Requirements

Every dataset directory must contain a README.md with:

# Dataset: {name}

**Source:** {author} et al. ({year}), *{journal}* {volume}:{pages}
**DOI:** {doi}
**License:** {license or "Fair use — digitized from published figures"}
**Acquired:** {date}
**Acquired by:** {person or script}

## Description
{What this dataset contains and why it matters for validation}

## Files
| File | Description | Rows | Columns |
|------|-------------|------|---------|
| ... | ... | ... | ... |

## Provenance
{How the data was obtained: API call, supplement download, figure digitization}
{Any transformations applied: unit conversion, column renaming, neuron ID normalization}

## Consumer DDs
{Which DDs use this data and for what validation tier}
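
A README that follows this template can be checked for completeness with a few string comparisons. The following is a minimal sketch, assuming the bold field labels and section headings appear exactly as in the template. The function name `missing_readme_parts` is hypothetical.

```python
# Labels and headings copied from the README template above.
REQUIRED_FIELDS = (
    "**Source:**", "**DOI:**", "**License:**", "**Acquired:**", "**Acquired by:**",
)
REQUIRED_SECTIONS = ("## Description", "## Files", "## Provenance", "## Consumer DDs")


def missing_readme_parts(text: str) -> list[str]:
    """Return the template fields and sections that the README text lacks."""
    lines = [line.strip() for line in text.splitlines()]
    # A field counts as present when some line starts with its bold label.
    missing = [
        field for field in REQUIRED_FIELDS
        if not any(line.startswith(field) for line in lines)
    ]
    # A section counts as present when its heading appears on a line by itself.
    missing += [section for section in REQUIRED_SECTIONS if section not in lines]
    return missing
```

An empty return value means every required field and section is present. Otherwise the list names what to add.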

Checksums

A root checksums.sha256 file records the SHA-256 hash of every data file. The verification script (verify_validation_data.py) checks every hash each time it runs, including during the Docker build.
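
The hash check could be implemented roughly as follows. This sketch assumes checksums.sha256 uses the same line layout as the `sha256sum` tool (hex digest, whitespace, relative path), which this DD does not specify. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large matrices need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root: Path, checksum_file: str = "checksums.sha256") -> list[str]:
    """Return a list of problems; an empty list means every hash matches."""
    problems = []
    for line in (root / checksum_file).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumed layout: "<hex digest>  <relative path>", as written by sha256sum.
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha256sum marks binary mode with '*'
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected.lower():
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Returning a list of problems rather than failing on the first one lets a contributor see every broken file in a single run.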


Acquisition Priorities

Acquisition priorities align with the Phase Roadmap implementation schedule. Datasets are prioritized within each phase by blocking impact.

Phase A (Weeks 1-4) — Must Have

These datasets block the two critical validation tiers (Tier 2 and Tier 3) or the Phase A subsystem validations for DD007, DD009, and DD019:

  1. Randi 2023 functional connectivity (Tier 2) — Extract from wormneuroatlas API, cache as .npy. ~2 hours.
  2. Schafer lab N2 baseline WCON (Tier 3) — Download from Worm Tracker database or WormBase. ~4 hours (format verification).
  3. Yemini 2013 behavioral statistics (Tier 3 threshold grounding) — Download supplement CSV. ~1 hour.
  4. Thomas 1990 defecation periods (DD009 Tier 3) — Digitize Table 1. ~2 hours.
  5. Raizen 1994 pumping frequency (DD007 Tier 3) — Digitize from figures. ~3 hours.
  6. O'Hagan 2005 MEC-4 kinetics (DD019 Tier 1) — Digitize activation/inactivation curves. ~4 hours.
  7. Chalfie 1985 touch response (DD019 Tier 3 + Tier 4) — Digitize latency data. ~2 hours.

Estimated total: ~18 hours

Phase 1 (Months 1-3) — Should Have

  1. Goodman 1998 touch neuron electrophysiology (Tier 1) — Digitize I-V curves. ~4 hours.
  2. Collins 2016 egg-laying statistics (DD018 Tier 3) — Download supplement. ~2 hours.
  3. Flavell 2020 dwelling/roaming statistics (DD006 Tier 4) — Digitize or request. ~3 hours.
  4. Randi 2023 unc-31 mutant (Tier 4) — Extract from wormneuroatlas. ~1 hour.
  5. Long 2009 3D nuclear positions (DD004, DD006) — Download supplement. ~2 hours.
  6. Schafer lab unc-2 mutant WCON (Tier 4) — Download from tracker database. ~2 hours.

Phase 2+ — Nice to Have

  1. Hendricks 2012 RIA compartmentalized calcium — For DD001 Level D validation.
  2. Liu 2018 AWA calcium data — For DD001 Level D validation.
  3. Atanas 2022 whole-brain behavioral states — For DD006 state transition validation.
  4. Optogenetic perturbation data (Leifer, Randi) — As published datasets become available.
  5. Hilliard 2002 ASH, Chalasani 2007 AWC — For DD005 calibration training set expansion.

How to Build & Test

Prerequisites

  • Docker with docker compose (DD013 simulation stack)
  • OR: Python 3.10+, pip
  • For data acquisition scripts: wormneuroatlas, cect (ConnectomeToolbox), pandas, numpy
  • For WCON validation: tracker-commons Python package (per DD021)

Getting Started (Environment Setup)

There are two paths: Docker (recommended for verification) and native Python (for data acquisition and curation).

Clone the repository:

git clone https://github.com/openworm/validation-data.git
cd validation-data

Path A — Docker (verification and CI):

Validation data is baked into the DD013 Docker build at the validation stage. To verify all datasets:

# From the OpenWorm meta-repo:
cd /path/to/OpenWorm
docker compose run shell python scripts/verify_validation_data.py
# Green light: all datasets present, all checksums match, all READMEs exist

The Docker build copies data from the openworm/validation-data repo into /opt/openworm/validation/data/ and runs the verification script automatically. If any dataset is missing or has an incorrect checksum, the Docker build fails.

Path B — Native Python (data acquisition and curation):

cd validation-data

# Install verification and acquisition dependencies
pip install numpy pandas scipy

# Install data source APIs
pip install wormneuroatlas          # Randi 2023 functional connectivity, CeNGEN expression
pip install cect                    # ConnectomeToolbox: connectome data access (DD020)

# Install WCON format tools (for behavioral kinematics data)
pip install tracker-commons         # per DD021

# Verify existing datasets
python scripts/verify_validation_data.py

For acquiring new datasets, see the acquisition workflow below. Each dataset requires data files, a README.md with provenance metadata, a manifest entry, and a SHA-256 checksum.

Adding a New Dataset

# 1. Create dataset directory
mkdir -p data/tier3_behavioral/thomas1990_defecation/

# 2. Add data files
cp digitized_data.csv data/tier3_behavioral/thomas1990_defecation/defecation_periods.csv

# 3. Write README.md with provenance
# (follow template above)

# 4. Update manifest
python scripts/update_manifest.py \
    --dataset thomas1990_defecation \
    --doi "10.1534/genetics.124.4.855" \
    --consumer-dds DD009,DD010 \
    --tier 3

# 5. Update checksums
python scripts/update_checksums.py

# 6. Verify everything
python scripts/verify_validation_data.py
# Green light: all datasets present, all checksums match, all READMEs exist
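
Step 5 relies on scripts/update_checksums.py, which this DD names but does not specify. One plausible shape for it is sketched below. It assumes that every file under data/ is hashed (READMEs included) and that the output uses `sha256sum`-style lines. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def build_checksum_lines(root: Path, data_dir: str = "data") -> list[str]:
    """Build one '<digest>  <relative path>' line per file under data_dir."""
    lines = []
    # Sorting keeps the output stable, so git diffs show only real changes.
    for path in sorted((root / data_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
    return lines


def write_checksums(root: Path, out_name: str = "checksums.sha256") -> Path:
    """Regenerate the root checksum file and return its path."""
    out_path = root / out_name
    out_path.write_text("\n".join(build_checksum_lines(root)) + "\n")
    return out_path
```

Running `write_checksums(Path("."))` from the repository root would regenerate the file after a dataset is added or changed.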

Verification Script

# scripts/verify_validation_data.py
# Checks:
# 1. All datasets listed in manifest.json exist on disk
# 2. SHA-256 checksums match
# 3. Every dataset directory has a README.md
# 4. CSV files are parseable with expected columns
# 5. NumPy files have expected shapes
# 6. WCON files pass basic format validation
# Exit code 0 = all checks pass; non-zero = failure with specific error
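
Checks 1 and 3 from the list above, plus the exit-code convention, could be sketched as follows. The sketch assumes manifest.json is a JSON list whose entries carry a `dataset_id` and a `path` field. The `path` field is hypothetical and is not among the manifest fields this DD lists.

```python
import json
import sys
from pathlib import Path


def check_manifest(root: Path) -> list[str]:
    """Check that each manifest entry has a directory containing a README.md."""
    errors = []
    manifest = json.loads((root / "manifest.json").read_text())
    for entry in manifest:
        dataset_dir = root / entry["path"]  # 'path' is an assumed manifest field
        if not dataset_dir.is_dir():
            errors.append(f"{entry['dataset_id']}: directory missing ({entry['path']})")
        elif not (dataset_dir / "README.md").is_file():
            errors.append(f"{entry['dataset_id']}: README.md missing")
    return errors


def main(root: Path) -> int:
    """Print each failure to stderr; return 0 on success, 1 otherwise."""
    errors = check_manifest(root)
    for err in errors:
        print(f"FAIL {err}", file=sys.stderr)
    return 1 if errors else 0
```

A script entry point would call `sys.exit(main(Path(".")))`, so the Docker build step fails when any check does.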

Context & Background

ConnectomeToolbox (cect) — DD020

cect already provides programmatic access to connectome datasets (Cook 2019, Witvliet 2021, Ripoll-Sanchez 2023). DD024 does NOT duplicate this. Instead:

  • Connectome data remains in cect (the canonical API)
  • DD024 stores validation data (experimental recordings, behavioral measurements) that cect does not cover
  • For Randi 2023 functional connectivity, DD024 caches the output of wormneuroatlas API calls as static files to avoid runtime dependencies
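
The caching step in the last bullet might look like the sketch below, which writes a matrix to .npy alongside a .json companion that preserves neuron ordering. It deliberately takes the matrix and neuron IDs as plain arguments rather than calling wormneuroatlas, so no API usage is implied. The function names and metadata keys are assumptions.

```python
import json
from pathlib import Path

import numpy as np


def cache_matrix(matrix: np.ndarray, neuron_ids: list[str], out_stem: Path, source: str) -> None:
    """Write <out_stem>.npy plus a <out_stem>.json companion with neuron ordering."""
    if matrix.shape != (len(neuron_ids), len(neuron_ids)):
        raise ValueError("matrix shape does not match neuron ordering")
    np.save(out_stem.with_suffix(".npy"), matrix)
    metadata = {
        "neuron_ids": neuron_ids,  # row/column order of the matrix
        "shape": list(matrix.shape),
        "dtype": str(matrix.dtype),
        "source": source,
    }
    out_stem.with_suffix(".json").write_text(json.dumps(metadata, indent=2))


def load_cached(out_stem: Path) -> tuple[np.ndarray, list[str]]:
    """Load the cached matrix together with its neuron ordering."""
    matrix = np.load(out_stem.with_suffix(".npy"))
    metadata = json.loads(out_stem.with_suffix(".json").read_text())
    return matrix, metadata["neuron_ids"]
```

Storing the ordering next to the matrix means downstream validation code never has to query the API to interpret row and column indices.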

OWMeta (DD008)

OWMeta is the semantic knowledge graph for C. elegans biological facts. DD024 complements it:

  • OWMeta stores structured biological knowledge (cell types, gene functions, anatomical relationships)
  • DD024 stores quantitative experimental recordings used specifically for model validation
  • In Phase 3+, OWMeta may ingest DD024 data as validation-specific data types

Docker Integration (DD013)

DD024 data is baked into the Docker validation stage at build time:

# In DD013 multi-stage Dockerfile
FROM validation-base AS validation
COPY --from=openworm/validation-data:v1.0 /data /opt/openworm/validation/data/
RUN python scripts/verify_validation_data.py

Data is NOT downloaded at runtime — all validation data is pre-packaged for reproducibility and offline CI.


Alternatives Considered

1. Download All Datasets Upfront

Rejected: Multi-GB downloads slow onboarding; some datasets require API access or author permission. A phased acquisition pipeline is more practical.

2. Rely on Published Summary Statistics Only

Rejected: Raw traces are needed for correlation matrix computation (DD010 Tier 2) and waveform-level comparisons. Summary statistics alone are insufficient for rigorous validation.

3. Custom Data Format

Rejected: Using existing community formats (NWB, WCON per DD021) maximizes interoperability and reduces maintenance burden.

4. Manual Data Curation Only

Rejected: An automated pipeline with manifest.json, checksums, and verification scripts ensures reproducibility and freshness across contributors, which manual curation alone cannot guarantee.


Quality Criteria

  1. Provenance metadata: All acquired datasets must have provenance metadata (source, version, download date, checksum) in their README.md.
  2. Refreshable cache: Cached data must be refreshable without breaking downstream validation scripts.
  3. Version pinning: Dataset versions must be pinned in checksums.sha256 for reproducibility.
  4. CI compatibility: Data acquisition and verification scripts must run successfully in CI (Docker environment per DD013).

Boundaries (Out of Scope)

  1. Raw imaging data: We store derived/processed data (correlation matrices, extracted features, digitized traces), not raw calcium imaging volumes or EM stacks. Raw data is too large for Git and available from original authors.

  2. Proprietary or restricted data: Only openly redistributable data (CC-BY, CC0, or fair-use digitization from published figures) is included. Data requiring DTA or institutional agreement is documented in the manifest but not stored.

  3. Simulation output data: DD024 stores experimental data for validation. Simulated reference outputs (baseline scores, expected trajectories) are generated by DD010/DD013 CI pipeline, not pre-stored.

  4. Data analysis scripts: Scripts that use validation data (correlation computation, feature extraction) live in their respective DD repositories (DD010, DD021). DD024 only stores data and verification scripts.


Configuration

# openworm.yml section
validation:
  data_path: /opt/openworm/validation/data/   # Docker default
  data_version: "v1.0"                         # Tag in openworm/validation-data
  verify_checksums: true                        # Verify on Docker build

References

  1. Randi F et al. (2023). "Neural signal propagation atlas of Caenorhabditis elegans." Nature 623:406-414.
  2. Yemini E et al. (2013). "A database of Caenorhabditis elegans behavioral phenotypes." Nature Methods 10:877-879.
  3. Thomas JH (1990). "Genetic analysis of defecation in Caenorhabditis elegans." Genetics 124:855-872.
  4. Raizen DM, Avery L (1994). "Electrical activity and behavior in the pharynx of Caenorhabditis elegans." Neuron 12:483-495.
  5. O'Hagan R, Chalfie M, Goodman MB (2005). "The MEC-4 DEG/ENaC channel of Caenorhabditis elegans touch receptor neurons transduces mechanical signals." Nature Neurosci 8:43-50.
  6. Chalfie M et al. (1985). "The neural circuit for touch sensitivity in Caenorhabditis elegans." J Neurosci 5:956-964.
  7. Collins KM et al. (2016). "Activity of the C. elegans egg-laying behavior circuit is controlled by competing activation and feedback inhibition." eLife 5:e21126.
  8. Flavell SW, Raizen DM, You YJ (2020). "Behavioral States." Genetics 216:315-332.
  9. Goodman MB, Hall DH, Avery L, Lockery SR (1998). "Active currents regulate sensitivity and dynamic range in C. elegans neurons." Neuron 20:763-772.
  10. Pearl J, Mackenzie D (2018). The Book of Why. Basic Books. (Motivates causal/interventional validation data.)

Integration Contract

Inputs (What This Subsystem Consumes)

| Input | Source | Description |
|---|---|---|
| Published papers | PubMed / journal websites | Source material for digitization |
| wormneuroatlas API | PyPI package | Randi 2023, CeNGEN programmatic access |
| ConnectomeToolbox (cect) | PyPI package | Connectome data (not stored in DD024, but referenced) |
| Supplement files | Journal supplement pages | Raw data tables from publications |

Outputs (What This Subsystem Produces)

| Output | Consumer DD | Description |
|---|---|---|
| Electrophysiology CSVs | DD001, DD005, DD007, DD019 (Tier 1) | Patch-clamp, V-clamp, channel kinetics |
| Functional connectivity matrices | DD001, DD005, DD010 (Tier 2) | Randi 2023 302x302 .npy files |
| Behavioral kinematic data | DD001-DD003, DD007, DD009, DD018, DD019, DD010 (Tier 3) | WCON files, defecation/pumping CSVs |
| Intervention/perturbation data | DD006, DD007, DD010, DD018, DD019 (Tier 4) | Ablation, mutant, and knockout phenotype CSVs |
| Docker data volume | DD013 (Docker build) | /opt/openworm/validation/data/ tree |
| Data manifest | DD010 (validation runner) | manifest.json mapping datasets to DDs and tiers |

Coupling Dependencies

| I Depend On | DD | What Breaks If They Change |
|---|---|---|
| Neuron naming convention | DD020 (ConnectomeToolbox) | If neuron IDs change in cect, cached Randi 2023 matrix column labels may break |
| WCON format spec | DD021 | If WCON version changes, kinematics files may need re-export |
| Docker build pipeline | DD013 | If Docker stage names or paths change, data COPY step breaks |

| Depends On Me | DD | What Breaks If I Change |
|---|---|---|
| All validation tiers | DD010 | If data format or file paths change, validation scripts break |
| CI pipeline | DD013 | If Docker data volume path changes, docker compose run validate can't find data |
| All subsystem DDs | DD001-DD019 | If a dataset is removed or reformatted, the consuming DD's validation fails |

  • Approved by: Pending
  • Implementation Status: Proposed
  • Next Actions:
      • Create openworm/validation-data GitHub repository
      • Acquire Phase A datasets (7 datasets, ~18 hours)
      • Write verify_validation_data.py script
      • Integrate into DD013 Docker build
      • Announce in next board sync for contributor help with digitization tasks