DD024: Validation Data Acquisition Pipeline
- Status: Proposed (Phase A — Infrastructure)
- Author: OpenWorm Core Team
- Date: 2026-02-21
- Supersedes: None
- Related: DD010 (Validation Framework), DD008 (Data Integration Pipeline), DD020 (Connectome Data Access), DD013 (Simulation Stack), DD021 (Movement Analysis Toolbox)
Phase: Phase 1: Core Neural Platform | Layer: Data Infrastructure
Quick Action Reference
| Question | Answer |
|---|---|
| What does this produce? | Version-controlled repository of all experimental datasets needed for DD010 validation across tiers 1-4 and all subsystem DDs |
| Success metric | Every DD010 validation test can run against locally cached, versioned data without requiring external API calls at runtime |
| Repository | openworm/validation-data (new repo) — issues labeled dd024 |
| Config toggle | validation.data_path: /opt/openworm/validation/data/ in openworm.yml |
| Build & test | docker compose run shell python scripts/verify_validation_data.py — checks all datasets present, checksums match |
| Visualize | N/A (data infrastructure, not a model) |
| CI gate | Docker build fails if any required dataset is missing or has incorrect checksum |
TL;DR
Every subsystem DD (DD001-DD009, DD018-DD019) specifies validation targets that depend on published experimental data, but no DD owns the systematic acquisition, formatting, and version control of that data. This DD fills that gap. It catalogs every dataset referenced by DD010's four validation tiers, defines how each is acquired (API, supplement download, manual digitization), what format it is stored in, and where it lives in the openworm/validation-data repository. This is Phase A infrastructure — without clean, versioned validation data, no validation tier can function. This DD also serves as the canonical dataset inventory for all phases, consolidating validation, implementation, and projected datasets in one place (see Phase Roadmap for implementation timeline).
Goal & Success Criteria
| Criterion | Target |
|---|---|
| Primary: Dataset completeness | Every DD010 validation test has its required experimental data committed to openworm/validation-data |
| Secondary: Format standardization | All datasets in machine-readable formats (CSV, NumPy, WCON, NeuroML) with README metadata |
| Tertiary: Reproducibility | Checksums recorded; any contributor can verify data integrity with a single command |
| Quaternary: Licensing compliance | Every dataset has a LICENSE file documenting redistribution rights and original DOI |
Deliverables
| Artifact | Path (in `openworm/validation-data`) | Format |
|---|---|---|
| Repository with all datasets | `openworm/validation-data` | Git repo |
| Data manifest | `manifest.json` | JSON: dataset_id, source, DOI, license, checksum, format, DD_consumer |
| Verification script | `scripts/verify_validation_data.py` | Python |
| Per-dataset README | `{category}/{dataset}/README.md` | Markdown with provenance |
| Docker data volume | Baked into DD013 Docker validation stage | Directory tree at `/opt/openworm/validation/data/` |
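To make the manifest fields concrete, a single `manifest.json` entry might look like the sketch below. The field names come from the Deliverables table. The top-level `datasets` key, the placeholder values, and the `missing_fields` helper are assumptions made for illustration, not a finalized schema.

```python
import json

# Field names taken from the Deliverables table; everything else is illustrative.
REQUIRED_FIELDS = {
    "dataset_id", "source", "DOI", "license", "checksum", "format", "DD_consumer",
}


def missing_fields(entry: dict) -> set:
    """Return the required manifest fields that an entry lacks."""
    return REQUIRED_FIELDS - entry.keys()


# Hypothetical entry with placeholder values (not real provenance).
example_manifest = {
    "datasets": [
        {
            "dataset_id": "thomas1990_defecation",
            "source": "Thomas 1990, Genetics 124:855-872",
            "DOI": "<doi>",
            "license": "<license>",
            "checksum": "<sha256 hex digest>",
            "format": "CSV",
            "DD_consumer": ["DD009", "DD010"],
        }
    ]
}

# Serialized form, as it might be committed to the repository.
manifest_text = json.dumps(example_manifest, indent=2)
```

Calling `missing_fields(entry)` on each entry gives a cheap schema check that the verification script could run before any file-level checks.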
Repository & Issues
| Item | Value |
|---|---|
| Repository | openworm/validation-data (new — to be created) |
| Issue label | dd024 |
| Milestone | Phase A: Infrastructure Bootstrap |
| Branch convention | dd024/dataset-name (e.g., dd024/randi2023-functional-connectivity) |
Complete Dataset Inventory
Tier 1: Single-Cell Electrophysiology
| Dataset | Source Publication | Neurons Covered | Format Needed | Acquisition Method | Consumer DD | Priority | Phase |
|---|---|---|---|---|---|---|---|
| Touch neuron patch-clamp (V_rest, R_in, I-V) | Goodman et al. 1998, Neuron 20:763-772 | ALM, AVM, PLM (~6 neurons) | CSV: neuron, protocol, V/I traces | Digitize from paper figures or request from authors | DD001, DD005 | High | Phase 1 |
| AVA interneuron recordings | Lockery lab (unpublished / personal communication) | AVA | CSV: time, V, I | Request from Lockery lab | DD001, DD005 | High | Phase 1 |
| ASH nociceptor electrophysiology | Hilliard et al. 2002 | ASH | CSV: time, V, I, stimulus | Digitize from paper | DD005 | Medium | Phase 2+ |
| AWC olfactory neuron recordings | Chalasani et al. 2007, Nature 450:63-70 | AWC | CSV: time, V, I, odor | Digitize from paper | DD005 | Medium | Phase 2+ |
| RIA compartmentalized calcium | Hendricks et al. 2012, Nature 487:99-103 | RIA | CSV: time, Ca_proximal, Ca_distal | Supplement or digitize | DD001 (Level D) | Medium | Phase 2 |
| AWA calcium action potentials | Liu et al. 2018, Cell 175:57-70 | AWA | CSV: time, V, Ca | Supplement data | DD001 (Level D) | Medium | Phase 2+ |
| MEC-4 channel kinetics | O'Hagan et al. 2005, Nat Neurosci 8:43-50 | Touch receptor | CSV: strain, current, activation/inactivation curves | Digitize from paper | DD019 | High | Phase A |
| Pharyngeal muscle plateau potentials | Raizen & Avery 1994, Neuron 12:483-495 | pm3-pm8 | CSV: time, V (intracellular recording) | Digitize from paper figures | DD007 | Medium | Phase A |
| Electrophysiology training set (~20 neurons) | Goodman lab, Lockery lab, published papers | ~20 neuron classes | CSV: neuron_class, channel, measured_g, source_doi | Curate from multiple papers | DD005 | High | Phase 1 |
Tier 2: Functional Connectivity
| Dataset | Source Publication | Scale | Format Needed | Acquisition Method | Consumer DD | Priority | Phase |
|---|---|---|---|---|---|---|---|
| Whole-brain functional connectivity (wild-type) | Randi et al. 2023, Nature 623:406-414 | 302x302 correlation matrix | NumPy `.npy` | Already available via `wormneuroatlas` API; extract and cache locally | DD001, DD005, DD010 | Critical | Phase A |
| Whole-brain functional connectivity (unc-31 mutant) | Randi et al. 2023 (same paper, supplemental) | 302x302 | NumPy `.npy` | Via `wormneuroatlas` API (`strain="unc31"`) | DD010 (Tier 4) | High | Phase 1 |
| Signal propagation atlas | Randi et al. 2023 | Directed functional connectivity | NumPy `.npy` | Via `wormneuroatlas` API | DD001, DD005 | High | Phase 1 |
| Whole-brain activity during behavioral states | Atanas et al. 2022, bioRxiv | Time series per neuron during dwelling/roaming | HDF5 or CSV | Download from supplement / request | DD006, DD010 | Medium | Phase 2+ |
Tier 3: Behavioral Kinematics
| Dataset | Source Publication | Content | Format Needed | Acquisition Method | Consumer DD | Priority | Phase |
|---|---|---|---|---|---|---|---|
| N2 wild-type locomotion baseline | Schafer lab / Yemini et al. 2013, Nat Methods 10:877-879 | Speed, wavelength, frequency, amplitude | WCON (Worm Tracker Commons) | Download from wormbase.org/tools/tracker or Open Worm Movement Database | DD001, DD002, DD003, DD010 | Critical | Phase A |
| unc-2 (Cav2) mutant locomotion | Schafer lab | Reduced speed, altered gait | WCON | Same source as above | DD010 (Tier 4) | High | Phase 1 |
| N2 behavioral phenotype statistics | Yemini et al. 2013 | Population means, CVs for ~700 features | CSV from supplement | Download supplementary data | DD010 (±15% threshold grounding) | High | Phase A |
| Defecation cycle periods | Thomas 1990, Genetics 124:855-872 | ~50s period, posterior-to-anterior wave | CSV: animal_id, cycle_start, cycle_end, period | Digitize from Table 1 | DD009 | High | Phase A |
| Pharyngeal pumping EPG | Raizen & Avery 1994, Neuron 12:483-495 | 3-4 Hz pumping frequency, EPG waveform | CSV: time, voltage | Digitize from figures | DD007 | Medium | Phase A |
| Egg-laying bout statistics | Collins et al. 2016, eLife 5:e21126 | Inactive/active bout durations, eggs per bout | CSV from supplement | Download supplement | DD018 | Medium | Phase 3 |
| Touch response latency | Chalfie et al. 1985, J Neurosci 5:956-964 | Reversal onset 300-800 ms | CSV: stimulus_type, latency | Digitize from paper | DD019 | High | Phase A |
| Foraging behavior decomposition | Flavell et al. 2020, Genetics 216:315-332 | Dwelling/roaming state durations, transition rates | CSV: state, duration, transition_probability | Digitize from paper or request | DD006 | Medium | Phase 2 |
| Chemotaxis behavioral data | Iino & Yoshida 2009, Bargmann & Horvitz 1991 | Chemotaxis assay | CSV: chemotaxis index, trajectory data | Digitize from papers | DD022 | Medium | Phase 2 |
| Thermotaxis behavioral data | Hedgecock & Russell 1975, Mori & Ohshima 1995 | Thermotaxis assay | CSV: isothermal tracking, cultivation temp preference | Digitize from papers | DD022 | Medium | Phase 2 |
| B-class motor neuron stretch response | Wen et al. 2012 | DB, VB neurons | Calcium imaging (DB, VB response to body bending) | Extract from paper | DD023 | Medium | Phase 2 |
Tier 4: Causal / Interventional
| Dataset | Source Publication | Intervention | Expected Phenotype | Format Needed | Consumer DD | Priority | Phase |
|---|---|---|---|---|---|---|---|
| Touch neuron ablation | Chalfie et al. 1985 | Laser ablation of ALM, AVM, PLM | Loss of gentle touch response | CSV: ablated_neurons, stimulus, response | DD019, DD010 | High | Phase 2 |
| Pharyngeal neuron ablation | Avery & Horvitz 1989, Neuron 3:473-485 | Laser killing of pharyngeal neurons | Pumping persists (semi-autonomous) | CSV: ablated_neurons, pumping_frequency | DD007, DD010 | Medium | Phase 3 |
| Neuropeptide knockouts (FLP, NLP) | Li et al. 1999; Rogers et al. 2003 | Gene deletion | Altered locomotion | CSV: genotype, speed, reversal_rate | DD006, DD010 | High | Phase 2 |
| unc-103 loss-of-function | Collins & Koelle 2013, J Neurosci 33:761-775 | ERG channel removal from vm2 | Constitutive egg-laying | CSV: genotype, egg_count, bout_pattern | DD018, DD010 | Medium | Phase 3 |
| egl-1 gain-of-function | Trent et al. 1983, Genetics 104:619-647 | HSN cell death | Egg-laying defective | CSV: genotype, phenotype_class | DD018, DD010 | Medium | Phase 3 |
| Optogenetic single-neuron activation | Leifer et al. 2011, Nat Methods 8:147-152 | Light activation of specific neurons | Stimulus-specific behavioral response | CSV: neuron, stimulus, behavior | DD010 | Low | Phase 3+ |
Connectome & Molecular (Supporting)
| Dataset | Source | Content | Format Needed | Acquisition Method | Consumer DD | Status | Phase |
|---|---|---|---|---|---|---|---|
| Synaptic + gap junction connectome | Cook et al. 2019, Nature 571:63-71 | Adjacency matrices | Already in `cect` | Via ConnectomeToolbox API | DD001, DD020 | Available | Phase 0 |
| Developmental connectomes | Witvliet et al. 2021, Nature 596:257-261 | 8 animals L1-adult | Already in `cect` | Via ConnectomeToolbox API | DD001, DD004, DD020 | Available | Phase 0 |
| Neuropeptidergic connectome | Ripoll-Sanchez et al. 2023, Neuron 111:3570-3589 | 31,479 interactions | CSV + in `cect` | Supplement Table S1 + ConnectomeToolbox | DD006, DD020 | Available | Phase 2 |
| CeNGEN L4 expression | Taylor et al. 2021, Cell 184:4329-4347 | 128 classes x 20,500 genes | CSV (TPM) via `wormneuroatlas` | API or cengen.org download | DD005 | Available | Phase 1 |
| WBbt cell ontology | WormBase | 959 somatic cell IDs | OWL/OBO | WormBase download | DD004 | Available | Phase 4 |
| 3D nuclear positions | Long et al. 2009, Nat Methods 6:667-672 | 357 nuclei at L1 | CSV: cell_name, x, y, z | Supplement | DD004, DD006 | Needs acquisition | Phase 2 |
| NeuroPAL neuron ID atlas | Yemini et al. 2021, Cell 184:272-288 | Color atlas for all neurons | Reference images + ID mapping | Published with reagents | DD005 | Reference only | Reference |
| Ion channel gene list | WormBase, CeNGEN | ~100 ion channel genes | CSV: gene_symbol, channel_family, neuroml_model | Curate from WormBase | DD005 | High | Phase 1 |
Implementation & Reference Datasets
These datasets are inputs to model building (not validation). They are included here as the canonical inventory so there is a single source of truth for all datasets across all phases.
| Dataset | Source | Content | Format | Consumer DD | Status | Phase |
|---|---|---|---|---|---|---|
| CeNGEN pharyngeal/intestinal/reproductive expression | cengen.org | Cell-type-specific expression for non-neural cells | CSV (subset of L4 expression) | DD007, DD009, DD018 | Available (filter CeNGEN L4 by cell type) | Phase 3 |
| CE_locomotion stretch receptor model | openworm/CE_locomotion | C++ reference implementation (StretchReceptor.cpp) | C++ source | DD023 | Available (repo active 2026-02-18) | Phase 2 |
| BAAIWorm NMODL ion channel files and SWC morphology data | github.com/Jessie940611/BAAIWorm, Apache 2.0, Zenodo: 10.5281/zenodo.13951773 | NMODL (.mod) + SWC (.swc) for multicompartmental neurons | NMODL + SWC | DD001 Level D Stage 1 | Available (open-source) | Phase 2 |
| Virtual Worm Blender meshes | Blender2NeuroML repo (Grove & Sternberg 2012) | 688 anatomical meshes, ~1.6M vertices | .blend file | DD014.2 | Available (Virtual_Worm_February_2012.blend) | Phase 4 |
| Witvliet 2021 cell boundary meshes | Nature 596:257 supplement | 3D EM reconstructions per cell | OBJ or STL per cell | DD004 | Needs extraction/conversion from EM data | Phase 4 |
| Cell-type mechanical properties | Literature review (biomechanics) | Elasticity, adhesion per tissue type | CSV: cell_type, elasticity_mult, adhesion_strength | DD004 | Needs curation from biomechanics literature | Phase 4 |
| Ion channels with known kinetics | PDB + electrophysiology literature | ~50-100 channels with measured HH parameters | CSV: channel, structure, V_half, k, tau | DD025 | Needs curation | Phase A |
| C. elegans ion channel sequences | WormBase | All C. elegans ion channel protein sequences | FASTA | DD025 | Available | Phase A |
| SPH simulation training set | Generate from Sibernetic | 500+ runs: muscle activation to trajectory pairs | HDF5 | DD017 Component 2 (surrogate) | Generate in Phase 3 (~2,500 GPU-hours) | Phase 3 |
| Sensory neuron calcium imaging | Suzuki 2003/2008, Chalasani 2007 | Stimulus to Ca response curves | CSV | DD017 Component 4 (learned sensory) | Extract from papers | Phase 3 |
Projected Datasets (Phases 5-7)
These datasets are projected needs for future phases. They will be specified in detail when their parent DDs are written.
| Dataset | Phase | Consumer DD | Status |
|---|---|---|---|
| Biochemical rate constants (PLC activity, cAMP degradation, PKA/PKC kinetics) | Phase 5 | Intracellular Signaling (TBD) | May be predicted via BioEmu-1 + Boltz-2 |
| Protein abundance (proteomics for C. elegans neurons/muscles) | Phase 5 | Intracellular Signaling (TBD) | Needs proteomics data |
| Calcium imaging with subcellular resolution (ER, mitochondria, plasma membrane) | Phase 5 | Intracellular Signaling (TBD) | Future experimental technique |
| C. elegans GPCR-G protein coupling specificity | Phase 5 | Intracellular Signaling (TBD) | Partially available from WormBase |
| Witvliet series connectomes (8 stages) | Phase 6 | Developmental Modeling (TBD) | Available via cect WitvlietDataReader1-8 |
| CeNGEN L1 expression | Phase 6 | Developmental Modeling (TBD) | Available but less mature than L4 |
| Packer 2019 embryonic scRNA-seq | Phase 6 | Developmental Modeling (TBD) | Needs ingestion |
| Developmental behavioral data (L1-L4 locomotion, feeding) | Phase 6 | Developmental Modeling (TBD) | Needs curation |
| DevoWorm embryogenetic connectome | Phase 6 | Developmental Modeling (TBD) | Available (open-source) |
| DevoWorm differentiation trees | Phase 6 | Developmental Modeling (TBD) | Available |
| Cook 2019 male connectome | Phase 7 | Male-Specific Modeling (TBD) | Available via cect |
| Male behavioral data (mating assays) | Phase 7 | Male-Specific Modeling (TBD) | Needs curation |
| Male-specific anatomy (tail SPH model) | Phase 7 | Male-Specific Modeling (TBD) | Needs creation |
Data Format Standards
All datasets in openworm/validation-data must follow these conventions:
File Formats
| Data Type | Format | Justification |
|---|---|---|
| Time series (V, Ca, I) | CSV with header row: `time_ms, value, unit` | Universal readability, git-friendly |
| Correlation matrices | NumPy `.npy` with companion `.json` metadata | Efficient for large matrices, metadata preserves neuron ordering |
| Movement trajectories | WCON 1.0 (per DD021) | Standard format for C. elegans tracking data |
| Behavioral statistics | CSV with header: `genotype, metric, mean, std, n, source_doi` | Machine-readable, self-documenting |
| Intervention/ablation data | CSV with header: `genotype_or_ablation, stimulus, metric, value, n, source_doi` | Uniform causal validation format |
| Expression matrices | CSV (gene x cell class) or via `wormneuroatlas` API cache | Consistent with DD005 pipeline |
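The CSV header conventions in the table above lend themselves to an automated check. The following is a minimal sketch, assuming headers must match exactly and in order; the `EXPECTED_HEADERS` keys and the `header_matches` function are hypothetical names, and the sample row is made up for illustration.

```python
import csv
import io

# Header rows copied from the File Formats table; the dictionary keys are hypothetical labels.
EXPECTED_HEADERS = {
    "time_series": ["time_ms", "value", "unit"],
    "behavioral_statistics": ["genotype", "metric", "mean", "std", "n", "source_doi"],
    "intervention": [
        "genotype_or_ablation", "stimulus", "metric", "value", "n", "source_doi",
    ],
}


def header_matches(csv_text: str, data_type: str) -> bool:
    """Return True if the first CSV row equals the expected header for data_type."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    return [column.strip() for column in header] == EXPECTED_HEADERS[data_type]


# Made-up sample content, only to show the call.
sample = "time_ms,value,unit\n0.0,-65.0,mV\n"
sample_ok = header_matches(sample, "time_series")
```

An exact-match rule is the strictest choice; a real implementation might instead allow extra trailing columns if individual datasets need them.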
Metadata Requirements
Every dataset directory must contain a README.md with:
# Dataset: {name}
**Source:** {author} et al. ({year}), *{journal}* {volume}:{pages}
**DOI:** {doi}
**License:** {license or "Fair use — digitized from published figures"}
**Acquired:** {date}
**Acquired by:** {person or script}
## Description
{What this dataset contains and why it matters for validation}
## Files
| File | Description | Rows | Columns |
|------|-------------|------|---------|
| ... | ... | ... | ... |
## Provenance
{How the data was obtained: API call, supplement download, figure digitization}
{Any transformations applied: unit conversion, column renaming, neuron ID normalization}
## Consumer DDs
{Which DDs use this data and for what validation tier}
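A README that follows this template can be checked mechanically. The sketch below assumes each field label and section heading from the template appears at the start of its own line; `REQUIRED_MARKERS` and `missing_readme_markers` are hypothetical names, not part of an existing script.

```python
# Field labels and section headings taken from the README template above.
REQUIRED_MARKERS = [
    "# Dataset:",
    "**Source:**",
    "**DOI:**",
    "**License:**",
    "**Acquired:**",
    "**Acquired by:**",
    "## Description",
    "## Files",
    "## Provenance",
    "## Consumer DDs",
]


def missing_readme_markers(readme_text: str) -> list:
    """Return the template markers that do not start any line of the README."""
    lines = [line.strip() for line in readme_text.splitlines()]
    return [
        marker
        for marker in REQUIRED_MARKERS
        if not any(line.startswith(marker) for line in lines)
    ]
```

An empty return value means every template element is present; the check says nothing about whether the content after each marker is meaningful, which still needs human review.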
Checksums
A root checksums.sha256 file records the SHA-256 hash of every data file. The verification script (scripts/verify_validation_data.py) checks every hash, both during the Docker build and whenever it is run manually.
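The hash check could be implemented along the following lines. This is a sketch under one stated assumption: that `checksums.sha256` uses the `sha256sum` layout of one `<hex digest>  <relative path>` pair per line. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large matrices do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(root: Path, checksum_file: str = "checksums.sha256") -> list:
    """Return a list of problems; an empty list means every listed file matched."""
    problems = []
    for line in (root / checksum_file).read_text().splitlines():
        if not line.strip():
            continue
        parts = line.split(maxsplit=1)  # assumed "<digest>  <path>" layout
        if len(parts) != 2:
            problems.append(f"malformed line: {line.strip()}")
            continue
        expected, rel_path = parts[0], parts[1].strip().lstrip("*")
        target = root / rel_path
        if not target.is_file():
            problems.append(f"missing: {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"checksum mismatch: {rel_path}")
    return problems
```

Returning a list of problems rather than raising on the first failure lets a single run report every broken dataset, which is more useful in a CI log.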
Acquisition Priorities
Acquisition priorities align with the Phase Roadmap implementation schedule. Datasets are prioritized within each phase by blocking impact.
Phase A (Weeks 1-4): Must Have
These datasets are blocking for the two critical validation tiers (Tier 2 and Tier 3):
- Randi 2023 functional connectivity (Tier 2): Extract from the `wormneuroatlas` API, cache as `.npy`. ~2 hours.
- Schafer lab N2 baseline WCON (Tier 3): Download from Worm Tracker database or WormBase. ~4 hours (format verification).
- Yemini 2013 behavioral statistics (Tier 3 threshold grounding): Download supplement CSV. ~1 hour.
- Thomas 1990 defecation periods (DD009 Tier 3): Digitize Table 1. ~2 hours.
- Raizen 1994 pumping frequency (DD007 Tier 3): Digitize from figures. ~3 hours.
- O'Hagan 2005 MEC-4 kinetics (DD019 Tier 1): Digitize activation/inactivation curves. ~4 hours.
- Chalfie 1985 touch response (DD019 Tier 3 + Tier 4): Digitize latency data. ~2 hours.
Estimated total: ~18 hours
Phase 1 (Months 1-3): Should Have
- Goodman 1998 touch neuron electrophysiology (Tier 1): Digitize I-V curves. ~4 hours.
- Collins 2016 egg-laying statistics (DD018 Tier 3): Download supplement. ~2 hours.
- Flavell 2020 dwelling/roaming statistics (DD006 Tier 4): Digitize or request. ~3 hours.
- Randi 2023 unc-31 mutant (Tier 4): Extract from `wormneuroatlas`. ~1 hour.
- Long 2009 3D nuclear positions (DD004, DD006): Download supplement. ~2 hours.
- Schafer lab unc-2 mutant WCON (Tier 4): Download from tracker database. ~2 hours.
Phase 2+: Nice to Have
- Hendricks 2012 RIA compartmentalized calcium: For DD001 Level D validation.
- Liu 2018 AWA calcium data: For DD001 Level D validation.
- Atanas 2022 whole-brain behavioral states: For DD006 state transition validation.
- Optogenetic perturbation data (Leifer, Randi): As published datasets become available.
- Hilliard 2002 ASH, Chalasani 2007 AWC: For DD005 calibration training set expansion.
How to Build & Test
Prerequisites
- Docker with `docker compose` (DD013 simulation stack)
- OR: Python 3.10+, pip
- For data acquisition scripts: `wormneuroatlas`, `connectometoolbox` (`cect`), `pandas`, `numpy`
- For WCON validation: `tracker-commons` Python package (per DD021)
Getting Started (Environment Setup)
There are two paths: Docker (recommended for verification) and native Python (for data acquisition and curation).
Clone the repository:
git clone https://github.com/openworm/validation-data.git
cd validation-data
Path A — Docker (verification and CI):
Validation data is baked into the DD013 Docker build at the validation stage. To verify all datasets:
# From the OpenWorm meta-repo:
cd /path/to/OpenWorm
docker compose run shell python scripts/verify_validation_data.py
# Green light: all datasets present, all checksums match, all READMEs exist
The Docker build copies data from the openworm/validation-data repo into /opt/openworm/validation/data/ and runs the verification script automatically. If any dataset is missing or has an incorrect checksum, the Docker build fails.
Path B — Native Python (data acquisition and curation):
cd validation-data
# Install verification and acquisition dependencies
pip install numpy pandas scipy
# Install data source APIs
pip install wormneuroatlas # Randi 2023 functional connectivity, CeNGEN expression
pip install connectometoolbox # cect — connectome data access (DD020)
# Install WCON format tools (for behavioral kinematics data)
pip install tracker-commons # per DD021
# Verify existing datasets
python scripts/verify_validation_data.py
For acquiring new datasets, see the acquisition workflow below. Each dataset requires data files, a README.md with provenance metadata, a manifest entry, and a SHA-256 checksum.
Adding a New Dataset
# 1. Create dataset directory
mkdir -p data/tier3_behavioral/thomas1990_defecation/
# 2. Add data files
cp digitized_data.csv data/tier3_behavioral/thomas1990_defecation/defecation_periods.csv
# 3. Write README.md with provenance
# (follow template above)
# 4. Update manifest
python scripts/update_manifest.py \
--dataset thomas1990_defecation \
--doi "10.1534/genetics.124.4.855" \
--consumer-dds DD009,DD010 \
--tier 3
# 5. Update checksums
python scripts/update_checksums.py
# 6. Verify everything
python scripts/verify_validation_data.py
# Green light: all datasets present, all checksums match, all READMEs exist
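Step 4 of the workflow above calls `scripts/update_manifest.py`, which does not exist yet. One possible shape for its core logic is sketched here. The four flags mirror the command shown in step 4; the `datasets` list, the `tier` key, and the function names are assumptions, and reading or writing `manifest.json` on disk is left out.

```python
import argparse


def parse_dd_list(value: str) -> list:
    """Split a comma-separated list such as 'DD009,DD010' into clean identifiers."""
    return [dd.strip() for dd in value.split(",") if dd.strip()]


def build_parser() -> argparse.ArgumentParser:
    """Command-line interface mirroring the flags used in step 4."""
    parser = argparse.ArgumentParser(description="Add or update a manifest entry.")
    parser.add_argument("--dataset", required=True)
    parser.add_argument("--doi", required=True)
    parser.add_argument("--consumer-dds", required=True, type=parse_dd_list)
    parser.add_argument("--tier", required=True, type=int)
    return parser


def upsert_entry(manifest: dict, args: argparse.Namespace) -> dict:
    """Insert a new entry or update the existing one with the same dataset_id."""
    entries = manifest.setdefault("datasets", [])
    entry = next((e for e in entries if e.get("dataset_id") == args.dataset), None)
    if entry is None:
        entry = {"dataset_id": args.dataset}
        entries.append(entry)
    entry.update({"DOI": args.doi, "DD_consumer": args.consumer_dds, "tier": args.tier})
    return manifest
```

Keying the update on `dataset_id` makes the script idempotent, so rerunning it after a correction replaces the entry instead of duplicating it.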
Verification Script
# scripts/verify_validation_data.py
# Checks:
# 1. All datasets listed in manifest.json exist on disk
# 2. SHA-256 checksums match
# 3. Every dataset directory has a README.md
# 4. CSV files are parseable with expected columns
# 5. NumPy files have expected shapes
# 6. WCON files pass basic format validation
# Exit code 0 = all checks pass; non-zero = failure with specific error
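As a starting point for the script outlined above, checks 1 and 3 and the exit-code convention might be sketched as follows. The sketch assumes `manifest.json` holds a `datasets` list whose entries carry a `path` relative to the repository root; both of those names are hypothetical, since the manifest schema is not fixed yet.

```python
import json
import sys
from pathlib import Path


def check_manifest(root: Path) -> list:
    """Checks 1 and 3: each listed dataset directory exists and contains README.md."""
    errors = []
    manifest = json.loads((root / "manifest.json").read_text())
    for entry in manifest.get("datasets", []):
        dataset_dir = root / entry["path"]  # "path" is an assumed manifest key
        if not dataset_dir.is_dir():
            errors.append(f"{entry['dataset_id']}: directory {entry['path']} not found")
        elif not (dataset_dir / "README.md").is_file():
            errors.append(f"{entry['dataset_id']}: README.md missing")
    return errors


def main(root: str = ".") -> int:
    """Print each failure to stderr; return 0 if all checks pass, 1 otherwise."""
    errors = check_manifest(Path(root))
    for error in errors:
        print(f"FAIL {error}", file=sys.stderr)
    return 1 if errors else 0
```

A command-line entry point would wrap this as `sys.exit(main())`, so a non-zero status fails the Docker `RUN` step and therefore the build.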
Context & Background
ConnectomeToolbox (cect): DD020
cect already provides programmatic access to connectome datasets (Cook 2019, Witvliet 2021, Ripoll-Sanchez 2023). DD024 does NOT duplicate this. Instead:
- Connectome data remains in `cect` (the canonical API)
- DD024 stores validation data (experimental recordings, behavioral measurements) that `cect` does not cover
- For Randi 2023 functional connectivity, DD024 caches the output of `wormneuroatlas` API calls as static files to avoid runtime dependencies
OWMeta (DD008)
OWMeta is the semantic knowledge graph for C. elegans biological facts. DD024 complements it:
- OWMeta stores structured biological knowledge (cell types, gene functions, anatomical relationships)
- DD024 stores quantitative experimental recordings used specifically for model validation
- In Phase 3+, OWMeta may ingest DD024 data as validation-specific data types
Docker Integration (DD013)
DD024 data is baked into the Docker validation stage at build time:
# In DD013 multi-stage Dockerfile
FROM validation-base AS validation
COPY --from=openworm/validation-data:v1.0 /data /opt/openworm/validation/data/
RUN python scripts/verify_validation_data.py
Data is NOT downloaded at runtime — all validation data is pre-packaged for reproducibility and offline CI.
Alternatives Considered
1. Download All Datasets Upfront
Rejected: Multi-GB downloads slow onboarding; some datasets require API access or author permission. A phased acquisition pipeline is more practical.
2. Rely on Published Summary Statistics Only
Rejected: Raw traces are needed for correlation matrix computation (DD010 Tier 2) and waveform-level comparisons. Summary statistics are insufficient for rigorous validation.
3. Custom Data Format
Rejected: Existing community formats (NWB, WCON per DD021) maximize interoperability and reduce maintenance burden.
4. Manual Data Curation Only
Rejected: An automated pipeline with manifest.json, checksums, and verification scripts ensures reproducibility and freshness across contributors.
Quality Criteria
- Provenance metadata: All acquired datasets must have provenance metadata (source, version, download date, checksum) in their `README.md`.
- Refreshable cache: Cached data must be refreshable without breaking downstream validation scripts.
- Version pinning: Dataset versions must be pinned in `checksums.sha256` for reproducibility.
- CI compatibility: Data acquisition and verification scripts must run successfully in CI (Docker environment per DD013).
Boundaries (Out of Scope)
- Raw imaging data: We store derived/processed data (correlation matrices, extracted features, digitized traces), not raw calcium imaging volumes or EM stacks. Raw data is too large for Git and is available from the original authors.
- Proprietary or restricted data: Only openly redistributable data (CC-BY, CC0, or fair-use digitization from published figures) is included. Data requiring a DTA or institutional agreement is documented in the manifest but not stored.
- Simulation output data: DD024 stores experimental data for validation. Simulated reference outputs (baseline scores, expected trajectories) are generated by the DD010/DD013 CI pipeline, not pre-stored.
- Data analysis scripts: Scripts that use validation data (correlation computation, feature extraction) live in their respective DD repositories (DD010, DD021). DD024 only stores data and verification scripts.
Configuration
# openworm.yml section
validation:
  data_path: /opt/openworm/validation/data/  # Docker default
  data_version: "v1.0"                       # Tag in openworm/validation-data
  verify_checksums: true                     # Verify on Docker build
References
- Randi F et al. (2023). "Neural signal propagation atlas of Caenorhabditis elegans." Nature 623:406-414.
- Yemini E et al. (2013). "A database of Caenorhabditis elegans behavioral phenotypes." Nature Methods 10:877-879.
- Thomas JH (1990). "Genetic analysis of defecation in Caenorhabditis elegans." Genetics 124:855-872.
- Raizen DM, Avery L (1994). "Electrical activity and behavior in the pharynx of Caenorhabditis elegans." Neuron 12:483-495.
- O'Hagan R, Chalfie M, Goodman MB (2005). "The MEC-4 DEG/ENaC channel of Caenorhabditis elegans touch receptor neurons transduces mechanical signals." Nature Neurosci 8:43-50.
- Chalfie M et al. (1985). "The neural circuit for touch sensitivity in Caenorhabditis elegans." J Neurosci 5:956-964.
- Collins KM et al. (2016). "Activity of the C. elegans egg-laying behavior circuit is controlled by competing activation and feedback inhibition." eLife 5:e21126.
- Flavell SW, Raizen DM, You YJ (2020). "Behavioral States." Genetics 216:315-332.
- Goodman MB, Hall DH, Avery L, Lockery SR (1998). "Active currents regulate sensitivity and dynamic range in C. elegans neurons." Neuron 20:763-772.
- Pearl J, Mackenzie D (2018). The Book of Why. Basic Books. (Motivates causal/interventional validation data.)
Integration Contract
Inputs (What This Subsystem Consumes)
| Input | Source | Description |
|---|---|---|
| Published papers | PubMed / journal websites | Source material for digitization |
| `wormneuroatlas` API | PyPI package | Randi 2023, CeNGEN programmatic access |
| ConnectomeToolbox (`cect`) | PyPI package | Connectome data (not stored in DD024, but referenced) |
| Supplement files | Journal supplement pages | Raw data tables from publications |
Outputs (What This Subsystem Produces)
| Output | Consumer DD | Description |
|---|---|---|
| Electrophysiology CSVs | DD001, DD005, DD007, DD019 (Tier 1) | Patch-clamp, V-clamp, channel kinetics |
| Functional connectivity matrices | DD001, DD005, DD010 (Tier 2) | Randi 2023 302x302 .npy files |
| Behavioral kinematic data | DD001-DD003, DD007, DD009, DD018, DD019, DD010 (Tier 3) | WCON files, defecation/pumping CSVs |
| Intervention/perturbation data | DD006, DD007, DD010, DD018, DD019 (Tier 4) | Ablation, mutant, and knockout phenotype CSVs |
| Docker data volume | DD013 (Docker build) | /opt/openworm/validation/data/ tree |
| Data manifest | DD010 (validation runner) | manifest.json mapping datasets to DDs and tiers |
Coupling Dependencies
| I Depend On | DD | What Breaks If They Change |
|---|---|---|
| Neuron naming convention | DD020 (ConnectomeToolbox) | If neuron IDs change in cect, cached Randi 2023 matrix column labels may break |
| WCON format spec | DD021 | If WCON version changes, kinematics files may need re-export |
| Docker build pipeline | DD013 | If Docker stage names or paths change, data COPY step breaks |
| Depends On Me | DD | What Breaks If I Change |
|---|---|---|
| All validation tiers | DD010 | If data format or file paths change, validation scripts break |
| CI pipeline | DD013 | If Docker data volume path changes, docker compose run validate can't find data |
| All subsystem DDs | DD001-DD019 | If a dataset is removed or reformatted, the consuming DD's validation fails |
- Approved by: Pending
- Implementation Status: Proposed
- Next Actions:
  - Create `openworm/validation-data` GitHub repository
  - Acquire Phase A datasets (7 datasets, ~18 hours)
  - Write `verify_validation_data.py` script
  - Integrate into DD013 Docker build
  - Announce in next board sync for contributor help with digitization tasks