DD008: Data Integration Pipeline and OWMeta Knowledge Graph¶
- Status: Accepted (with proposed extensions)
- Author: OpenWorm Core Team
- Date: 2026-02-14
- Supersedes: None
- Related: All other DDs (data layer for entire project)
TL;DR¶
OWMeta is a semantic knowledge graph providing unified programmatic access to 15+ biological data sources (WormBase, CeNGEN, Cook connectome, Ripoll-Sanchez neuropeptides, Randi functional connectivity, etc.). All modeling code should access data through OWMeta, not by parsing raw files. Success: all downstream DDs can query data via a unified API; ID consistency across all datasets with every neuron/cell ID mapping to the WBbt ontology.
Quick Action Reference¶
| Question | Answer |
|---|---|
| Phase | Phase A |
| Layer | Data Integration — see Phase Roadmap |
| What does this produce? | Unified data access layer (OWMeta) for connectome, CeNGEN expression, cell positions, neuropeptide interactions — all via Python API |
| Success metric | All downstream DDs (DD001-DD009) can query data via OWMeta; ID consistency (all neuron/cell IDs map to WBbt ontology) |
| Repository | openworm/owmeta + openworm/owmeta-core — issues labeled dd008 |
| Config toggle | data.backend: owmeta (recommended) or data.backend: direct (legacy) in openworm.yml |
| Build & test | docker compose run shell python -c "import owmeta_core" (installs?), query 302 neurons (returns correct count?) |
| Visualize | DD014 geometry/cell_metadata.json — cell names, types, lineage for viewer tooltips and search |
| CI gate | OWMeta installation + basic query test blocks merge for data-layer changes |
Goal & Success Criteria¶
| Criterion | Target | DD010 Tier |
|---|---|---|
| Primary: ID consistency | All neuron/cell IDs map to WBbt ontology; no orphaned IDs | Tier 1 (blocking) |
| Secondary: Dataset ingestion | All Phase 1-3 datasets ingested and queryable via OWMeta | Tier 1 (blocking) |
| Tertiary: Downstream migration | c302, Sibernetic init, and validation scripts successfully migrated to OWMeta queries | Tier 2 (blocking) |
Before: Each contributor writes custom parsers for CSV/JSON files from different sources; IDs are inconsistent across datasets (Cook uses the cell name "AVAL", while WormBase uses the ontology ID "WBbt:0006748"); data versions drift.
After: Single connect("openworm_data") call provides unified access to all datasets with normalized IDs, versioned data, and provenance metadata.
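A minimal sketch of the ID normalization this implies. The alias table `ALIASES` and the helper `normalize_cell_id` are hypothetical and not part of OWMeta; the two WBbt IDs are the ones listed in this document's entity table.

```python
# Hypothetical alias table: dataset-specific labels -> canonical WBbt ID.
# The two WBbt IDs are taken from this document's entity table.
ALIASES = {
    "AVAL": "WBbt:0006748",
    "int5": "WBbt:0005193",
}


def normalize_cell_id(label: str) -> str:
    """Return the canonical WBbt ID for a dataset-specific cell label."""
    key = label.strip()
    if key.startswith("WBbt:"):
        return key  # already canonical
    try:
        return ALIASES[key]
    except KeyError:
        # An ID with no WBbt mapping is what this document calls "orphaned".
        raise KeyError(f"orphaned ID: {label!r} has no WBbt mapping") from None


print(normalize_cell_id(" AVAL "))
```

Raising on an unmapped label, rather than passing it through, is one way to make orphaned IDs fail loudly at ingestion time instead of surfacing later in downstream models.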
Deliverables¶
| Artifact | Path / Location | Format | Example |
|---|---|---|---|
| OWMeta data bundles | openworm_data bundle (baked into Docker image or downloaded at build) | RDF graph (OWMeta bundle) | connect("openworm_data") |
| Ingestion scripts per dataset | openworm/owmeta repo, per-dataset scripts | Python | ingest_ripoll_sanchez.py, ingest_witvliet.py |
| Python query API | owmeta-core + owmeta packages | Python package (pip) | pip install owmeta-core owmeta |
| Entity types | OWMeta schema | Python classes (RDF-backed) | Neuron, Muscle, Connection, Gene, Channel, Cell |
| Cell metadata for viewer | OME-Zarr: geometry/cell_metadata.json | JSON | Cell names, types, lineage, WormAtlas links |
| Configuration schema | openworm.yml data: section | YAML | data.backend: owmeta |
Repository & Issues¶
| Item | Value |
|---|---|
| Repository | openworm/owmeta + openworm/owmeta-core |
| Issue label | dd008 |
| Milestone | Phase 1-3: Data Integration |
| Branch convention | dd008/description (e.g., dd008/ingest-ripoll-sanchez) |
| Example PR title | DD008: Ingest Ripoll-Sanchez neuropeptide-receptor pairs into OWMeta |
How to Build & Test¶
Prerequisites¶
- Docker with docker compose (DD013 simulation stack)
- OR: Python 3.10+, pip
Getting Started (Environment Setup)¶
Path A — Docker (recommended):
# Cross-ref: DD013 Simulation Stack Architecture for full Docker setup
docker compose build
docker compose run shell python -c "import owmeta_core; print(owmeta_core.__version__)"
# Then skip to Step 3 below — OWMeta packages are pre-installed in the container
Cross-reference: DD013 for the containerized simulation stack.
Path B — Native:
git clone https://github.com/openworm/owmeta.git  # formerly PyOpenWorm
cd owmeta
pip install -e . # includes RDFLib, ZODB
Additionally, for connectome data access:
pip install cect  # ConnectomeToolbox, for connectome data access (see DD020)
Step-by-step¶
# Step 1: Install OWMeta packages
pip install owmeta-core owmeta
# Step 2: Verify import works
python -c "import owmeta_core; print(owmeta_core.__version__)"
# Step 3: Verify data queries work
python -c "
from owmeta_core import connect
from owmeta.neuron import Neuron
conn = connect('openworm_data')
# Verify connectome query returns 302 neurons
neurons = list(conn.query(Neuron)())
assert len(neurons) >= 302, f'Expected 302+ neurons, got {len(neurons)}'
print(f'Connectome loaded: {len(neurons)} neurons')
"
# Step 4: Docker-based verification
docker compose run shell python -c "import owmeta_core; print(owmeta_core.__version__)"
# Step 5: Backward compatibility (direct backend)
# When data.backend: "direct", OWMeta is not required
# Modeling code falls back to direct file access
docker compose run quick-test # with data.backend: "direct"
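The backend fallback in Step 5 can be sketched as a simple dispatch on data.backend. This is a minimal illustration under stated assumptions: `load_neuron_names`, `LOADERS`, and both loader functions are hypothetical names, and the returned list is placeholder data rather than a real connectome.

```python
from typing import Callable, Dict, List


def _load_via_owmeta() -> List[str]:
    # Hypothetical: a real implementation would query the OWMeta bundle here.
    raise ImportError("owmeta is not available in this sketch")


def _load_via_direct() -> List[str]:
    # Hypothetical: a real implementation would read the legacy data files.
    return ["AVAL", "AVAR"]  # placeholder data only


LOADERS: Dict[str, Callable[[], List[str]]] = {
    "owmeta": _load_via_owmeta,
    "direct": _load_via_direct,
}


def load_neuron_names(config: dict) -> List[str]:
    """Pick a data loader from the `data.backend` key of a parsed openworm.yml."""
    backend = (config.get("data") or {}).get("backend", "owmeta")
    if backend not in LOADERS:
        raise ValueError(f"unknown data.backend: {backend!r}")
    return LOADERS[backend]()


print(load_neuron_names({"data": {"backend": "direct"}}))
```

Keeping the OWMeta import inside the owmeta loader, rather than at module level, is what would let the direct backend run on a machine without OWMeta installed.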
Scripts that don't exist yet¶
| Script | Status | Tracking |
|---|---|---|
| ingest_ripoll_sanchez.py | [TO BE CREATED] | openworm/owmeta#TBD |
| ingest_witvliet.py | [TO BE CREATED] | openworm/owmeta#TBD |
| ingest_randi.py | [TO BE CREATED] | openworm/owmeta#TBD |
Green light criteria¶
- import owmeta_core succeeds in Docker
- Query returns >= 302 neurons
- docker compose run quick-test passes with data.backend: "direct" (backward compatibility)
How to Visualize¶
DD014 viewer layer: geometry/cell_metadata.json for tooltips, search, and cell identification.
| Viewer Feature | Specification |
|---|---|
| Layer | geometry/ (cell metadata overlay) |
| Data source | OME-Zarr: geometry/cell_metadata.json — cell names, types, lineage, WormAtlas links |
| What you should SEE | Clicking any cell in the 3D viewer shows its WBbt ID, cell type, lineage, and links to WormAtlas. Search by cell name returns the correct 3D position. All 302 neurons + non-neural cells are labeled and searchable. |
| Color mapping | Cell type color coding: neurons (by class), muscles (by quadrant), intestinal cells, hypodermal cells |
Technical Approach¶
Use OWMeta as the Canonical Data Access Layer¶
All modeling code (c302, Sibernetic initialization, validation scripts) MUST access biological data through OWMeta, not by directly parsing raw files.
Rationale:
- Single source of truth: WormBase IDs, cell names, gene symbols are normalized
- Versioned: OWMeta tracks dataset versions (e.g., WS298, CeNGEN v1.0)
- Queryable: Semantic queries like "Get all neurons in the nerve ring expressing unc-2" are one-liners
- Extensible: New datasets (Ripoll-Sanchez neuropeptides, Witvliet development) can be added without modifying downstream code
Reconciliation with DD020 (Connectome Data Access)¶
OWMeta and cect (DD020) serve complementary purposes:
- Phase 1-2 (current): Use cect directly for connectome data (DD020). OWMeta is optional for semantic queries.
- Phase 3+ (future): OWMeta wraps cect internally. Consuming DDs can use either API.
Contributors should follow DD020 for connectome-specific data access and use OWMeta when broader semantic queries across multiple data types are needed.
OWMeta Entity Types¶
| Entity | Properties | Example |
|---|---|---|
| Neuron | name, WormBase ID, type (sensory/inter/motor), position | AVAL (WBbt:0006748) |
| Muscle | name, quadrant, row, innervation | MDR05 |
| Connection | pre, post, type (syn/gap/peptide), weight | AVAL → AVAR (gap) |
| Gene | symbol, WormBase ID, expression by cell | unc-2 (WBGene00006765) |
| Channel | gene, type (Kv/Cav/Cl), NeuroML model | unc-2 → ca_boyle_chan |
| Cell | WBbt ID, lineage, anatomy, neighbors | int5 (WBbt:0005193) |
Example Queries¶
Get all neurons expressing unc-2:
from owmeta_core import connect
from owmeta.neuron import Neuron
conn = connect("openworm_data")
neurons = conn.query(Neuron)().get_neuron_type_by_expression("unc-2")
for n in neurons:
print(f"{n.name()}: {n.expression_level('unc-2')} TPM")
Get connectome for AVAL:
aval = Neuron(name="AVAL")
connections = aval.connection.get()
for c in connections:
print(f"{c.pre_cell()} -> {c.post_cell()}: {c.connection_type()} weight={c.weight()}")
Get 3D position for all intestinal cells:
from owmeta.cell import Cell
intestine_cells = Cell.query(lineage_contains="E")  # E lineage = intestine
for cell in intestine_cells:
pos = cell.position_3d() # From WormAtlas or Witvliet EM
print(f"{cell.name()}: {pos}")
Data Ingestion Priority¶
| Dataset | OWMeta Status | Priority | Action |
|---|---|---|---|
| Cook 2019 connectome | Integrated | -- | Maintain |
| WormBase WS298 | Integrated | -- | Maintain (archival) |
| WormAtlas anatomy | Partial | High | Complete integration |
| CeNGEN L4 expression | Integrated | -- | Maintain |
| CeNGEN L1 expression | Not yet | Medium | Add in Phase 1 |
| Witvliet 2021 dev. connectomes | Not yet | High | Add for Phase 6 (development) |
| Ripoll-Sanchez neuropeptides | Not yet | High | Add for Phase 2 (DD006) |
| Randi 2023 functional connectivity | Not yet | High | Add for validation |
| Packer 2019 embryonic scRNA-seq | Not yet | Medium | Add for Phase 6 |
| Ben-David 2021 eQTLs | Not yet | Low | Phase 6+ |
OWMeta Update Process (For Contributors)¶
DO NOT directly edit OWMeta. Updates go through a review process:
- Propose a dataset addition via GitHub issue on the OWMeta repository
- Provide data source: DOI, URL, file format, license
- Provide mapping: How identifiers in the new dataset map to WormBase IDs or WBbt ontology
- Write an ingestion script following OWMeta patterns
- Submit PR with ingestion script + documentation + example queries
- Maintainer review: OWMeta maintainers check for ID conflicts, data quality, schema consistency
- Merge and version: New dataset becomes available in next OWMeta release
Do not:
- Parse raw CSV/JSON files directly in modeling code
- Hardcode cell IDs or gene names
- Duplicate data across repositories
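The mapping and ingestion steps above can be sketched as follows. This is not an existing OWMeta ingestion script: `ingest_rows`, the CSV column names, the sample rows, and the placeholder DOI are all assumptions made for illustration, and the output is plain tuples rather than an RDF graph.

```python
import csv
import io
from typing import Dict, Iterable, List, Tuple

# (subject, predicate, object, source DOI); a stand-in for RDF statements.
Quad = Tuple[str, str, str, str]


def ingest_rows(
    rows: Iterable[Dict[str, str]],
    id_map: Dict[str, str],
    source_doi: str,
) -> Tuple[List[Quad], List[str]]:
    """Map each row's cell label to a WBbt ID and collect unmapped labels."""
    quads: List[Quad] = []
    orphans: List[str] = []
    for row in rows:
        wbbt_id = id_map.get(row["cell"])
        if wbbt_id is None:
            orphans.append(row["cell"])  # reported for maintainer review
            continue
        quads.append((wbbt_id, "expresses", row["gene"], source_doi))
    return quads, orphans


# Placeholder input: column names, rows, and DOI are illustrative only.
SAMPLE_CSV = "cell,gene\nAVAL,unc-2\nXYZ,unc-2\n"
rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
triples, orphans = ingest_rows(rows, {"AVAL": "WBbt:0006748"}, "doi:10.0000/example")
print(triples)
print(orphans)
```

Returning the orphan list alongside the statements, instead of silently dropping rows, gives the maintainer review step a concrete artifact to check for ID conflicts.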
Alternatives Considered¶
1. Direct File Parsing (No OWMeta)¶
Rejected: Every contributor writing their own CSV parser leads to:
- ID mismatches (Cook uses the cell name "AVAL", while WormBase uses the ontology ID "WBbt:0006748")
- Version drift (contributor uses old WormBase release)
- Code duplication
2. SQL Database Instead of Semantic Graph¶
Rejected: Semantic RDF graphs better capture biological relationships (is-a, part-of, expressed-in, connected-to) than rigid SQL schemas. OWMeta uses RDF + SPARQL.
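To illustrate why relationship-heavy data suits a triple representation, here is a small sketch that follows is-a links transitively over plain Python tuples. It does not use RDFLib or SPARQL; the `TRIPLES` list and the `ancestors` helper are hypothetical and the triples are illustrative, not actual OWMeta content.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Illustrative triples only; not real OWMeta content.
TRIPLES: List[Triple] = [
    ("AVAL", "is-a", "interneuron"),
    ("interneuron", "is-a", "neuron"),
    ("neuron", "is-a", "cell"),
    ("AVAL", "connected-to", "AVAR"),
]


def ancestors(node: str, predicate: str, triples: List[Triple]) -> List[str]:
    """Follow `predicate` edges transitively, starting from `node`."""
    found: List[str] = []
    frontier = [node]
    while frontier:
        current = frontier.pop()
        for subject, pred, obj in triples:
            if subject == current and pred == predicate and obj not in found:
                found.append(obj)
                frontier.append(obj)
    return found


print(ancestors("AVAL", "is-a", TRIPLES))
```

Adding a new relationship type here means appending triples with a new predicate; no schema migration is needed, which is the flexibility the rationale above appeals to.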
3. Use WormBase API Directly¶
Partial use: WormBase REST API is a data source for OWMeta. But WormBase lacks:
- Connectome data (ConnectomeToolbox)
- Single-cell expression (CeNGEN)
- 3D positions (WormAtlas)
OWMeta aggregates WormBase + all other sources.
Quality Criteria¶
- All Data Versioned: OWMeta must track the version of every source dataset (e.g., "WormBase WS298," "CeNGEN v1.0 L4").
- ID Consistency: All neuron/cell identifiers must map to WBbt ontology. No orphaned IDs.
- Provenance Metadata: Every datum includes source DOI, access date, confidence level (experimental/inferred).
- API Stability: OWMeta query syntax must remain backward-compatible across updates. Deprecate gracefully.
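The versioning and provenance criteria could be captured in a small record type like the one below. This is a sketch under stated assumptions: the `Provenance` class and its field names are hypothetical and do not reflect the actual OWMeta schema, and the DOI is a placeholder.

```python
from dataclasses import dataclass
from datetime import date

# The two confidence levels named in the criteria above.
CONFIDENCE_LEVELS = {"experimental", "inferred"}


@dataclass(frozen=True)
class Provenance:
    """Hypothetical provenance record attached to each ingested datum."""

    source_doi: str
    dataset_version: str
    access_date: date
    confidence: str

    def __post_init__(self) -> None:
        if not self.source_doi:
            raise ValueError("source_doi is required")
        if not self.dataset_version:
            raise ValueError("dataset_version is required")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"confidence must be one of {sorted(CONFIDENCE_LEVELS)}")


# Placeholder DOI; the version string is an example from this document.
record = Provenance("doi:10.0000/example", "WS298", date(2026, 2, 14), "experimental")
print(record)
```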
Boundaries (Explicitly Out of Scope)¶
What This Design Document Does NOT Cover:¶
- Raw experimental data storage: OWMeta stores curated metadata and normalized identifiers, not raw imaging files, electrophysiology recordings, or EM micrographs. Raw data remains at its source (WormBase, CeNGEN portal, journal supplements).
- Real-time data ingestion: OWMeta is a batch pipeline. Datasets are ingested via scripts, reviewed, and released as versioned bundles. There is no streaming or live-update pathway.
- Data visualization: Rendering and interactive exploration of data are handled by DD014. OWMeta produces the data; DD014 displays it.
- Connectome graph algorithms: Graph analysis, bilateral symmetry metrics, and network topology computations are handled by ConnectomeToolbox (cect) per DD020. OWMeta provides semantic queries across data types; cect provides direct connectome-specific analysis.
Context & Background¶
OpenWorm integrates data from 15+ sources: WormBase, WormAtlas, CeNGEN, Cook connectome, Witvliet developmental data, Ripoll-Sanchez neuropeptides, Randi functional connectivity, Schafer kinematics, and more. These datasets use different formats, identifiers, and coordinate systems.
The challenge: A contributor implementing DD005 (cell-type specialization) must pull CeNGEN expression, map neuron IDs to CeNGEN classes, extract channel genes, and generate NeuroML. Without a unified data layer, this requires writing custom parsers for each dataset.
The solution: OWMeta (openworm.org/OWMeta) — a semantic knowledge graph providing unified programmatic access to all OpenWorm-relevant biological data.
References¶
- OWMeta Documentation: https://pypi.org/project/owmeta-core/
- ConnectomeToolbox: openworm.org/ConnectomeToolbox
- WormBase REST API: https://wormbase.org/about/userguide/for_developers
Integration Contract¶
Inputs / Outputs¶
Inputs (What This Subsystem Consumes)
| Input | Source | Variable | Format | Notes |
|---|---|---|---|---|
| Cook 2019 connectome | wormwiring.org | Neuron adjacency + weights | CSV/Excel → RDF ingestion | Already integrated |
| CeNGEN L4 scRNA-seq | cengen.org | Per-neuron-class TPM values | CSV → RDF ingestion | Already integrated |
| WormAtlas anatomy | wormatlas.org | Cell positions, morphology, EM images | HTML/images → RDF ingestion | Partial |
| Ripoll-Sanchez neuropeptides | Neuron 111:3570 supplement | Peptide-receptor pairs + expression | CSV → RDF ingestion | Not yet ingested into OWMeta (needed for DD006). Note: this data is already available in ConnectomeToolbox (cect) per DD006; the "not yet ingested" status refers specifically to OWMeta's RDF graph. |
| Randi 2023 functional connectivity | Nature 623:406 supplement | 302×302 correlation matrix | NumPy .npy → RDF metadata only | Not yet ingested (needed for DD010) |
| Witvliet 2021 dev. connectomes | Nature 596:257 | Multi-stage connectomes (L1, L4, adult) | CSV → RDF ingestion | Not yet ingested (needed for Phase 6) |
Outputs (What This Subsystem Produces)
| Output | Consumer DD | Variable | Format | Units |
|---|---|---|---|---|
| Neuron adjacency (connectome) | DD001 | Synapse pairs + weights | OWMeta query → Python objects | synapse count |
| Per-class gene expression | DD005 | TPM per gene per neuron class | OWMeta query → DataFrame | TPM |
| Neuropeptide-receptor pairs | DD006 | Peptide ligand → receptor → expressing cells | OWMeta query → edge list | binary (expressed/not) |
| Cell positions (3D) | DD004 | Per-cell x, y, z coordinates | OWMeta query → NumPy array | µm |
| Cell ontology IDs | DD004 | Cell name → WBbt ID mapping | OWMeta query → dict | identifiers |
| Cell metadata (for viewer) | DD014 (visualization) | Cell names, types, lineage, WormAtlas links | OME-Zarr: geometry/cell_metadata.json | mixed |
Repository & Packaging¶
- Repository: openworm/owmeta + openworm/owmeta-core
- Docker stage: data in multi-stage Dockerfile (new stage)
- versions.lock key: owmeta, owmeta_core
- Build dependencies: pip install owmeta-core owmeta
- Data bundle: OWMeta data bundle must be downloaded or baked into the Docker image at build time
# versions.lock
owmeta_core:
  pypi_version: "0.14.x"  # Pin to specific minor version
owmeta:
  repo: "https://github.com/openworm/owmeta.git"
  commit: "TBD"  # Must be updated when OWMeta is revived
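A CI step could check an installed version string against a pin such as "0.14.x". The sketch below assumes that "x" means "any value in this position"; that reading of the pin syntax, and the helper name `matches_pin`, are assumptions rather than something versions.lock defines.

```python
def matches_pin(version: str, pin: str) -> bool:
    """Return True if `version` satisfies a pin like "0.14.x" (x = any value)."""
    version_parts = version.split(".")
    pin_parts = pin.split(".")
    if len(version_parts) < len(pin_parts):
        return False  # e.g. "0.14" does not satisfy a three-part pin
    # zip stops at the shorter sequence, so extra version components are ignored.
    return all(p == "x" or p == v for p, v in zip(pin_parts, version_parts))


print(matches_pin("0.14.3", "0.14.x"))
print(matches_pin("0.15.0", "0.14.x"))
```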
Configuration¶
openworm.yml Section:
data:
  backend: owmeta  # "owmeta" (recommended) or "direct" (legacy file access)
  owmeta_bundle: "openworm_data"  # OWMeta data bundle name
  connectome_dataset: "Cook2019"  # Cook2019, Witvliet2021, Varshney2011
  cengen_version: "L4_v1.0"  # CeNGEN dataset version
| Key | Default | Valid Range | Description |
|---|---|---|---|
| data.backend | owmeta | owmeta / direct | Data access backend; direct is legacy fallback |
| data.owmeta_bundle | "openworm_data" | String | OWMeta data bundle name |
| data.connectome_dataset | "Cook2019" | Cook2019, Witvliet2021, Varshney2011 | Which connectome dataset to use |
| data.cengen_version | "L4_v1.0" | String | CeNGEN dataset version pin |
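The defaults and valid ranges in the table can be enforced with a small validator over the already-parsed YAML. This is a sketch, not the project's loader: `resolve_data_config`, `DEFAULTS`, and `ALLOWED` are hypothetical names, and the input is assumed to be a plain dict produced by whatever YAML parser the stack uses.

```python
# Defaults and allowed values copied from the configuration table.
DEFAULTS = {
    "backend": "owmeta",
    "owmeta_bundle": "openworm_data",
    "connectome_dataset": "Cook2019",
    "cengen_version": "L4_v1.0",
}
ALLOWED = {
    "backend": {"owmeta", "direct"},
    "connectome_dataset": {"Cook2019", "Witvliet2021", "Varshney2011"},
}


def resolve_data_config(raw: dict) -> dict:
    """Merge the `data:` section with defaults and reject invalid values."""
    section = {**DEFAULTS, **(raw.get("data") or {})}
    unknown = set(section) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown data keys: {sorted(unknown)}")
    for key, allowed in ALLOWED.items():
        if section[key] not in allowed:
            raise ValueError(f"data.{key} must be one of {sorted(allowed)}")
    return section


print(resolve_data_config({"data": {"backend": "direct"}}))
```

Rejecting unknown keys catches typos such as a misspelled data.backend early, before a model silently runs against the default backend.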
How to Test (Contributor Workflow)¶
# Per-PR quick test
docker compose run shell python -c "import owmeta_core; print(owmeta_core.__version__)"
# Check: import succeeds, version prints
# Data query test
docker compose run shell python -c "
from owmeta_core import connect
from owmeta.neuron import Neuron
conn = connect('openworm_data')
neurons = list(conn.query(Neuron)())
assert len(neurons) >= 302, f'Expected 302+ neurons, got {len(neurons)}'
print(f'Connectome loaded: {len(neurons)} neurons')
"
# Check: 302+ neurons returned
# Backward compatibility test
docker compose run quick-test # with data.backend: "direct"
# Check: simulation completes without OWMeta installed
Per-PR checklist:
- [ ] import owmeta_core succeeds in Docker
- [ ] Neuron count query returns >= 302
- [ ] quick-test passes with data.backend: "direct" (backward compatibility)
- [ ] New ingestion scripts include source DOI, version, and ID mapping documentation
- [ ] No orphaned IDs (all IDs map to WBbt ontology)
How to Visualize (DD014 Connection)¶
| OME-Zarr Group | Viewer Layer | Color Mapping |
|---|---|---|
| geometry/cell_metadata.json | Cell metadata overlay | Cell type color coding: neurons (by class), muscles (by quadrant), intestinal, hypodermal |
Reality Check: Phased OWMeta Mandate¶
OWMeta is dormant (last real commit Jul 2024, owmeta-core last updated Mar 2025). The mandate "all code MUST use OWMeta" cannot be enforced immediately. Phased approach:
- Phase 1: OWMeta is optional. Direct file access is acceptable with documented data provenance (source DOI, version, access date).
- Phase 2: OWMeta is recommended. New code should use OWMeta where possible. Migration scripts provided for existing direct-access code.
- Phase 3+: OWMeta is required. All modeling code accesses data through OWMeta. Direct file parsing is prohibited.
Trigger for Phase 2→3 transition: OWMeta is installable on Python 3.12, all Phase 1-2 datasets are ingested, and at least 3 downstream consumers (c302, Sibernetic init, validation) have been successfully migrated.
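The transition trigger is a conjunction of three conditions, which can be written down as an explicit check. The `MigrationStatus` record and `ready_for_phase_3` function below are hypothetical; only the three conditions and the threshold of three consumers come from the text above.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class MigrationStatus:
    """Hypothetical snapshot of the facts the Phase 2→3 trigger depends on."""

    installs_on_python_312: bool
    datasets_required: Set[str] = field(default_factory=set)
    datasets_ingested: Set[str] = field(default_factory=set)
    migrated_consumers: Set[str] = field(default_factory=set)


def ready_for_phase_3(status: MigrationStatus, min_consumers: int = 3) -> bool:
    """All three conditions must hold before OWMeta becomes mandatory."""
    return (
        status.installs_on_python_312
        and status.datasets_required <= status.datasets_ingested  # subset test
        and len(status.migrated_consumers) >= min_consumers
    )


status = MigrationStatus(
    installs_on_python_312=True,
    datasets_required={"CeNGEN", "Randi 2023"},
    datasets_ingested={"CeNGEN"},
    migrated_consumers={"c302", "Sibernetic init", "validation"},
)
print(ready_for_phase_3(status))
```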
Reconciliation with DD020 (Connectome Data Access Policy)¶
DD020 specifies ConnectomeToolbox (cect, PyPI v0.2.7) as the canonical API for connectome data access. OWMeta and cect serve complementary purposes and should coexist:
| Aspect | cect (DD020) | OWMeta (DD008) |
|---|---|---|
| Purpose | Direct connectome data access | Semantic knowledge graph (multi-modal) |
| Architecture | Direct Python API | RDF semantic graph |
| Query style | get_instance() → ConnectomeDataset | connect("openworm_data") → SPARQL-like |
| Data scope | Connectome topology only (30+ datasets) | Connectome + CeNGEN + WormAtlas + lineage + anatomy |
| Maintainer | Active maintainer (commits within days) | OWMeta team (dormant since Jul 2024) |
| Current status | v0.2.7, preprint pending | Working but under-maintained |
| Best for | Direct adjacency matrix access, visualization, cross-dataset comparison | Unified multi-modal biological queries, provenance tracking |
Current recommendation (Phase 1-2): Use cect directly for all connectome queries (see DD020 API contract). This is the actively maintained, stable tool with 30+ dataset readers.
Future integration (Phase 3+): When OWMeta becomes active again and ingests all Phase 1-2 datasets (CeNGEN, Randi 2023, Ripoll-Sanchez, Wang 2024), it should call cect internally as its connectome data provider. Consuming DDs can then use either cect (direct, fast) or OWMeta (semantic, provenance-tracked) depending on their needs.
Action for OWMeta revival: Add a cect ingestion adapter so OWMeta wraps cect readers rather than duplicating connectome parsing logic.
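One possible shape for such an adapter is sketched below. It deliberately does not call cect: the `ConnectomeReader` protocol, its `edges()` method, and `FakeReader` are hypothetical stand-ins for whatever interface the real cect readers expose, and the output is plain tuples rather than an RDF graph.

```python
from typing import Iterable, List, Protocol, Tuple

Edge = Tuple[str, str, str, float]  # pre, post, connection type, weight
Triple = Tuple[str, str, str]


class ConnectomeReader(Protocol):
    """Hypothetical reader interface; not the actual cect API."""

    def edges(self) -> Iterable[Edge]: ...


class FakeReader:
    """Stand-in reader with one placeholder edge for demonstration."""

    def edges(self) -> Iterable[Edge]:
        return [("AVAL", "AVAR", "gap", 1.0)]  # placeholder weight


def reader_to_triples(reader: ConnectomeReader, dataset: str) -> List[Triple]:
    """Turn each edge into a connection node with pre/post/type/weight properties."""
    triples: List[Triple] = []
    for index, (pre, post, kind, weight) in enumerate(reader.edges()):
        node = f"conn:{dataset}:{index}"
        triples.extend(
            [
                (node, "pre", pre),
                (node, "post", post),
                (node, "type", kind),
                (node, "weight", str(weight)),
            ]
        )
    return triples


for triple in reader_to_triples(FakeReader(), "Cook2019"):
    print(triple)
```

Because the adapter depends only on the reader interface, connectome parsing would stay in one place and OWMeta would add the semantic layer on top.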
Coupling Dependencies¶
| I Depend On | DD | What Breaks If They Change |
|---|---|---|
| WormBase releases | External | New WormBase releases may change gene IDs or annotations |
| CeNGEN updates | External | New expression data may change downstream conductances |
| ConnectomeToolbox | DD020 | If connectome representation changes, OWMeta ingestion scripts must update |
| Depends On Me | DD | What Breaks If I Change |
|---|---|---|
| Neural circuit (connectome queries) | DD001 | If neuron adjacency format or ID scheme changes, c302 network generation breaks |
| Cell-type specialization (expression data) | DD005 | If CeNGEN query format changes, conductance pipeline breaks |
| Neuropeptides (peptide-receptor data) | DD006 | If peptide interaction data format changes, neuropeptide layer breaks |
| Mechanical cell identity (cell positions) | DD004 | If cell position queries change, particle tagging breaks |
| Validation (experimental data metadata) | DD010 | If data provenance metadata changes, validation data versioning breaks |
- Approved by: OpenWorm Steering
- Implementation Status: Partial (core OWMeta exists, extensions proposed)
- Next Actions:
  - Ingest Ripoll-Sanchez neuropeptides
  - Ingest Witvliet developmental connectomes
  - Ingest Randi functional connectivity
  - Document canonical query patterns for common tasks