DD025: Protein Foundation Model Pipeline for Ion Channel Kinetics
- Status: Proposed (Phase A / Phase 1)
- Author: OpenWorm Core Team
- Date: 2026-02-22
- Supersedes: None (extracted from DD017 Component 3)
- Related: DD001 (Neural Circuit), DD005 (Cell-Type Specialization), DD017 (Hybrid ML Framework), DD010 (Validation Framework)
Phase: Phase A (cross-validation) / Phase 1 (integration) | Layer: ML/Structural Biology
TL;DR
Predict ion channel kinetics (HH parameters: V_half, slope, tau) from amino acid sequences using protein foundation models (AlphaFold 3, BioEmu-1, ESM Cambrian). This expands the calibration set for DD005 from ~20 neurons (limited by patch-clamp electrophysiology) toward all 128 neuron classes (limited only by sequence availability). Cross-validation against known channels begins in Phase A; predictions feed into DD005 calibration as structure-informed priors in Phase 1.
Goal & Success Criteria
Goal: Build a computational pipeline that predicts Hodgkin-Huxley kinetic parameters for C. elegans ion channels from protein sequence, validated against channels with experimentally measured kinetics.
| Criterion | Target | Phase | DD010 Tier |
|---|---|---|---|
| Primary: Cross-validation on known channels | < 30% relative error on HH parameters (V_half, slope, tau) | Phase A | Tier 1 (non-blocking) |
| Secondary: End-to-end simulation improvement | Predicted parameters inserted into simulation do not degrade DD010 Tier 2 or Tier 3 scores below acceptance thresholds | Phase 1 | Tier 2/3 (blocking) |
| Tertiary: Coverage expansion | Predictions available for ≥80% of ion channel genes expressed in CeNGEN | Phase 1 | Non-blocking |
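The coverage criterion reduces to a set computation. The following is a minimal sketch, assuming gene names serve as identifiers in both the CeNGEN-expressed channel list and the predictions file. The function name and the example gene lists are illustrative and not part of the pipeline.

```python
def coverage_fraction(expressed_genes, predicted_genes):
    """Fraction of expressed ion channel genes that have a prediction."""
    expressed = set(expressed_genes)
    if not expressed:
        raise ValueError("no expressed ion channel genes supplied")
    return len(expressed & set(predicted_genes)) / len(expressed)


# Illustrative gene lists only; real inputs would come from CeNGEN and
# channel_kinetics_predictions.csv.
expressed = ["egl-36", "kvs-1", "shk-1", "egl-19", "unc-2"]
predicted = ["egl-36", "kvs-1", "shk-1", "egl-19"]

fraction = coverage_fraction(expressed, predicted)
meets_target = fraction >= 0.80  # tertiary criterion: at least 80% coverage
```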
Before: DD005 calibrates expression→conductance using ~20 neurons with patch-clamp data. Remaining 108 classes extrapolate from this small training set.
After: Structure-based kinetics predictions available for most C. elegans ion channels. DD005 uses these as calibration priors where electrophysiology is unavailable.
Deliverables
| Artifact | Path | Format | Phase |
|---|---|---|---|
| Channel kinetics predictions | foundation_params/output/channel_kinetics_predictions.csv | CSV (channel, V_half_m, k_m, tau_m, V_half_h, k_h, tau_h, g_max_scale, E_rev, confidence) | Phase A |
| Cross-validation report | foundation_params/output/cross_validation_report.json | JSON (per-channel predicted vs. measured, error metrics) | Phase A |
| Foundation model inference scripts | foundation_params/scripts/ | Python | Phase A |
| Per-neuron-class HH parameters | foundation_params/output/per_class_hh_params.csv | CSV (128 neuron classes × channel parameters) | Phase 1 |
| Integration adapter for DD005 | foundation_params/scripts/generate_dd005_priors.py | Python | Phase 1 |
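A consumer of the predictions file can check the schema before use. The sketch below is a hedged illustration using only the column names listed in the deliverables table. Treating every column except `channel` as a float (including `confidence`) is an assumption, and `read_predictions` is a hypothetical helper rather than an existing script.

```python
import csv
import io

# Column names taken from the deliverables table.
EXPECTED_COLUMNS = [
    "channel", "V_half_m", "k_m", "tau_m",
    "V_half_h", "k_h", "tau_h", "g_max_scale", "E_rev", "confidence",
]


def read_predictions(handle):
    """Parse the predictions CSV into a list of dicts with float-valued parameters."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    rows = []
    for row in reader:
        parsed = {"channel": row["channel"]}
        for name in EXPECTED_COLUMNS[1:]:
            parsed[name] = float(row[name])  # assumption: all parameters are numeric
        rows.append(parsed)
    return rows


# Placeholder values for illustration only, not real predictions.
sample = io.StringIO(
    "channel,V_half_m,k_m,tau_m,V_half_h,k_h,tau_h,g_max_scale,E_rev,confidence\n"
    "example-channel,-20.0,5.0,10.0,-60.0,-6.0,50.0,1.0,-80.0,0.5\n"
)
rows = read_predictions(sample)
```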
Repository & Issues
| Item | Value |
|---|---|
| Repository | openworm/openworm-ml (new repo, shared with DD017) [TO BE CREATED] |
| Subdirectory | foundation_params/ |
| Issue label | dd025 |
| Milestone | Phase A — Foundation Model Channel Kinetics |
| Example PR title | DD025: cross-validation of BioEmu-1 kinetics predictions on 50 channels |
Quick Action Reference
| Question | Answer |
|---|---|
| Phase | Phase A (cross-validation), Phase 1 (integration) |
| Layer | ML/Structural Biology — parallel track derisking DD005 |
| What does this produce? | Predicted HH kinetic parameters for C. elegans ion channels from protein sequence + structure |
| Success metric | Cross-validation <30% relative error on known channels; end-to-end DD010 Tier 2 scores not degraded |
| Repository | openworm/openworm-ml/foundation_params/ — issues labeled dd025 |
| Config toggle | ml.foundation_params: true in openworm.yml |
| Build & test | python foundation_params/scripts/run_cross_validation.py |
How to Build & Test
Prerequisites
- Python 3.10+, PyTorch, ESM library, internet access for model downloads
- GPU recommended: CUDA (Linux) or MPS (macOS) for foundation model inference
Getting Started (Environment Setup)
Path A — Docker (recommended):
# From the OpenWorm meta-repo (see DD013 Simulation Stack Architecture)
docker compose build ml
# Then continue with the step-by-step section below, skipping the install step; dependencies are pre-installed in the container
Cross-reference: DD013 for the full Docker Compose stack setup.
Path B — Native:
git clone https://github.com/openworm/openworm-ml.git
cd openworm-ml
pip install -e ".[dev]" # includes PyTorch, ESM/AlphaFold dependencies
GPU Support
PyTorch with CUDA (Linux) or MPS (macOS) is strongly recommended for foundation model inference. CPU-only mode works but is significantly slower for structure prediction and conformational sampling.
Step-by-step
# Clone and set up (if not done above)
git clone https://github.com/openworm/openworm-ml.git
cd openworm-ml/foundation_params
# Install dependencies
pip install -r requirements.txt # torch, esm, biopython, pandas, numpy
# Step 1: Download C. elegans ion channel sequences from WormBase
python scripts/fetch_channel_sequences.py \
--output data/celegans_channel_sequences.fasta
# Step 2: Run structure prediction (or download from AlphaFold DB)
python scripts/predict_structures.py \
--sequences data/celegans_channel_sequences.fasta \
--output data/predicted_structures/
# Step 3: Run kinetics prediction pipeline
python scripts/predict_kinetics.py \
--structures data/predicted_structures/ \
--output output/channel_kinetics_predictions.csv
# Step 4: Cross-validate against known channels
python scripts/run_cross_validation.py \
--predictions output/channel_kinetics_predictions.csv \
--ground_truth data/known_channel_kinetics.csv \
--output output/cross_validation_report.json
# Green light: relative error < 30% on HH parameters
# Step 5 (Phase 1): Generate DD005 calibration priors
python scripts/generate_dd005_priors.py \
--predictions output/channel_kinetics_predictions.csv \
--cengen data/CeNGEN_L4_expression.csv \
--output output/per_class_hh_params.csv
Scripts that don't exist yet
| Script | Status | Phase |
|---|---|---|
| scripts/fetch_channel_sequences.py | [TO BE CREATED] | Phase A |
| scripts/predict_structures.py | [TO BE CREATED] | Phase A |
| scripts/predict_kinetics.py | [TO BE CREATED] | Phase A |
| scripts/run_cross_validation.py | [TO BE CREATED] | Phase A |
| scripts/generate_dd005_priors.py | [TO BE CREATED] | Phase 1 |
Context & Background
The Problem: Limited Electrophysiology Data
DD001 uses the same generic HH parameters for all 302 neurons. DD005 proposes specializing via CeNGEN single-cell transcriptomics, but the mapping from mRNA transcript counts to functional conductance densities is a hard, unsolved problem. The current plan (DD005) proposes a hand-crafted scaling:
g_max(neuron_class, channel) = baseline_g * expression_level(neuron_class, channel) / max_expression(channel)
This is biologically naive — mRNA levels don't linearly predict protein abundance, protein abundance doesn't linearly predict functional conductance, and post-translational modification, trafficking, and localization all intervene. Only ~20 neuron types have patch-clamp electrophysiology for calibration.
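For reference, the scaling formula can be written out directly. This is a minimal sketch of the formula exactly as stated; the function name and the numbers are illustrative, and units are whatever DD005 uses for `baseline_g`.

```python
def scaled_g_max(baseline_g, expression, max_expression):
    """Linear expression-to-conductance scaling, as written in the formula."""
    if max_expression <= 0:
        raise ValueError("max_expression must be positive")
    return baseline_g * expression / max_expression


# Illustrative numbers: a class expressing a channel at half the maximum
# observed level receives half the baseline conductance.
g = scaled_g_max(baseline_g=0.3, expression=50.0, max_expression=100.0)
```

The sketch makes the limitation visible: the result depends only on relative transcript level, with no term for translation efficiency, trafficking, or localization.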
The Solution: Protein Foundation Models
A rapidly expanding ecosystem of protein foundation models now enables prediction of ion channel kinetics directly from sequence:
Step 1: Gene sequence → Protein structure
Tool: AlphaFold 3, Boltz-2, or Protenix
Input: C. elegans ion channel gene sequences (from WormBase)
Output: Predicted 3D protein structures
Step 2: Protein structure → Conformational dynamics
Tool: BioEmu-1 (Microsoft, 100,000x MD speed)
Input: Predicted structures
Output: Gating transition ensembles (open ↔ closed states)
Step 3: Conformational dynamics → HH kinetics
Tool: ML predictor (trained on channels with known kinetics)
Input: Conformational landscape + known electrophysiology database
Output: Predicted HH parameters (V_half, k, tau for each gate)
Step 4: Feed into DD001/DD005 HH ODEs
Output parameters go directly into NeuroML (or differentiable backend)
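To make concrete what the predicted parameters mean downstream: in a common HH-style formulation, V_half and k define a Boltzmann steady-state curve and tau sets the relaxation rate of the gate. The sketch below assumes that formulation with a voltage-independent tau, which is a simplification. DD001's exact channel equations are not reproduced here and may differ. The numeric values are the illustrative egl-36 values from the implementation example in this document.

```python
import math


def steady_state(v, v_half, k):
    """Boltzmann steady-state gate value; k > 0 means the gate opens with depolarization."""
    return 1.0 / (1.0 + math.exp((v_half - v) / k))


def step_gate(x, v, v_half, k, tau, dt):
    """Advance one gating variable by dt (ms) using exact exponential relaxation."""
    x_inf = steady_state(v, v_half, k)
    return x_inf + (x - x_inf) * math.exp(-dt / tau)


# Hold the membrane at the half-activation voltage: the gate relaxes toward 0.5.
m = 0.0
for _ in range(1000):  # 1000 steps of 0.1 ms = 100 ms
    m = step_gate(m, v=-22.0, v_half=-22.0, k=5.3, tau=12.0, dt=0.1)
```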
Why This Changed: BioEmu-1
DD005 Alternative #1 originally rejected this approach because molecular dynamics was "computationally expensive (days-weeks per channel)." BioEmu-1 (Microsoft, 2025) changed this calculus: conformational ensembles at 100,000x MD speed make gating parameter prediction feasible for all C. elegans channels.
Why Phase A (Not Phase 3)
This pipeline was originally specified as DD017 Component 3 in Phase 3 (months 7-12). It belongs in Phase A because:
- Derisks DD005: If DD005's power-law expression→conductance scaling fails for certain neuron classes, DD025 predictions are ready immediately as a fallback
- No infrastructure dependencies: Inputs (WormBase sequences, literature kinetics) are available today. No Docker stack, no simulation infrastructure needed
- Available tools: AlphaFold 3, BioEmu-1, ESM Cambrian are all publicly available with open-source code
- Independent scope: Distinct inputs/outputs, timeline, and validation criteria from DD017 Components 1, 2, and 4
Technical Approach
Foundation Models for Each Pipeline Step
| Step | Model | What It Provides | Advantage Over Generic Tools |
|---|---|---|---|
| 1 (Structure) | AlphaFold 3 | Protein-ion complex structures | Predicts bound ions/lipids critical for channel selectivity |
| 1 (Structure) | Boltz-2 | Open-source, single-GPU | Matches AF3 accuracy without cloud dependency |
| 1 (Structure) | Protenix | Apache 2.0 AF3 reproduction | No licensing restrictions for integration |
| 1→2 (Dynamics) | BioEmu-1 | Conformational ensembles at 100,000x MD speed | Directly predicts gating transitions (open ↔ closed states) |
| 2 (Embeddings) | ESM Cambrian | Protein language model (300M-6B params) | Outperforms ESM-2; captures functional properties from sequence alone |
| 2 (Embeddings) | SaProt | Structure-aware protein LM | Combines sequence + 3Di structural tokens; better for mutation effects |
BioEmu-1 is particularly significant for Step 2: instead of training a separate ML predictor on the small dataset of ~50-100 channels with known kinetics, BioEmu-1 can directly simulate the gating dynamics of any predicted channel structure and extract V_half, slope, and tau from the conformational landscape. This converts the problem from "learn kinetics from sparse data" to "simulate kinetics from abundant structures."
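One way the extraction step could work for V_half and slope is sketched below. It assumes that open-state fractions at several membrane voltages are already available. How to obtain voltage-dependent fractions from sampled ensembles is not specified in this document and is the main open assumption. Given such fractions, a Boltzmann curve can be fitted by linearizing with the logit transform. The function name is hypothetical, and tau is not addressed by this sketch.

```python
import math


def fit_boltzmann(voltages, open_fractions):
    """Least-squares fit of logit(p) = (V - V_half) / k; returns (V_half, k)."""
    xs, ys = [], []
    for v, p in zip(voltages, open_fractions):
        if 0.0 < p < 1.0:  # logit is undefined at exactly 0 or 1
            xs.append(v)
            ys.append(math.log(p / (1.0 - p)))
    if len(xs) < 2:
        raise ValueError("need at least two points with 0 < p < 1")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx              # equals 1 / k
    intercept = mean_y - slope * mean_x  # equals -V_half / k
    return -intercept / slope, 1.0 / slope


# Synthetic check: generate fractions from a known curve and recover its parameters.
true_v_half, true_k = -22.0, 5.3
volts = [-60.0, -40.0, -30.0, -20.0, -10.0, 0.0, 20.0]
fracs = [1.0 / (1.0 + math.exp((true_v_half - v) / true_k)) for v in volts]
v_half_fit, k_fit = fit_boltzmann(volts, fracs)
```

The logit fit weights points near 0 and 1 heavily, so with noisy fractions a nonlinear least-squares fit would likely be more robust.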
Implementation
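import torch
import esm  # fair-esm package (ESM-2 protein language model)

class ChannelKineticsPredictor(torch.nn.Module):
    """Predicts HH parameters from a mean-pooled ESM-2 sequence embedding."""

    def __init__(self):
        super().__init__()
        # The loader returns both the model and its alphabet (tokenizer).
        self.esm_model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
        self.batch_converter = alphabet.get_batch_converter()
        self.predictor = torch.nn.Sequential(
            torch.nn.Linear(1280, 256),  # ESM-2 650M embedding dim
            torch.nn.ReLU(),
            torch.nn.Linear(256, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 8),  # V_half_m, k_m, tau_m, V_half_h, k_h, tau_h, g_max_scale, E_rev
        )

    def forward(self, protein_sequence):
        # Tokenize a single amino acid string; the label "query" is arbitrary.
        _, _, tokens = self.batch_converter([("query", protein_sequence)])
        out = self.esm_model(tokens, repr_layers=[33])
        embedding = out["representations"][33]  # shape (1, L + 2, 1280), incl. BOS/EOS
        pooled = embedding[:, 1:-1].mean(dim=1)  # Pool over residues only
        hh_params = self.predictor(pooled)
        return hh_params

# Predict kinetics for C. elegans K_slow channel (egl-36)
predictor = ChannelKineticsPredictor().eval()
egl36_sequence = load_wormbase_sequence("egl-36")  # helper not yet implemented
with torch.no_grad():
    predicted_params = predictor(egl36_sequence)
# Illustrative output of a trained predictor: V_half_m=-22mV, k_m=5.3, tau_m=12ms, ...
# Feed directly into DD001 HH model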
Training Data for Kinetics Prediction
Approximately 50-100 ion channels across species have both:
- Known 3D structures (from X-ray crystallography or cryo-EM)
- Known HH kinetic parameters (from patch-clamp electrophysiology)
The dataset is small but focused, so the predictor relies on transfer learning from pretrained protein language model representations rather than learning from scratch. The key channels to get right:
| Channel Family | C. elegans Gene | Mammalian Homolog | Known Kinetics? |
|---|---|---|---|
| Voltage-gated K+ (Kv) | egl-36, kvs-1, shk-1 | Kv1-Kv12 | Yes (mammalian) |
| Voltage-gated Ca2+ (Cav) | egl-19, unc-2, cca-1 | Cav1-Cav3 | Yes (mammalian) |
| Ca-activated K+ (KCa) | slo-1, slo-2 | BK, SK | Yes |
| Leak (K2P) | twk-* family | TASK, TREK | Partial |
| TRP channels | osm-9, ocr-* | TRPV, TRPA | Partial |
Strategic Importance
This pipeline creates a direct dependency on CZI's ESM and DeepMind's AlphaFold. The pitch to funders becomes:
"We don't compete with your foundation models — we consume them. Your ESM3 predicts our channel kinetics. Our mechanistic simulation is the testbed that validates whether your predictions produce real organism behavior. Fund us, and we provide the multi-scale benchmark that proves your models work."
Validation
Predicted parameters are validated in two ways:
- Cross-validation on known channels (Phase A): Leave-one-out cross-validation on the ~50-100 channels with known kinetics. Each channel is held out in turn, the predictor is trained on the remaining channels, and the predicted HH parameters for the held-out channel are compared with the measured values. Target: <30% relative error.
- End-to-end validation (Phase 1): Insert predicted per-neuron-class parameters into the full simulation. Run DD010 validation. If Tier 2 functional connectivity is not degraded below acceptance thresholds, the pipeline is adding value.
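The leave-one-out procedure and the relative-error metric can be sketched as follows. This is a hedged illustration, not the planned `run_cross_validation.py`. The `fit` and `predict` callables stand in for the real predictor, and the mean-of-training-set baseline and all numeric values are placeholders.

```python
def relative_error(predicted, measured):
    """Absolute relative error; undefined when the measured value is zero."""
    if measured == 0:
        raise ValueError("relative error undefined for measured == 0")
    return abs(predicted - measured) / abs(measured)


def leave_one_out(channels, fit, predict):
    """Hold out each channel once, fit on the rest, and score the held-out prediction."""
    report = {}
    for i, held_out in enumerate(channels):
        training = channels[:i] + channels[i + 1:]
        model = fit(training)
        predicted = predict(model, held_out)
        report[held_out["channel"]] = {
            name: relative_error(predicted[name], held_out["params"][name])
            for name in held_out["params"]
        }
    return report


# Placeholder predictor: the per-parameter mean of the training set.
def fit_mean(training):
    names = training[0]["params"].keys()
    return {n: sum(c["params"][n] for c in training) / len(training) for n in names}


def predict_mean(model, channel):
    return model  # ignores the held-out channel's features entirely


# Made-up values for illustration, not measured kinetics.
channels = [
    {"channel": "A", "params": {"V_half": -20.0, "tau": 10.0}},
    {"channel": "B", "params": {"V_half": -25.0, "tau": 12.0}},
    {"channel": "C", "params": {"V_half": -30.0, "tau": 14.0}},
]
report = leave_one_out(channels, fit_mean, predict_mean)
passes = all(err < 0.30 for errs in report.values() for err in errs.values())
```

One design point worth noting: relative error on V_half becomes unstable for channels whose measured V_half is close to 0 mV, so an absolute error in mV may be a useful companion metric for that parameter.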
Integration Contract
Inputs (What This Subsystem Consumes)
| Input | Source | Variable | Format | Units |
|---|---|---|---|---|
| Ion channel gene sequences | WormBase | Protein sequences for C. elegans ion channels | FASTA | amino acids |
| CeNGEN expression data | DD005 / DD008 | Per-class transcript levels | CSV | TPM |
| Known channel kinetics (training set) | Published electrophysiology + PDB | ~50-100 channels with measured HH params | CSV | mV, ms, mS/cm² |
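The FASTA input can be read with a few lines of standard-library Python. This is a minimal sketch for illustration; since `biopython` is in the requirements list, the real pipeline would more likely use `Bio.SeqIO.parse(handle, "fasta")`. The function name and the toy record are assumptions.

```python
import io


def read_fasta(handle):
    """Parse FASTA text into {record_id: sequence}; the id is the first header token."""
    records, current = {}, None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty FASTA header")
            current = header[0]
            records[current] = []
        elif current is None:
            raise ValueError("sequence data before first FASTA header")
        else:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy record; the residues are arbitrary and not a real channel sequence.
sample = io.StringIO(">toy-channel example header\nMKTAY\nIAKQR\n")
sequences = read_fasta(sample)
```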
Outputs (What This Subsystem Produces)
| Output | Consumer DD | Variable | Format | Units |
|---|---|---|---|---|
| Predicted channel kinetics | DD005 | Per-channel HH parameters (V_half, k, tau) | CSV | mV, ms, mS/cm² |
| Per-neuron-class HH parameters | DD001 | Per-class conductances from sequence + expression | CSV / YAML | mS/cm², mV, ms |
| Cross-validation report | Internal | Predicted vs. measured, error metrics | JSON | mixed |
Configuration (openworm.yml Section)
ml:
# DD025: Foundation model parameters
foundation_params: false # Use structure-predicted channel kinetics
esm_model: "esm2_t33_650M"
kinetics_predictor: "models/channel_kinetics_v1.pt"
Coupling Dependencies
| I Depend On | DD | What Breaks If They Change |
|---|---|---|
| CeNGEN data | DD005 / DD008 | If expression data versioning changes, per-class predictions change |
| HH equations | DD001 | If channel model equations change, predicted parameters must be remapped |
| Depends On Me | DD | What Breaks If I Change |
|---|---|---|
| Cell-type specialization (if using predicted kinetics) | DD005 | If predicted conductances change, per-class models change |
| Neural circuit (if using per-class params) | DD001 | If per-class parameters change, simulation behavior changes |
Boundaries (Explicitly Out of Scope)
- Differentiable simulation backend: That is DD017 Component 1 (Phase 3).
- SPH surrogate model: That is DD017 Component 2 (Phase 3).
- Learned sensory transduction: That is DD017 Component 4 (Phase 3).
- Replacing DD005's CeNGEN approach: DD025 runs in parallel. If DD005's power-law scaling works, DD025 predictions serve as independent validation. If DD005 fails for certain neuron classes, DD025 predictions substitute immediately.
- Neuropeptide-GPCR binding affinity: That is DD006's use of foundation models for a different application.
Implementation Roadmap
Phase A: Cross-Validation (~20 hours)
- Curate training data: Collect ~50-100 channels with both known structure (PDB) and known HH kinetics (electrophysiology literature)
- Set up inference pipeline: ESM embeddings + BioEmu-1 conformational sampling for C. elegans channel sequences
- Run cross-validation: Leave-one-out on known channels, report relative error on V_half, slope, tau
- Deliverable: channel_kinetics_predictions.csv + cross_validation_report.json
Phase 1: Integration with DD005 (~12 hours)
- Generate per-neuron-class parameters: Combine DD025 kinetics predictions with CeNGEN expression to produce per-class HH parameter sets
- Feed into DD005 calibration: DD025 predictions serve as structure-informed priors where electrophysiology is unavailable
- End-to-end validation: Insert predicted parameters into simulation, run DD010 Tier 2 + Tier 3
- Deliverable: per_class_hh_params.csv integrated into DD005 pipeline
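The per-class generation step is essentially a join between channel-level predictions and class-level expression. A minimal sketch follows, assuming expression is given as TPM per neuron class and gene. The `min_tpm` threshold, the function name, and the row layout are hypothetical, and how expression should scale conductance is left to DD005. The egl-36 values reuse the illustrative numbers from the implementation example.

```python
def per_class_rows(kinetics, expression, min_tpm=1.0):
    """Attach predicted kinetics to each (neuron class, channel) pair with TPM >= min_tpm."""
    rows = []
    for neuron_class in sorted(expression):
        for gene, tpm in sorted(expression[neuron_class].items()):
            if gene in kinetics and tpm >= min_tpm:
                rows.append({"neuron_class": neuron_class, "channel": gene,
                             "tpm": tpm, **kinetics[gene]})
    return rows


# Illustrative inputs only; expression values are placeholders, not CeNGEN data.
kinetics = {"egl-36": {"V_half_m": -22.0, "k_m": 5.3, "tau_m": 12.0}}
expression = {
    "AVA": {"egl-36": 40.0, "unc-2": 15.0},  # unc-2 has no prediction here, so it is skipped
    "ASH": {"egl-36": 0.2},                  # below threshold, so it is skipped
}
rows = per_class_rows(kinetics, expression)
```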
Quality Criteria
- Cross-validation: Leave-one-out cross-validation on known channels must achieve < 30% relative error on HH parameters.
- End-to-end: Predicted parameters inserted into the full simulation must not degrade DD010 Tier 2 or Tier 3 scores below acceptance thresholds.
- Reproducibility: All predictions must be reproducible from sequence input alone (no manual tuning).
- Provenance: Each predicted parameter must track which foundation model and version produced it.
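The provenance requirement could be met by storing a small record next to each predicted value. The sketch below is one possible shape, not a defined schema: the class name, its field names, and the version placeholder are all assumptions.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ParameterProvenance:
    """Records which model and version produced one predicted parameter."""
    channel: str
    parameter: str
    value: float
    model_name: str
    model_version: str


record = ParameterProvenance(
    channel="egl-36",
    parameter="V_half_m",
    value=-22.0,  # illustrative value from the implementation example
    model_name="esm2_t33_650M",
    model_version="unspecified",  # placeholder; no version scheme is defined yet
)
row = asdict(record)  # plain dict, ready to serialize alongside the CSV/JSON outputs
```

Making the record immutable (`frozen=True`) keeps provenance from being altered after a prediction is written, which supports the reproducibility criterion.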
Relationship to DD005 and DD017
DD005 (Cell-Type Specialization): DD025 does not replace the CeNGEN expression-based approach — it runs in parallel. If DD005's power-law scaling works, DD025 predictions serve as independent validation. If DD005 fails for certain neuron classes, DD025 predictions substitute immediately.
DD017 (Hybrid ML Framework): DD025 was originally DD017 Component 3. Components 1 (differentiable backend), 2 (SPH surrogate), and 4 (learned sensory) remain in DD017 as Phase 3 work. DD025 was extracted because it has no infrastructure dependencies and derisks DD005's uncertain mapping.
References
- Jumper J et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature 596:583-589. Structure prediction for channel kinetics pipeline.
- Lin Z, Akin H, Rao R, et al. (2023). "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379:1123-1130. ESM-2 protein language model.
- Wohlwend J et al. (2025). "Boltz-2: Open-source, single-GPU protein structure prediction." GitHub. Open-source alternative to AlphaFold 3.
- Zheng S et al. (2025). "BioEmu-1: Protein conformational ensembles at 100,000x MD speed." Microsoft Research. Key enabler: makes channel dynamics prediction feasible at scale.
- Taylor SR et al. (2021). "Molecular topography of an entire nervous system." Cell 184:4329-4347. CeNGEN database, the source of ion channel expression data.
- Approved by: Pending
- Implementation Status: Proposed
- Next Actions:
  - Curate training dataset: ~50-100 channels with known structure + kinetics
  - Download C. elegans ion channel sequences from WormBase
  - Set up ESM + BioEmu-1 inference pipeline
  - Run cross-validation, assess error rates
  - If <30% error: generate per-neuron-class predictions for DD005 integration