DD025: Protein Foundation Model Pipeline for Ion Channel Kinetics

Status: Proposed (Phase A2 / Phase 1)
Author: OpenWorm Core Team
Date: 2026-02-22
Supersedes: None (extracted from DD017 Component 3)
Related: DD001 (Neural Circuit), DD005 (Cell-Type Specialization), DD017 (Hybrid ML Framework), DD010 (Validation Framework)

Phase: Phase A2: Governance & Derisking (cross-validation) / Phase 1 (integration) | Layer: ML/Structural Biology

TL;DR

Predict ion channel kinetics (HH parameters: V_half, slope, tau) from amino acid sequences using protein foundation models (AlphaFold 3, BioEmu-1, ESM Cambrian). This expands the calibration set for DD005 from ~20 neurons (limited by patch-clamp electrophysiology) toward all 128 neuron classes (limited only by sequence availability). Cross-validation against known channels begins in Phase A2; predictions feed into DD005 calibration as structure-informed priors in Phase 1.

Goal & Success Criteria

Goal: Build a computational pipeline that predicts Hodgkin-Huxley kinetic parameters for C. elegans ion channels from protein sequence, validated against channels with experimentally measured kinetics.

Criterion	Target	Phase	DD010 Tier
Primary: Cross-validation on known channels	< 30% relative error on HH parameters (V_half, slope, tau)	Phase A2	Tier 1 (non-blocking)
Secondary: End-to-end simulation improvement	Predicted parameters inserted into simulation do not degrade DD010 Tier 2 or Tier 3 scores below acceptance thresholds	Phase 1	Tier 2/3 (blocking)
Tertiary: Coverage expansion	Predictions available for ≥80% of ion channel genes expressed in CeNGEN	Phase 1	Non-blocking

Before: DD005 calibrates expression→conductance using ~20 neurons with patch-clamp data. Remaining 108 classes extrapolate from this small training set.

After: Structure-based kinetics predictions available for most C. elegans ion channels. DD005 uses these as calibration priors where electrophysiology is unavailable.

Deliverables

Artifact	Path	Format	Phase
Channel kinetics predictions	`foundation_params/output/channel_kinetics_predictions.csv`	CSV (channel, V_half_m, k_m, tau_m, V_half_h, k_h, tau_h, g_max_scale, E_rev, confidence)	Phase A2
Cross-validation report	`foundation_params/output/cross_validation_report.json`	JSON (per-channel predicted vs. measured, error metrics)	Phase A2
Foundation model inference scripts	`foundation_params/scripts/`	Python	Phase A2
Per-neuron-class HH parameters	`foundation_params/output/per_class_hh_params.csv`	CSV (128 neuron classes × channel parameters)	Phase 1
Integration adapter for DD005	`foundation_params/scripts/generate_dd005_priors.py`	Python	Phase 1
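
The column list for `channel_kinetics_predictions.csv` could be pinned down in code so that the producing and consuming scripts agree on it. The sketch below is one possible way to do that with the standard library. The function names `write_predictions` and `read_predictions` are hypothetical, and the numbers in the example row are placeholders rather than predictions.

```python
import csv
import io

# Column order as listed in the Deliverables table.
COLUMNS = ["channel", "V_half_m", "k_m", "tau_m", "V_half_h", "k_h", "tau_h",
           "g_max_scale", "E_rev", "confidence"]

def write_predictions(rows, handle):
    """Write prediction rows, rejecting any row with missing or extra columns."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        if set(row) != set(COLUMNS):
            raise ValueError(f"row for {row.get('channel')!r} does not match schema")
        writer.writerow(row)

def read_predictions(handle):
    """Read rows back, converting every column except `channel` to float."""
    return [
        {"channel": row["channel"], **{c: float(row[c]) for c in COLUMNS[1:]}}
        for row in csv.DictReader(handle)
    ]

# Placeholder values for illustration only.
example = {"channel": "egl-36", "V_half_m": -22.0, "k_m": 5.3, "tau_m": 12.0,
           "V_half_h": -60.0, "k_h": -7.0, "tau_h": 100.0,
           "g_max_scale": 1.0, "E_rev": -80.0, "confidence": 0.5}
buffer = io.StringIO()
write_predictions([example], buffer)
buffer.seek(0)
round_trip = read_predictions(buffer)
```

Rejecting rows that do not match the schema at write time is a design choice: it surfaces a renamed or dropped column in the prediction step rather than later in the DD005 adapter.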

Repository & Issues

Item	Value
Repository	`openworm/openworm-ml` (new repo, shared with DD017) `[TO BE CREATED]`
Subdirectory	`foundation_params/`
Issue label	`dd025`
Milestone	Phase A2 — Foundation Model Channel Kinetics
Example PR title	`DD025: cross-validation of BioEmu-1 kinetics predictions on 50 channels`

Quick Action Reference

Question	Answer
Phase	Phase A2 (cross-validation), Phase 1 (integration)
Layer	ML/Structural Biology — parallel track derisking DD005
What does this produce?	Predicted HH kinetic parameters for C. elegans ion channels from protein sequence + structure
Success metric	Cross-validation <30% relative error on known channels; end-to-end DD010 Tier 2 scores not degraded
Repository	`openworm/openworm-ml/foundation_params/` — issues labeled `dd025`
Config toggle	`ml.foundation_params: true` in `openworm.yml`
Build & test	`python foundation_params/scripts/run_cross_validation.py`

How to Build & Test

Prerequisites

Python 3.10+, PyTorch, ESM library, internet access for model downloads
GPU recommended: CUDA (Linux) or MPS (macOS) for foundation model inference

Getting Started (Environment Setup)

Path A — Docker (recommended):

# From the OpenWorm meta-repo (see DD013 Simulation Stack Architecture)
docker compose build ml
# Then skip to Step 3 below — dependencies are pre-installed in the container

Cross-reference: DD013 for the full Docker Compose stack setup.

Path B — Native:

git clone https://github.com/openworm/openworm-ml.git
cd openworm-ml
pip install -e ".[dev]"  # includes PyTorch, ESM/AlphaFold dependencies

GPU Support

PyTorch with CUDA (Linux) or MPS (macOS) is strongly recommended for foundation model inference. CPU-only mode works but is significantly slower for structure prediction and conformational sampling.

Step-by-step

# Clone and set up (if not done above)
git clone https://github.com/openworm/openworm-ml.git
cd openworm-ml/foundation_params

# Install dependencies
pip install -r requirements.txt  # torch, esm, biopython, pandas, numpy

# Step 1: Download C. elegans ion channel sequences from WormBase
python scripts/fetch_channel_sequences.py \
    --output data/celegans_channel_sequences.fasta

# Step 2: Run structure prediction (or download from AlphaFold DB)
python scripts/predict_structures.py \
    --sequences data/celegans_channel_sequences.fasta \
    --output data/predicted_structures/

# Step 3: Run kinetics prediction pipeline
python scripts/predict_kinetics.py \
    --structures data/predicted_structures/ \
    --output output/channel_kinetics_predictions.csv

# Step 4: Cross-validate against known channels
python scripts/run_cross_validation.py \
    --predictions output/channel_kinetics_predictions.csv \
    --ground_truth data/known_channel_kinetics.csv \
    --output output/cross_validation_report.json
# Green light: relative error < 30% on HH parameters

# Step 5 (Phase 1): Generate DD005 calibration priors
python scripts/generate_dd005_priors.py \
    --predictions output/channel_kinetics_predictions.csv \
    --cengen data/CeNGEN_L4_expression.csv \
    --output output/per_class_hh_params.csv

Scripts that don't exist yet

Script	Status	Phase
`scripts/fetch_channel_sequences.py`	`[TO BE CREATED]`	Phase A2
`scripts/predict_structures.py`	`[TO BE CREATED]`	Phase A2
`scripts/predict_kinetics.py`	`[TO BE CREATED]`	Phase A2
`scripts/run_cross_validation.py`	`[TO BE CREATED]`	Phase A2
`scripts/generate_dd005_priors.py`	`[TO BE CREATED]`	Phase 1

Context & Background

The Problem: Limited Electrophysiology Data

DD001 uses the same generic HH parameters for all 302 neurons. DD005 proposes specializing via CeNGEN single-cell transcriptomics, but the mapping from mRNA transcript counts to functional conductance densities is a hard, unsolved problem. The current plan (DD005) proposes a hand-crafted scaling:

g_max(neuron_class, channel) = baseline_g * expression_level(neuron_class, channel) / max_expression(channel)

This is biologically naive — mRNA levels don't linearly predict protein abundance, protein abundance doesn't linearly predict functional conductance, and post-translational modification, trafficking, and localization all intervene. Only ~20 neuron types have patch-clamp electrophysiology for calibration.
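
To make the scaling rule concrete, here is a minimal sketch that applies the formula above to one channel across three neuron classes. The expression values, the baseline conductance, and the function name `scaled_g_max` are all illustrative assumptions, not values from DD005.

```python
def scaled_g_max(baseline_g: float, expression: float, max_expression: float) -> float:
    """Linear expression-to-conductance scaling, as in the formula above."""
    if max_expression <= 0:
        raise ValueError("max_expression must be positive")
    return baseline_g * expression / max_expression

# Hypothetical TPM values for a single channel in three neuron classes.
expression_by_class = {"AVA": 120.0, "AWC": 30.0, "ASH": 0.0}
baseline_g = 0.5  # mS/cm^2, illustrative value only

max_expr = max(expression_by_class.values())
g_max_by_class = {
    neuron_class: scaled_g_max(baseline_g, tpm, max_expr)
    for neuron_class, tpm in expression_by_class.items()
}
```

Under this rule the class with the highest expression always receives exactly `baseline_g`, and every other class is a fixed fraction of it, which is the linearity assumption the paragraph above questions.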

The Solution: Protein Foundation Models

A rapidly expanding ecosystem of protein foundation models now enables prediction of ion channel kinetics directly from sequence:

Step 1: Gene sequence → Protein structure
        Tool: AlphaFold 3, Boltz-2, or Protenix
        Input: C. elegans ion channel gene sequences (from WormBase)
        Output: Predicted 3D protein structures

Step 2: Protein structure → Conformational dynamics
        Tool: BioEmu-1 (Microsoft, 100,000x MD speed)
        Input: Predicted structures
        Output: Gating transition ensembles (open ↔ closed states)

Step 3: Conformational dynamics → HH kinetics
        Tool: ML predictor (trained on channels with known kinetics)
        Input: Conformational landscape + known electrophysiology database
        Output: Predicted HH parameters (V_half, k, tau for each gate)

Step 4: Feed into DD001/DD005 HH ODEs
        Output parameters go directly into NeuroML (or differentiable backend)
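
For readers less familiar with the Step 3 outputs, the sketch below shows how V_half, k, and tau are conventionally used in a Hodgkin-Huxley gate: a Boltzmann steady-state curve plus first-order relaxation toward it. This assumes DD001's gates take that common form, which this document does not confirm. The numbers reuse the illustrative egl-36 values from the Implementation section.

```python
import math

def steady_state(v_mv: float, v_half_mv: float, k_mv: float) -> float:
    """Boltzmann steady-state gate value; k > 0 for activation, k < 0 for inactivation."""
    return 1.0 / (1.0 + math.exp(-(v_mv - v_half_mv) / k_mv))

def relax(m0: float, m_inf: float, tau_ms: float, t_ms: float) -> float:
    """Exact solution of dm/dt = (m_inf - m) / tau at a fixed voltage."""
    return m_inf + (m0 - m_inf) * math.exp(-t_ms / tau_ms)

# Illustrative activation-gate parameters (mV, mV, ms).
v_half_m, k_m, tau_m = -22.0, 5.3, 12.0

m_inf_at_0mV = steady_state(0.0, v_half_m, k_m)
m_after_one_tau = relax(0.0, m_inf_at_0mV, tau_m, tau_m)
```

By construction the gate is half-activated at V_half, and after one time constant it has covered about 63% of the distance to its steady-state value.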

Why This Changed: BioEmu-1

DD005 Alternative #1 originally rejected this approach because molecular dynamics was "computationally expensive (days-weeks per channel)." BioEmu-1 (Microsoft, 2025) changed this calculus: conformational ensembles at 100,000x MD speed make gating parameter prediction feasible for all C. elegans channels.

Why Phase A2 (Not Phase 3)

This pipeline was originally specified as DD017 Component 3 in Phase 3 (months 7-12). It belongs in Phase A2 because:

Derisks DD005: If DD005's power-law expression→conductance scaling fails for certain neuron classes, DD025 predictions are ready immediately as a fallback
No infrastructure dependencies: Inputs (WormBase sequences, literature kinetics) are available today. No Docker stack, no simulation infrastructure needed
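Available tools: AlphaFold 3, BioEmu-1, and ESM Cambrian all have publicly released code. License terms differ between them; Boltz-2 and Protenix are open alternatives where AlphaFold 3's terms are a constraint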
Independent scope: Distinct inputs/outputs, timeline, and validation criteria from DD017 Components 1, 2, and 4

Technical Approach

Foundation Models for Each Pipeline Step

Step	Model	What It Provides	Advantage Over Generic Tools
1 (Structure)	AlphaFold 3	Protein-ion complex structures	Predicts bound ions/lipids critical for channel selectivity
1 (Structure)	Boltz-2	Open-source, single-GPU	Matches AF3 accuracy without cloud dependency
1 (Structure)	Protenix	Apache 2.0 AF3 reproduction	No licensing restrictions for integration
1→2 (Dynamics)	BioEmu-1	Conformational ensembles at 100,000x MD speed	Directly predicts gating transitions (open ↔ closed states)
2 (Embeddings)	ESM Cambrian	Protein language model (300M-6B params)	Outperforms ESM-2; captures functional properties from sequence alone
2 (Embeddings)	SaProt	Structure-aware protein LM	Combines sequence + 3Di structural tokens; better for mutation effects

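BioEmu-1 is particularly significant for Steps 2 and 3: instead of relying solely on an ML predictor trained on the small dataset of ~50-100 channels with known kinetics, BioEmu-1 can sample the open and closed conformational states of any predicted channel structure, and gating parameters can then be estimated from the resulting conformational landscape. Because BioEmu-1 samples equilibrium ensembles rather than time-resolved trajectories, the steady-state parameters (V_half, slope) follow more directly from its output than the time constants (tau), which may still require the Step 3 predictor. This converts much of the problem from "learn kinetics from sparse data" to "simulate kinetics from abundant structures."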
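
One way to read gating information off a conformational ensemble is a two-state thermodynamic model, sketched below under several assumptions. It assumes each sampled conformation has already been labelled open or closed by some classifier (hypothetical here, for example a pore-radius threshold). It also assumes the effective gating charge `z` comes from elsewhere, since the ensemble itself is sampled without an applied voltage. The function names are illustrative.

```python
import math

R = 8.314462618   # gas constant, J / (mol K)
F = 96485.33212   # Faraday constant, C / mol

def open_closed_free_energy(n_open: int, n_closed: int,
                            temperature_k: float = 298.15) -> float:
    """Two-state free energy difference G_open - G_closed in kJ/mol from ensemble counts."""
    if n_open <= 0 or n_closed <= 0:
        raise ValueError("both states must be sampled at least once")
    return -R * temperature_k * math.log(n_open / n_closed) / 1000.0

def slope_factor_mv(gating_charge: float, temperature_k: float = 298.15) -> float:
    """Boltzmann slope factor k = RT / (zF) in mV for an effective gating charge z."""
    return 1000.0 * R * temperature_k / (gating_charge * F)

# Hypothetical ensemble: 250 samples labelled open, 750 labelled closed.
delta_g = open_closed_free_energy(250, 750)
k_for_z4 = slope_factor_mv(4.0)
```

In this model a larger gating charge gives a steeper voltage dependence (smaller k), and a closed-biased ensemble gives a positive open-minus-closed free energy. Neither quantity carries timing information, so tau is outside what this sketch can estimate.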

Implementation

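import torch
import esm  # fair-esm package: ESM-2 protein language model

class ChannelKineticsPredictor(torch.nn.Module):
    """Predicts HH parameters from a mean-pooled ESM-2 sequence embedding."""

    def __init__(self):
        super().__init__()
        # The loader returns (model, alphabet); the alphabet supplies the tokenizer.
        self.esm_model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
        self.batch_converter = alphabet.get_batch_converter()
        self.predictor = torch.nn.Sequential(
            torch.nn.Linear(1280, 256),  # ESM-2 650M embedding dim
            torch.nn.ReLU(),
            torch.nn.Linear(256, 64),
            torch.nn.ReLU(),
            torch.nn.Linear(64, 8),  # V_half_m, k_m, tau_m, V_half_h, k_h, tau_h, g_max_scale, E_rev
        )

    def forward(self, protein_sequence):
        # Tokenize one amino acid string; the label "channel" is arbitrary.
        _, _, tokens = self.batch_converter([("channel", protein_sequence)])
        tokens = tokens.to(next(self.parameters()).device)
        output = self.esm_model(tokens, repr_layers=[33])
        residues = output["representations"][33][:, 1:-1]  # Drop BOS/EOS tokens
        hh_params = self.predictor(residues.mean(dim=1))  # Pool over residues
        return hh_params

# Predict kinetics for C. elegans K_slow channel (egl-36)
predictor = ChannelKineticsPredictor().eval()
egl36_sequence = load_wormbase_sequence("egl-36")  # helper still to be written
with torch.no_grad():
    predicted_params = predictor(egl36_sequence)
# Once the predictor head is trained, e.g. V_half_m=-22mV, k_m=5.3, tau_m=12ms, ...
# (illustrative values; an untrained head returns arbitrary numbers)
# Feed directly into DD001 HH model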

Training Data for Kinetics Prediction

Approximately 50-100 ion channels across species have both:

Known 3D structures (from X-ray crystallography or cryo-EM)
Known HH kinetic parameters (from patch-clamp electrophysiology)

This is a small but focused dataset, and transfer learning from protein language model representations helps compensate for its size. The key channels to get right:

Channel Family	C. elegans Gene	Mammalian Homolog	Known Kinetics?
Voltage-gated K+ (Kv)	egl-36, kvs-1, shk-1	Kv1-Kv12	Yes (mammalian)
Voltage-gated Ca2+ (Cav)	egl-19, unc-2, cca-1	Cav1-Cav3	Yes (mammalian)
Ca-activated K+ (KCa)	slo-1, slo-2	BK (Slo1), Slo2 (Slack/Slick)	Yes
Leak (K2P)	twk-* family	TASK, TREK	Partial
TRP channels	osm-9, ocr-*	TRPV	Partial

Strategic Importance

This pipeline creates a direct dependency on CZI's ESM and DeepMind's AlphaFold. The pitch to funders becomes:

"We don't compete with your foundation models — we consume them. Your ESM3 predicts our channel kinetics. Our mechanistic simulation is the testbed that validates whether your predictions produce real organism behavior. Fund us, and we provide the multi-scale benchmark that proves your models work."

Validation

Predicted parameters are validated in two ways:

Cross-validation on known channels (Phase A2): Leave-one-out cross-validation on ~50-100 channels with known kinetics. Hold out one channel at a time, train on the remaining channels, and compare predicted vs. measured HH parameters for the held-out channel. Target: <30% relative error.
End-to-end validation (Phase 1): Insert predicted per-neuron-class parameters into the full simulation. Run DD010 validation. If Tier 2 functional connectivity is not degraded below acceptance thresholds, the pipeline is adding value.
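
The leave-one-out procedure and the relative-error metric can be sketched as follows. This is a minimal illustration rather than the planned `run_cross_validation.py`: the "model" is a mean baseline that ignores sequence features, the three channels and their values are synthetic, and all function names are hypothetical.

```python
from statistics import mean

def relative_error(predicted: float, measured: float) -> float:
    """|predicted - measured| / |measured|; undefined when measured is 0."""
    return abs(predicted - measured) / abs(measured)

def fit_mean_baseline(training):
    """Baseline 'model': the per-parameter mean over the training channels."""
    params = training[0]["measured"].keys()
    return {p: mean(ch["measured"][p] for ch in training) for p in params}

def leave_one_out(channels, fit):
    """Hold out each channel once, fit on the rest, and score the held-out channel."""
    report = {}
    for i, held_out in enumerate(channels):
        prediction = fit(channels[:i] + channels[i + 1:])
        report[held_out["name"]] = {
            p: relative_error(prediction[p], measured)
            for p, measured in held_out["measured"].items()
        }
    return report

# Synthetic values for illustration; not measured kinetics.
channels = [
    {"name": "chan_a", "measured": {"V_half_m": -20.0, "tau_m": 10.0}},
    {"name": "chan_b", "measured": {"V_half_m": -30.0, "tau_m": 20.0}},
    {"name": "chan_c", "measured": {"V_half_m": -40.0, "tau_m": 30.0}},
]
report = leave_one_out(channels, fit_mean_baseline)
passed = {name: all(err < 0.30 for err in errs.values())
          for name, errs in report.items()}
```

One caveat worth deciding early: relative error is ill-conditioned for parameters that can sit near zero, such as a V_half close to 0 mV, so the report may need an absolute-error fallback for those cases.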

Integration Contract

Inputs (What This Subsystem Consumes)

Input	Source	Variable	Format	Units
Ion channel gene sequences	WormBase	Protein sequences for C. elegans ion channels	FASTA	amino acids
CeNGEN expression data	DD005 / DD008	Per-class transcript levels	CSV	TPM
Known channel kinetics (training set)	Published electrophysiology + PDB	~50-100 channels with measured HH params	CSV	mV, ms, mS/cm²

Outputs (What This Subsystem Produces)

Output	Consumer DD	Variable	Format	Units
Predicted channel kinetics	DD005	Per-channel HH parameters (V_half, k, tau)	CSV	mV, ms, mS/cm²
Per-neuron-class HH parameters	DD001	Per-class conductances from sequence + expression	CSV / YAML	mS/cm², mV, ms
Cross-validation report	Internal	Predicted vs. measured, error metrics	JSON	mixed

Configuration (`openworm.yml` Section)

ml:
  # DD025: Foundation model parameters
  foundation_params: false         # Use structure-predicted channel kinetics
  esm_model: "esm2_t33_650M"
  kinetics_predictor: "models/channel_kinetics_v1.pt"

Coupling Dependencies

I Depend On	DD	What Breaks If They Change
CeNGEN data	DD005 / DD008	If expression data versioning changes, per-class predictions change
HH equations	DD001	If channel model equations change, predicted parameters must be remapped

Depends On Me	DD	What Breaks If I Change
Cell-type specialization (if using predicted kinetics)	DD005	If predicted conductances change, per-class models change
Neural circuit (if using per-class params)	DD001	If per-class parameters change, simulation behavior changes

Boundaries (Explicitly Out of Scope)

Differentiable simulation backend: That is DD017 Component 1 (Phase 3).
SPH surrogate model: That is DD017 Component 2 (Phase 3).
Learned sensory transduction: That is DD017 Component 4 (Phase 3).
Replacing DD005's CeNGEN approach: DD025 runs in parallel. If DD005's power-law scaling works, DD025 predictions serve as independent validation. If DD005 fails for certain neuron classes, DD025 predictions substitute immediately.
Neuropeptide-GPCR binding affinity: That is DD006's use of foundation models for a different application.

Implementation Roadmap

Phase A2: Cross-Validation (~20 hours)

Curate training data: Collect ~50-100 channels with both known structure (PDB) and known HH kinetics (electrophysiology literature)
Set up inference pipeline: ESM embeddings + BioEmu-1 conformational sampling for C. elegans channel sequences
Run cross-validation: Leave-one-out on known channels, report relative error on V_half, slope, tau
Deliverable: channel_kinetics_predictions.csv + cross_validation_report.json

Phase 1: Integration with DD005 (~12 hours)

Generate per-neuron-class parameters: Combine DD025 kinetics predictions with CeNGEN expression to produce per-class HH parameter sets
Feed into DD005 calibration: DD025 predictions serve as structure-informed priors where electrophysiology is unavailable
End-to-end validation: Insert predicted parameters into simulation, run DD010 Tier 2 + Tier 3
Deliverable: per_class_hh_params.csv integrated into DD005 pipeline
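
The first Phase 1 step, combining per-channel kinetics with per-class expression, amounts to a join. The sketch below shows one possible shape for it. The `min_tpm` expression threshold, the function name, and all data values are assumptions made for illustration; how expression should map to conductance is left to DD005.

```python
def per_class_rows(kinetics, expression_tpm, min_tpm=1.0):
    """Join per-channel kinetics with per-class expression, one row per (class, channel)."""
    rows = []
    for neuron_class, by_gene in sorted(expression_tpm.items()):
        for gene, tpm in sorted(by_gene.items()):
            if tpm < min_tpm or gene not in kinetics:
                continue  # not expressed, or no prediction available for this gene
            rows.append({"neuron_class": neuron_class, "channel": gene,
                         "tpm": tpm, **kinetics[gene]})
    return rows

# Illustrative inputs only.
kinetics = {
    "egl-36": {"V_half_m": -22.0, "k_m": 5.3, "tau_m": 12.0},
    "egl-19": {"V_half_m": -10.0, "k_m": 6.0, "tau_m": 2.0},
}
expression_tpm = {
    "AVA": {"egl-36": 40.0, "egl-19": 0.2, "unc-2": 15.0},
    "AWC": {"egl-19": 25.0},
}
rows = per_class_rows(kinetics, expression_tpm)
```

Genes that are expressed but have no prediction (here `unc-2`) are skipped silently in this sketch; a real adapter would probably want to log them, since they count against the coverage criterion.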

Quality Criteria

Cross-validation: Leave-one-out cross-validation on known channels must achieve < 30% relative error on HH parameters.
End-to-end: Predicted parameters inserted into the full simulation must not degrade DD010 Tier 2 or Tier 3 scores below acceptance thresholds.
Reproducibility: All predictions must be reproducible from sequence input alone (no manual tuning).
Provenance: Each predicted parameter must track which foundation model and version produced it.
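
The provenance criterion could be met by attaching the model name and version to every predicted value. The record layout below is a sketch under that assumption; the class name and field names are hypothetical, and the example value is illustrative.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class PredictedParameter:
    """One predicted value plus the foundation model and version that produced it."""
    channel: str
    parameter: str
    value: float
    units: str
    model_name: str
    model_version: str

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

# Illustrative record; the value is not a real prediction.
record = PredictedParameter(
    channel="egl-36", parameter="V_half_m", value=-22.0, units="mV",
    model_name="ESM-2", model_version="esm2_t33_650M_UR50D",
)
restored = PredictedParameter(**json.loads(record.to_json()))
```

Making the record immutable (`frozen=True`) keeps a prediction and its provenance from drifting apart after creation.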

Relationship to DD005 and DD017

DD005 (Cell-Type Specialization): DD025 does not replace the CeNGEN expression-based approach — it runs in parallel. If DD005's power-law scaling works, DD025 predictions serve as independent validation. If DD005 fails for certain neuron classes, DD025 predictions substitute immediately.

DD017 (Hybrid ML Framework): DD025 was originally DD017 Component 3. Components 1 (differentiable backend), 2 (SPH surrogate), and 4 (learned sensory) remain in DD017 as Phase 3 work. DD025 was extracted because it has no infrastructure dependencies and derisks DD005's uncertain mapping.

References

Jumper J et al. (2021). "Highly accurate protein structure prediction with AlphaFold." Nature 596:583-589. Structure prediction for channel kinetics pipeline.
Lin Z, Akin H, Rao R, et al. (2023). "Evolutionary-scale prediction of atomic-level protein structure with a language model." Science 379:1123-1130. ESM-2 protein language model (and ESMFold).
Wohlwend J et al. (2025). "Boltz-2: Open-source, single-GPU protein structure prediction." GitHub. Open-source alternative to AlphaFold 3.
Zheng S et al. (2025). "BioEmu-1: Protein conformational ensembles at 100,000x MD speed." Microsoft Research. Key enabler — makes channel dynamics prediction feasible at scale.
Taylor SR et al. (2021). "Molecular topography of an entire nervous system." Cell 184:4329-4347. CeNGEN database — source of ion channel expression data.

Approved by: Pending
Implementation Status: Proposed
Next Actions:
Curate training dataset: ~50-100 channels with known structure + kinetics
Download C. elegans ion channel sequences from WormBase
Set up ESM + BioEmu-1 inference pipeline
Run cross-validation, assess error rates
If <30% error: generate per-neuron-class predictions for DD005 integration