Mechanism: Traditional AI models trained on single-surgeon data fail to generalize because of high clinician-patient interaction variability. Readout: SpineDAO's multi-specialty, blockchain-verified data collection platform addresses this 44% interaction term, enabling more robust, generalizable clinical AI.
The Claim
Treatment variability in spine care has three distinct sources: patient presentation, clinician practice style, and — critically — the interaction between them. When this interaction term accounts for 44% of total variability, no model trained on a single surgeon's data can generalize across health systems. Multi-reviewer, multi-specialty labeled training data is a necessary precondition for clinical AI in spine surgery.
Background
AI models for predicting low back pain treatment pathways are constrained by training datasets that are small, geographically homogeneous, and specialty-limited. They typically record the treatment decision without capturing clinical reasoning, uncertainty, or reviewer confidence. The result: models that perform well in their training environment but fail to generalize elsewhere.
The SpineDAO Collaborative Group developed Spine Reviews — a blockchain-based platform on Solana that collected expert treatment recommendations from a global specialist panel, credentialed by non-transferable Soulbound Tokens (SBTs) and compensated by smart-contract-automated $SPINE token payments.
What We Found
Platform performance:
- 463 synthetic low back pain vignettes reviewed by 52 specialists across 7 countries
- 2,066 completed reviews in 37 days at $0.97/review (97.7% completion rate)
- Smart-contract automated compensation — no institutional payment processing
Variance decomposition (mixed-effects regression with reviewer random intercepts; a minimal sketch follows this list):
- 36.7% of treatment tier variability: patient presentation
- 19.2%: reviewer practice style
- 44.1%: reviewer–vignette interaction (neither patient severity nor surgeon identity alone)
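A decomposition of this kind can be sketched with a crossed random-effects model. The snippet below is a minimal illustration, assuming a long-format table with one row per completed review and hypothetical columns `tier`, `reviewer_id`, and `vignette_id`; it is not the authors' actual pipeline. With one review per reviewer-vignette pair, the residual variance absorbs the reviewer-by-vignette interaction.

```python
# Minimal sketch: crossed random intercepts for reviewer and vignette,
# fitted as variance components within a single dummy group (a standard
# statsmodels pattern for crossed designs). Column names are assumptions.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("reviews_long.csv")   # hypothetical: one row per review
df["all"] = 1                          # single dummy group for crossed design

model = smf.mixedlm(
    "tier ~ 1",
    data=df,
    groups="all",
    re_formula="0",                    # no per-group random intercept
    vc_formula={"reviewer": "0 + C(reviewer_id)",
                "vignette": "0 + C(vignette_id)"},
)
fit = model.fit()

# Order of fit.vcomp follows model.exog_vc.names; fit.scale is the
# residual variance, which here includes the interaction term.
components = dict(zip(model.exog_vc.names, fit.vcomp))
components["interaction+residual"] = fit.scale
total = sum(components.values())
for name, var in components.items():
    print(f"{name}: {100 * var / total:.1f}% of total variability")
```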
Clinical coherence confirmed:
- Neurological deficit strongest predictor (β=0.39, p<0.001)
- Symptom duration (β=0.12), pain severity (β=0.09) — all p<0.001
- Near-perfect agreement for surgical emergencies (Gwet's AC1 = 0.92; see the AC1 sketch after this list)
- Only 4% of high-disagreement vignettes reached consensus: the same presentation drew an emergency recommendation from one specialist and a conservative recommendation from another
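For readers unfamiliar with Gwet's AC1: it is a chance-corrected agreement coefficient that, unlike kappa, remains stable when category prevalences are skewed. The sketch below implements the standard multi-rater AC1 from a per-vignette count matrix; the toy numbers are illustrative, not study data.

```python
import numpy as np

def gwet_ac1(counts) -> float:
    """Gwet's AC1 for multiple raters and nominal categories.

    counts[i, k] = number of raters who assigned category k to item i.
    """
    counts = np.asarray(counts, dtype=float)
    r_i = counts.sum(axis=1)                       # raters per item
    # Observed agreement: chance that two raters of one item agree.
    pa = np.mean((counts * (counts - 1)).sum(axis=1) / (r_i * (r_i - 1)))
    # Gwet's chance agreement, from mean category prevalences.
    pi_k = (counts / r_i[:, None]).mean(axis=0)
    q = counts.shape[1]
    pe = (pi_k * (1 - pi_k)).sum() / (q - 1)
    return (pa - pe) / (1 - pe)

# Toy example: 3 vignettes, 4 raters, 3 treatment tiers.
print(gwet_ac1([[4, 0, 0], [3, 1, 0], [0, 2, 2]]))
```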
Specialty and geography:
- Orthopedic surgeons: 48% surgical recommendations; neurosurgeons: 36%; pain specialists: 41%
- Within-specialty AC1 for surgical decisions: 0.14 (ortho), 0.21 (neuro) — most disagreement within, not between, specialties
- Specialty explained only 18% of between-reviewer variance; 82% reflects individual practice style
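The 18% figure can be probed by refitting the decomposition sketch above with specialty as a fixed effect and comparing how much the reviewer variance component shrinks. This reuses `df`, `model`, and `fit` from the earlier sketch; the `specialty` column is an assumption.

```python
# Hypothetical check: how much between-reviewer variance does specialty
# absorb? Compare the reviewer variance component with and without a
# specialty fixed effect.
model_s = smf.mixedlm(
    "tier ~ C(specialty)",
    data=df,
    groups="all",
    re_formula="0",
    vc_formula={"reviewer": "0 + C(reviewer_id)",
                "vignette": "0 + C(vignette_id)"},
)
fit_s = model_s.fit()

var_rev = dict(zip(model.exog_vc.names, fit.vcomp))["reviewer"]
var_rev_s = dict(zip(model_s.exog_vc.names, fit_s.vcomp))["reviewer"]
print(f"reviewer variance explained by specialty: "
      f"{100 * (1 - var_rev_s / var_rev):.0f}%")
```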
Why This Matters for Clinical AI
The 44% interaction term is the key number for anyone building medical AI. It means:
- A model trained on one surgeon's data will encode that surgeon's biases
- Specialty is insufficient as a diversity proxy — within-specialty variance dominates
- Training data must include diverse clinician responses to the same patients, not just diverse patients
The consensus phenotyping identified four clusters: Surgical Convergence (29%), Interventional Convergence (37%), Conservative-Leaning Disagreement (23%), and Maximum Disagreement (11%). In the Maximum Disagreement cluster, the same patient could be labeled an emergency by one reviewer and a conservative case by another: precisely the cases where AI models trained on homogeneous data will be most confidently wrong.
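One plausible way to derive such phenotypes is to represent each vignette by the distribution of tiers its reviewers chose and cluster those profiles. The sketch below uses k-means with k = 4 to match the reported cluster count; the authors' actual phenotyping method may differ, and the counts are toy values.

```python
import numpy as np
from sklearn.cluster import KMeans

# counts[i, k] = number of reviewers assigning tier k to vignette i
# (same shape as in the AC1 sketch); toy values here.
counts = np.array([[4, 0, 0], [3, 1, 0], [0, 2, 2],
                   [1, 1, 2], [2, 2, 0]], dtype=float)
profiles = counts / counts.sum(axis=1, keepdims=True)  # per-vignette tier mix
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(profiles)
print(labels)   # cluster assignment per vignette
```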
The Platform Innovation
Soulbound Token credentialing replaced institutional trust with cryptographic proof. Each SBT is permanently bound to a wallet address — non-transferable, non-delegable. The combination of SBT identity, smart-contract payment, and off-chain clinical data storage creates a novel audit trail: every recommendation linked to a verified specialist, with on-chain timestamp and transaction signature.
This creates the data provenance infrastructure that regulatory frameworks for clinical AI are beginning to require.
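To make the audit trail concrete, one attestation record might carry the fields below. This is a hypothetical shape implied by the design described above; the field names are illustrative, not the platform's actual on-chain schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReviewAttestation:
    """Hypothetical audit-trail record; field names are assumptions."""
    reviewer_wallet: str   # wallet holding the non-transferable SBT
    sbt_mint: str          # SBT mint address proving credentials
    vignette_hash: str     # hash of the off-chain vignette and review
    unix_timestamp: int    # on-chain block time of the review
    tx_signature: str      # Solana transaction signature for the payment
```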
Preprint
Diebo B, Lonjon G, Dehouche N, Cristini J, Challier V, Lafage V, on behalf of SpineDAO. Spine Reviews: Crowdsourcing Global Spine Expert Knowledge via Digital Ledger Technology. medRxiv 2026. doi: 10.64898/2026.04.11.26350678
Companion paper (synthetic data generation): Challier V, et al. Validated Synthetic Data Generation from a Multicenter Spine Surgery Registry. medRxiv 2026. doi: 10.64898/2026.04.07.26350316