A compute abstraction layer that treats AI inference capacity as a managed resource pool — routed, provisioned, and optimized across the full provider landscape.
Soma is the compute abstraction layer for Neuron Technologies — a platform that treats AI inference capacity as a managed resource pool rather than a static deployment target. The central insight: AI workloads are heterogeneous, bursty, and cost-sensitive. No single provider wins on all dimensions. Soma routes, provisions, and optimizes across the full provider landscape, presenting a unified API surface to the application tier.
AI-native applications require GPU compute that is simultaneously expensive at rest, scarce at peak, and fragmented across providers. Teams make architectural bets on specific clouds, then pay the price: vendor lock-in, idle capacity, or service gaps during demand spikes.
A control plane that knows the real cost, latency, and availability of every attached compute node — and routes requests based on workload tier, cost oracle signals, and live health. Providers become fungible. The router becomes the intelligence.
RunPod, Legion, AWS, Azure, GCP, and bare metal are all first-class node types. Soma treats them identically at the routing layer. Provider-specific adapters handle provisioning; the core stays clean.
Stable contracts (API specs, data schemas) are separated from variable behavior (routing logic) and dynamic state (live cost, availability, active jobs). Changes in one tier cannot break another. This is Volatility-Based Decomposition (VBD) in practice.
Neuron is the operator. Soma exposes structured, machine-readable interfaces at every layer: cost signals, health events, and provisioning APIs. Autonomous operation is the design target, not a bolt-on.
Each component is classified by volatility tier — how frequently its behavior changes under normal operation. Stable components provide durable contracts. Variable components implement logic that evolves with business needs. Dynamic components reflect live system state.
The Soma Router is a deterministic decision engine, not an ML model. Predictability and auditability matter more than marginal optimization gains. Every routing decision is logged with its full decision chain.
| Tier | Criteria | Example | Priority |
|---|---|---|---|
| LOW | Batch, async, non-time-sensitive | Overnight fine-tune eval, bulk captioning | Cost-first |
| MEDIUM | Interactive, <30s SLA | Chat completion, image generation | Balance cost/latency |
| HIGH | Real-time, <2s SLA, user-facing | Live assistant, streaming response | Latency-first |
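A minimal sketch of tier classification, assuming each request declares an SLA in seconds and a user-facing flag. The names `Tier`, `classify_tier`, `sla_seconds`, and `user_facing` are hypothetical; only the 2s and 30s thresholds come from the table above.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    LOW = "cost-first"
    MEDIUM = "balance cost/latency"
    HIGH = "latency-first"


def classify_tier(sla_seconds: Optional[float], user_facing: bool = False) -> Tier:
    """Map a request's declared SLA to a workload tier.

    Assumption: a request with no SLA is batch/async and lands in LOW.
    """
    if sla_seconds is None:
        return Tier.LOW
    # HIGH requires both a sub-2s SLA and a user-facing caller.
    if sla_seconds < 2 and user_facing:
        return Tier.HIGH
    if sla_seconds < 30:
        return Tier.MEDIUM
    return Tier.LOW
```

Because the function is a pure rule with no learned state, the same inputs always produce the same tier, which keeps the decision easy to log and audit.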
The cost oracle is queried on every routing decision. It aggregates continuously changing signals such as spot prices and real-time availability.
Request declares required capabilities (context_length, multimodal, function_calling, language). Router queries Model Catalog for candidates. Capability match is a hard filter — no degraded fallback without explicit permission.
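The hard filter could look roughly like the sketch below. `ModelEntry`, `Requirements`, and `match_candidates` are illustrative names, not Soma's actual catalog schema; the four capability fields mirror the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    context_length: int
    multimodal: bool
    function_calling: bool
    languages: frozenset


@dataclass(frozen=True)
class Requirements:
    context_length: int = 0
    multimodal: bool = False
    function_calling: bool = False
    language: str = "en"


def match_candidates(catalog, req):
    """Return only the models that satisfy every declared capability."""
    return [
        m for m in catalog
        if m.context_length >= req.context_length
        and (m.multimodal or not req.multimodal)
        and (m.function_calling or not req.function_calling)
        and req.language in m.languages
    ]
```

An empty result means no model qualifies. In this sketch the caller decides what happens next; the filter itself never relaxes a requirement.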
Model pinning is supported per-customer. Default policy: latest stable version. Canary deployments route 5% of traffic to new model version before promotion. Rollback is instantaneous (router policy change, no redeployment).
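One way to get a deterministic 5% canary split is to hash the request ID into buckets. The sketch below assumes that approach; the text does not say how Soma actually selects canary traffic, and `pick_version` and its parameters are hypothetical.

```python
import hashlib
from typing import Optional


def pick_version(request_id: str, stable: str, canary: Optional[str] = None,
                 pinned: Optional[str] = None, canary_percent: int = 5) -> str:
    """Choose a model version for one request.

    A customer pin always wins. Otherwise the request ID is hashed into one
    of 100 buckets, and buckets below canary_percent go to the canary.
    """
    if pinned is not None:
        return pinned
    if canary is None:
        return stable
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return canary if bucket < canary_percent else stable
```

Under this scheme, rollback is a policy change only: passing `canary=None` (or `canary_percent=0`) sends all unpinned traffic back to the stable version without redeploying anything.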
If the preferred model is unavailable: try capability-equivalent model on same provider → try same model on different provider → try next-tier model with customer notification → queue with ETA. Fallbacks are audited and surfaced to Observer.
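The fallback chain can be sketched as an ordered list of resolvers that are tried until one returns a target. The step labels and `resolve_with_fallback` are assumptions for illustration. Customer notification and ETA calculation are not modeled here.

```python
from typing import Callable, List, Optional, Tuple

# Each step is (label, resolver); a resolver returns a target or None.
Step = Tuple[str, Callable[[], Optional[str]]]


def resolve_with_fallback(steps: List[Step], audit: list) -> Tuple[str, Optional[str]]:
    """Try each step in order and record every attempt in `audit`."""
    for label, resolver in steps:
        target = resolver()
        audit.append({"step": label, "target": target})
        if target is not None:
            return label, target
    # Nothing was available, so the request is queued.
    audit.append({"step": "queue", "target": None})
    return "queue", None


audit_log: list = []
chain: List[Step] = [
    ("preferred", lambda: None),
    ("equivalent_model_same_provider", lambda: None),
    ("same_model_other_provider", lambda: "runpod/model-x"),
    ("next_tier_model", lambda: "legion/model-y"),
]
outcome = resolve_with_fallback(chain, audit_log)
```

Here `outcome` is the third step's label and target, and `audit_log` holds one entry per attempted step, which is the record that would be surfaced to Observer.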
| Anti-Pattern | Why Avoided | Soma Approach |
|---|---|---|
| Random load balancing | Ignores cost, warm state, GPU class mismatch | Cost-oracle weighted selection |
| ML-based router | Non-auditable, training drift, cold-start irony | Deterministic rule tree, logged decisions |
| Single-provider lock | Outage = full outage; pricing leverage lost | Anti-concentration rule (60% cap per provider) |
| Always-warm everything | Cost explodes; GPU idle waste | Tier-based warm pool: only HIGH tier always warm |
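A simplified sketch of the anti-concentration rule combined with cost-based selection. It picks the cheapest eligible node rather than applying the full cost-oracle weighting, and it makes one assumption the text does not state: the cap is not enforced below `min_total` active jobs, since otherwise the very first job could never be placed. `Node`, `select_node`, and `min_total` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Node:
    node_id: str
    provider: str
    cost_per_hour: float


def select_node(nodes: List[Node], active_jobs: Dict[str, int],
                max_share: float = 0.60, min_total: int = 5) -> Optional[Node]:
    """Pick the cheapest node whose provider stays within the cap.

    active_jobs maps provider name to its current job count. Once the fleet
    would hold at least min_total jobs, a provider is eligible only if its
    share after taking one more job does not exceed max_share.
    """
    total_after = sum(active_jobs.values()) + 1
    if total_after < min_total:
        eligible = list(nodes)
    else:
        eligible = [
            n for n in nodes
            if (active_jobs.get(n.provider, 0) + 1) / total_after <= max_share
        ]
    if not eligible:
        return None
    # Ties break on node_id so the result is deterministic.
    return min(eligible, key=lambda n: (n.cost_per_hour, n.node_id))
```

Returning `None` when every provider is over the cap leaves the caller to queue the job or scale another provider.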
Soma provisions four environment types. Each has a defined resource profile, warm-pool policy, and billing model. Environments are ephemeral by default — they exist to run a workload, then terminate.
User-facing creative workspace. Chat, image generation, real-time feedback loops. Latency-critical — cold starts are unacceptable. Legion is the preferred provider (zero egress, instant start). RunPod H100 as hot failover.
Small tasks, quantized models, cost-optimized throughput. API integrations, automated pipelines, batch API consumers. Accepts up to 15s cold-start penalty. Prefers spot pricing.
Research, fine-tuning, LoRA training, model evaluation. Long-running jobs, max GPU VRAM, cost-tolerant on runtime but optimized on launch. Uses reserved RunPod pods or Legion when idle. The Crucible runs Lorablation and evaluation harnesses.
Customer-dedicated compute with contractual SLAs. Isolated namespaces (compute and secrets). Deployed as a separate node pool partition, with no resource sharing with other environments. Uptime guarantees, dedicated on-call path.
The architecture went through five passes before reaching its final form. Each loop targeted a specific quality dimension. The loops are recorded here for architectural traceability.
Established the ten core components. Initial sketch had the router as a thin proxy and the control plane doing too much. Split the cost oracle into its own dynamic component (it changes continuously — spot prices, real-time availability — and must not be coupled to the more stable control plane contract). Added the Neuron Interface as a first-class component, not an afterthought. Recognized that API Contracts belong in the stable tier as a distinct concern from the Model Catalog.
Applied Volatility-Based Decomposition rigorously. The routing logic (how decisions are made) changes weekly with policy updates — Variable. The node state (which nodes are alive, their current cost) changes continuously — Dynamic. The storage schema and API contracts almost never change — Stable. Identified a violation: the original design coupled the Node Pool (variable — fleet composition) with node state (dynamic). Split these cleanly: the Pool is the fleet definition (variable), the state lives in the Control Plane's live registry (dynamic).
Walked the happy path: request arrives → tier classified → node selected → job runs → artifact stored → result returned. Found two friction points. (1) Cold-start latency is a seam between the Dynamic tier (live node state) and the Variable tier (router wants a warm node that doesn't exist). Resolution: warm-pool policy pushed into the Workload Orchestrator as a proactive pre-warm signal, driven by Observer's predicted load. (2) Model selection had an implicit dependency on Storage Layer for model weights — this creates a tight coupling during routing. Resolution: Model Catalog becomes the stable index, router only touches the catalog, never the storage layer directly.
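The pre-warm signal from the first resolution could be reduced to a sizing rule like the one below. The formula, the function name `warm_pool_target`, and the per-node throughput parameter are all assumptions. The text specifies only that pre-warming is driven by predicted load, that HIGH tier is always warm, and that a minimum warm floor is enforced.

```python
import math


def warm_pool_target(predicted_rps: float, per_node_rps: float, tier: str,
                     warm_floor: int = 1) -> int:
    """Number of nodes to keep warm for a tier, given predicted load.

    Only HIGH tier is held at or above warm_floor; other tiers may scale
    to zero warm nodes when no load is predicted.
    """
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    needed = math.ceil(predicted_rps / per_node_rps)
    if tier == "HIGH":
        return max(warm_floor, needed)
    return needed
```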
Stress-tested failure scenarios. Provider outage: router must detect via health check + reroute within SLA window. Cold-start spikes: accepted as a feature of LOW tier, SLA explicitly excludes start time. Model unavailable: fallback chain defined (same capability, different provider → next-tier model → queue). Cost oracle unavailable: router falls back to cached pricing with staleness flag — HIGH tier proceeds, LOW tier queues. Secrets rotation: zero-downtime rotation via ESO — new secret version injected without pod restart. Added explicit idle-terminate threshold (15min) to prevent runaway costs on abandoned sessions.
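Two of these rules, the cached-pricing fallback and the idle-terminate threshold, can be sketched as follows. `Quote`, `route_with_oracle`, and `should_terminate` are illustrative names. The behavior for MEDIUM tier and for a missing cache is not specified in the text, so queueing in those cases is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass
class Quote:
    price_per_hour: float
    fetched_at: float      # Unix timestamp of when the price was read
    stale: bool = False


def route_with_oracle(tier: str, live: Optional[Quote], cached: Optional[Quote]) -> dict:
    """Decide whether to proceed or queue; `live` is None if the oracle is down."""
    if live is not None:
        return {"action": "proceed", "quote": live}
    if cached is None:
        # Assumption: with no price at all, every tier queues.
        return {"action": "queue", "quote": None}
    flagged = Quote(cached.price_per_hour, cached.fetched_at, stale=True)
    if tier == "HIGH":
        return {"action": "proceed", "quote": flagged}
    # LOW queues per the text; MEDIUM queues here by assumption.
    return {"action": "queue", "quote": flagged}


IDLE_TERMINATE_SECONDS = 15 * 60


def should_terminate(last_activity: float, now: Optional[float] = None) -> bool:
    """True once a session has been idle for the 15-minute threshold."""
    now = time.time() if now is None else now
    return now - last_activity >= IDLE_TERMINATE_SECONDS
```

The staleness flag travels with the quote, so a decision made on cached pricing can be identified later in the routing log.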
Reexamined what Neuron actually needs to run Soma autonomously. Three action categories emerged: Observe (cost events, health events, anomaly alerts — all structured JSON), Decide (routing policy updates, warm-pool size, provider allocation — via Neuron Interface API), and Act (provision/terminate nodes, update model catalog, rotate secrets — through Workload Orchestrator). The key insight: Neuron should not have direct kubectl/API access to provider infrastructure. All actions go through Soma's own APIs — this creates an auditable, reversible action log and prevents runaway automation. Added the constraint: every Neuron-initiated action emits an event back to Observer, closing the loop.
Neuron operates Soma through a structured observe-decide-act loop. It is not given raw infrastructure access — all actions are mediated through Soma's own APIs. This is deliberate: it creates an auditable action log, enforces business rules, and allows human override at any point without needing to understand the underlying infrastructure.
| Action | Via | Guard Rails |
|---|---|---|
| Scale node pool | Workload Orchestrator API | Provider concentration limit; cost budget |
| Update routing policy | Router Policy API | Dry-run first; audit trail |
| Promote model version | Model Catalog API | Canary 5% first; health check gate |
| Adjust warm pool size | Orchestrator Policy API | Minimum warm floor enforced |
| Terminate idle nodes | Workload Orchestrator API | SLA check before termination |
| Alert Will | Email/Axon event | Threshold-gated; no alert spam |
Constraints are architectural, not policy. Neuron's service token has no permissions for these actions, regardless of reasoning.
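The mediation pattern in this section (guard rails checked first, dry-run supported, every attempt reported to Observer) can be sketched as a small gateway. `ActionGateway`, its `execute` signature, and the budget guard in the usage example are hypothetical and do not represent Soma's real API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ActionGateway:
    """Mediates operator actions: check guard rails, act, emit an event."""
    events: List[dict] = field(default_factory=list)

    def execute(self, action: str, params: dict,
                guards: List[Callable[[dict], bool]],
                apply: Callable[[dict], None],
                dry_run: bool = False) -> bool:
        allowed = all(guard(params) for guard in guards)
        if allowed and not dry_run:
            apply(params)
        # Every attempt, allowed or not, is recorded for the Observer.
        self.events.append({
            "action": action,
            "params": params,
            "allowed": allowed,
            "dry_run": dry_run,
        })
        return allowed


# Usage: scale a pool only if a (made-up) cost budget guard passes.
pool = {"size": 2}
gateway = ActionGateway()
within_budget = lambda p: p["size"] * 3.0 <= 20.0
resize = lambda p: pool.update(size=p["size"])

gateway.execute("scale_node_pool", {"size": 4}, [within_budget], resize, dry_run=True)
gateway.execute("scale_node_pool", {"size": 4}, [within_budget], resize)
gateway.execute("scale_node_pool", {"size": 10}, [within_budget], resize)
```

After these three calls the pool size is 4: the dry run changed nothing, the second call applied, and the third was rejected by the guard. All three appear in `gateway.events`.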
Soma's strategic arc is provider consolidation through intelligence. The more workloads flow through Soma, the more cost and routing data accumulates. That data makes the router smarter, the cost oracle more accurate, and the pre-warm predictions more precise. It's a compounding moat built on operational intelligence — not on proprietary models or locked hardware.
The AI compute market is fractured. Teams are individually solving the multi-provider routing problem — badly, in isolation, with no pooled learning. Soma captures that problem at the platform layer. The timing window is 18-24 months before hyperscalers close the gap with purpose-built AI cloud products.
Routing intelligence compounds. Every job through Soma adds to the cost oracle's pricing model and the pre-warm predictor's demand signal. A competitor starting today has zero historical routing data. Soma at 12 months has a dataset no one can replicate without running the same workloads.
Soma manages Neuron Technologies' own compute. Legion + RunPod as initial node pool. Control plane, router, and observer built and validated. Cost savings measured. Neuron operator loop closed. The platform is its own first customer — every failure is free signal.
Trusted beta partners onboarded. Production environment (dedicated node pools) offered. Customer-isolated secrets and billing. The pipeline engine productized — customers bring workloads, Soma routes them. Revenue validates the routing model's cost-optimization claims.
AWS and Azure added to node pool. Multi-region routing. Spot-market optimization producing measurable savings vs. direct cloud spend. Cost oracle's historical dataset begins generating genuine alpha — routing decisions better than any human-tuned policy.
Soma becomes the runtime for the Neuron marketplace. Customers publish AI products; Soma executes them. The workload orchestrator handles multi-tenant isolation at scale. The routing intelligence is now a competitive differentiator that marketplace customers cite when choosing Neuron over raw cloud.
Soma offered as a standalone product — the "AI-native cloud router" for enterprise AI teams. The cost oracle data asset is the product. Competing directly with hyperscaler AI products — not on compute price (they win there), but on cross-cloud intelligence. The moat is the 4 years of routing data and the operator model.
| Competitor | Approach | Soma Advantage |
|---|---|---|
| AWS Bedrock / Azure AI | Single-cloud, lock-in model | Multi-cloud, best-of-breed per workload |
| Replicate / Modal | Serverless inference, no routing intelligence | Tier-aware routing + cost oracle + warm pools |
| Vast.ai / RunPod | Compute marketplace, no orchestration | Orchestration + pipeline + operator loop |
| Custom infra teams | Hand-built per company, no pooled learning | Platform-level intelligence; compounding data moat |
Soma is a bet that compute routing intelligence is a durable differentiator — not a feature that hyperscalers will trivially replicate. The bet holds if: (1) AI workload heterogeneity persists (multi-model, multi-modality, variable SLA), (2) no single provider achieves dominant price/performance across all workload types, and (3) the operational data asset compounds faster than competitors can replicate it. All three conditions appear structurally durable for the next 5 years.