NEURON TECHNOLOGIES · INTERNAL PLANNING · 2025-04
AI-NATIVE CLOUD INFRASTRUCTURE

The Soma Architecture

A compute abstraction layer that treats AI inference capacity as a managed resource pool — routed, provisioned, and optimized across the full provider landscape.

DESIGN PHASE
10 Core Components · 3 Volatility Tiers · 4 Workload Envs · Provider Agnostic
01 // Strategic Overview
The Central Insight

Soma is the compute abstraction layer for Neuron Technologies — a platform that treats AI inference capacity as a managed resource pool rather than a static deployment target. The central insight: AI workloads are heterogeneous, bursty, and cost-sensitive. No single provider wins on all dimensions. Soma routes, provisions, and optimizes across the full provider landscape, presenting a unified API surface to the application tier.

The Core Problem

AI-native applications require GPU compute that is simultaneously: expensive at rest, scarce at peak, and fragmented across providers. Teams make architectural bets on specific clouds, then pay the price — vendor lock-in, idle capacity, or service gaps during demand spikes.

Request arrives → Which provider? → ???
The Soma Answer

A control plane that knows the real cost, latency, and availability of every attached compute node — and routes requests based on workload tier, cost oracle signals, and live health. Providers become fungible. The router becomes the intelligence.

Request arrives → SOMA Router → Optimal node

Design Principles

Design Principle 01
Provider Abstraction

RunPod, Legion, AWS, Azure, GCP, and bare metal are all first-class node types. Soma treats them identically at the routing layer. Provider-specific adapters handle provisioning; the core stays clean.

Design Principle 02
Volatility Isolation

Stable contracts (API specs, data schemas) are separated from variable behavior (routing logic) and dynamic state (live cost, availability, active jobs). Changes in one tier cannot break another. This is Volatility-Based Decomposition (VBD) in practice.

Design Principle 03
AI-First Operation

Neuron is the operator. Soma exposes structured, machine-readable interfaces at every layer: cost signals, health events, provisioning APIs. Autonomous operation is the design target, not a bolt-on.

Vision Codex

SOMA_VISION = "Treat GPU compute like an intelligent power grid"
ROUTING_MODEL = "tier-first, cost-second, latency-third" # deterministic priority stack
PROVIDER_STRATEGY = "no single provider exceeds 60% of active capacity" # anti-concentration rule
WARM_POOL = "always maintain ≥1 warm node per inference type" # cold-start mitigation
COST_TARGET = "autoscale to zero on idle, pre-warm before predicted demand"
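The anti-concentration rule in the codex can be checked mechanically. The following is a minimal sketch, assuming capacity is counted per active node (the document does not say whether the 60% cap applies to nodes, GPUs, or GPU-hours); the function names are hypothetical.

```python
from collections import Counter

MAX_PROVIDER_SHARE = 0.60  # from PROVIDER_STRATEGY in the codex


def provider_shares(active_nodes):
    """Return each provider's fraction of active capacity, counted per node."""
    counts = Counter(active_nodes)
    total = sum(counts.values())
    return {p: n / total for p, n in counts.items()} if total else {}


def violates_concentration(active_nodes, candidate_provider):
    """True if adding one node on candidate_provider would exceed the cap."""
    counts = Counter(active_nodes)
    counts[candidate_provider] += 1
    total = sum(counts.values())
    return counts[candidate_provider] / total > MAX_PROVIDER_SHARE
```

With a very small fleet any single provider trivially exceeds the cap, so a real implementation would likely need a minimum fleet size before the rule is enforced; that exemption is not specified in the source.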
02 // Architecture Diagram
Volatility-Based System Map
[Diagram: volatility-based system map with three bands: Stable Tier, Variable Tier, Dynamic Tier. Boxes shown:]
Neuron Interface: AI operator, autonomous management
Observer: telemetry, cost tracking, anomaly
Cost Oracle: real-time pricing, spot signals
Control Plane: node registry, model catalog, health monitor
Workload Orch.: provision, configure, terminate
Soma Router: tier classify, cost optimize, load balance (LOW / MEDIUM / HIGH)
Inference Services: LLM, image gen, video (SVD)
Pipeline Engine: Pantheon Conductor, 22-step inference pipeline (inherited, battle-tested)
Secrets Layer: Vault, customer isolated
Node Pool: RunPod, Legion, AWS, Azure/GCP, bare metal (status badges in order: WARM, WARM, COLD, PROV., WARM)
Storage Layer: R2/S3 blob, model registry, artifact store
Model Catalog: versioned, capability indexed
API Contracts: stable interfaces, versioned specs
Stable — solid border, versioned contracts
Variable — routing logic, service adapters
Dynamic — live state, cost signals, health
Animated flow — active data paths
Node states: Warm · Cold · Provisioning
03 // Component Reference
The Ten Components

Each component is classified by volatility tier — how frequently its behavior changes under normal operation. Stable components provide durable contracts. Variable components implement logic that evolves with business needs. Dynamic components reflect live system state.

04 // Routing Intelligence
The Decision Engine

The Soma Router is a deterministic decision engine, not an ML model. Predictability and auditability matter more than marginal optimization gains. Every routing decision is logged with its full decision chain.

Tier Classification
Tier | Criteria | Example | Priority
LOW | Batch, async, non-time-sensitive | Overnight fine-tune eval, bulk captioning | Cost-first
MEDIUM | Interactive, <30s SLA | Chat completion, image generation | Balance cost/latency
HIGH | Real-time, <2s SLA, user-facing | Live assistant, streaming response | Latency-first
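The tier criteria above can be read as a small classification function. This is a sketch under assumptions: the request is described by an SLA in seconds (None for batch work with no deadline) and a user-facing flag. Both parameters are hypothetical, and the handling of a sub-2s request that is not user-facing is a guess.

```python
from enum import Enum


class Tier(Enum):
    LOW = "LOW"
    MEDIUM = "MEDIUM"
    HIGH = "HIGH"


def classify_tier(sla_seconds, user_facing):
    """Map a request to a tier using the <2s and <30s thresholds from the table."""
    if sla_seconds is None:
        return Tier.LOW  # batch / async, no deadline
    if sla_seconds < 2 and user_facing:
        return Tier.HIGH  # real-time, user-facing
    if sla_seconds < 30:
        return Tier.MEDIUM  # interactive
    return Tier.LOW  # long deadlines are treated as batch (assumption)
```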
Cost Oracle Signals

The cost oracle is queried on every routing decision. It aggregates:

# Inputs to cost oracle
spot_price: RunPod/AWS real-time bid
committed_idle: Legion always-on cost
marginal_cost: per-token / per-image
queue_depth: wait cost vs. provision cost
warm_bonus: discount for already-warm nodes
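One way to picture these inputs is as a single record that the router reads per decision. The sketch below is illustrative only: the field units, the way warm_bonus is applied as a fractional discount, and the wait-versus-provision comparison are all assumptions, since the document names the signals but not how they combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostSignals:
    spot_price: float      # $/hour, real-time bid (unit assumed)
    committed_idle: float  # $/hour for always-on capacity (unit assumed)
    marginal_cost: float   # $ per token or per image
    queue_depth: int       # jobs waiting ahead on warm nodes
    warm_bonus: float      # fractional discount (0..1) for a warm node (assumed form)


def effective_cost(signals, units, warm):
    """Estimated job cost: marginal cost times units, discounted if the node is warm."""
    base = signals.marginal_cost * units
    return base * (1.0 - signals.warm_bonus) if warm else base


def should_provision(signals, wait_cost_per_job, provision_cost):
    """Provision a new node when the estimated cost of queueing exceeds it."""
    return signals.queue_depth * wait_cost_per_job > provision_cost
```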
Routing Decision Tree
RECEIVE request(model, tier, budget)
CLASSIFY tier → LOW | MEDIUM | HIGH
IF tier == HIGH:
    SELECT lowest-latency warm node
    BYPASS cost oracle (latency wins)
ELIF tier == MEDIUM:
    QUERY cost oracle
    SELECT warm node within budget
    IF no warm node: provision cheapest
ELIF tier == LOW:
    QUERY cost oracle
    SELECT cheapest (warm or cold)
    ACCEPT cold-start latency
CHECK selected node health
IF unhealthy: reroute to next candidate
IF no candidates: emit capacity alert → Neuron
DISPATCH + LOG decision chain
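The decision tree above can be sketched as a small deterministic function. This is an illustration, not Soma's implementation: the Node fields, the string tier values, and the assumption that each node already carries a cost quoted by the cost oracle are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    warm: bool
    healthy: bool
    latency_ms: float
    cost: float  # price quoted by the cost oracle for this request (assumed)


def route(tier, budget, nodes):
    """Return (selected node or None, decision chain) for one request."""
    chain = [f"tier={tier}"]
    if tier == "HIGH":
        # Latency wins: warm nodes only, cost oracle bypassed.
        candidates = sorted((n for n in nodes if n.warm), key=lambda n: n.latency_ms)
        chain.append("bypass cost oracle")
    elif tier == "MEDIUM":
        # Warm nodes within budget first, then the cheapest remaining node.
        candidates = sorted(
            nodes, key=lambda n: (not (n.warm and n.cost <= budget), n.cost)
        )
        chain.append("query cost oracle")
    else:  # LOW: cheapest wins, warm or cold
        candidates = sorted(nodes, key=lambda n: n.cost)
        chain.append("query cost oracle; accept cold start")
    for node in candidates:
        if node.healthy:
            chain.append(f"dispatch {node.name}")
            return node, chain
        chain.append(f"skip unhealthy {node.name}")  # reroute to next candidate
    chain.append("capacity alert")
    return None, chain
```

Returning the chain alongside the node keeps every decision loggable, which matches the auditability goal stated for the router.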
Model Selection Logic
Capability Matching

Request declares required capabilities (context_length, multimodal, function_calling, language). Router queries Model Catalog for candidates. Capability match is a hard filter — no degraded fallback without explicit permission.
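As a sketch of the hard filter, assuming catalog entries expose the four capabilities named above as plain fields (the ModelEntry type and match_models function are hypothetical):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    context_length: int
    multimodal: bool
    function_calling: bool
    languages: frozenset


def match_models(catalog, *, context_length=0, multimodal=False,
                 function_calling=False, language=None):
    """Hard filter: keep only models that meet every declared requirement."""
    return [
        m for m in catalog
        if m.context_length >= context_length
        and (m.multimodal or not multimodal)
        and (m.function_calling or not function_calling)
        and (language is None or language in m.languages)
    ]
```

An empty result means no candidate exists; nothing in this filter degrades a requirement on its own.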

Version Policy

Model pinning is supported per-customer. Default policy: latest stable version. Canary deployments route 5% of traffic to new model version before promotion. Rollback is instantaneous (router policy change, no redeployment).
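A deterministic canary split fits the router's no-ML stance. The sketch below hashes a request identifier into a bucket; hashing is an assumed mechanism, and pick_version and its parameters are hypothetical names.

```python
import hashlib

CANARY_FRACTION = 0.05  # 5% of traffic, per the version policy


def pick_version(request_id, stable, canary=None, pinned=None):
    """Choose a model version: customer pin first, then canary bucket, else stable."""
    if pinned is not None:
        return pinned
    if canary is None:
        return stable  # no canary active; also the state after a rollback
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return canary if bucket < CANARY_FRACTION else stable
```

Under this sketch, rollback amounts to calling the function with canary=None, which is a policy change rather than a redeployment.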

Fallback Chain

If the preferred model is unavailable: try capability-equivalent model on same provider → try same model on different provider → try next-tier model with customer notification → queue with ETA. Fallbacks are audited and surfaced to Observer.
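The chain can be expressed as an ordered plan that is walked until something is available. This is a sketch with hypothetical names; it assumes the next-tier model is tried on the original provider, which the text does not specify.

```python
def fallback_plan(model, provider, equivalents, other_providers, next_tier_model):
    """Yield (step, model, provider) in the order the fallback chain tries them."""
    yield ("preferred", model, provider)
    for alt in equivalents:
        yield ("equivalent-same-provider", alt, provider)
    for other in other_providers:
        yield ("same-model-other-provider", model, other)
    if next_tier_model is not None:
        yield ("next-tier-notify-customer", next_tier_model, provider)


def resolve(plan, is_available, audit_log):
    """Walk the plan, recording every attempt; queue if nothing is available."""
    for step, model, provider in plan:
        ok = is_available(model, provider)
        audit_log.append((step, model, provider, ok))
        if ok:
            return (step, model, provider)
    audit_log.append(("queue-with-eta", None, None, False))
    return ("queue-with-eta", None, None)
```

The audit_log list stands in for whatever stream feeds the Observer.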

Anti-Patterns Explicitly Avoided
Anti-Pattern | Why Avoided | Soma Approach
Random load balancing | Ignores cost, warm state, GPU class mismatch | Cost-oracle weighted selection
ML-based router | Non-auditable, training drift, cold-start irony | Deterministic rule tree, logged decisions
Single-provider lock | Outage = full outage; pricing leverage lost | Anti-concentration rule (60% cap per provider)
Always-warm everything | Cost explodes; GPU idle waste | Tier-based warm pool: only HIGH tier always warm
05 // Workload Environments
Four Environment Types

Soma provisions four environment types. Each has a defined resource profile, warm-pool policy, and billing model. Environments are ephemeral by default — they exist to run a workload, then terminate.

ENV-01 · INTERACTIVE
Studio
Always warm · HIGH tier

User-facing creative workspace. Chat, image generation, real-time feedback loops. Latency-critical — cold starts are unacceptable. Legion is the preferred provider (zero egress, instant start). RunPod H100 as hot failover.

gpu: RTX 4090 or A100
warm_policy: "always 1 warm per active user session"
billing: "per-session, pro-rated to minute"
sla: "P99 < 1s TTFT (time to first token)"
ENV-02 · LIGHTWEIGHT
Mini
On-demand · MEDIUM tier

Small tasks, quantized models, cost-optimized throughput. API integrations, automated pipelines, batch API consumers. Accepts up to 15s cold-start penalty. Prefers spot pricing.

gpu: T4, A10, 3090 class
warm_policy: "1 shared warm node per region"
billing: "per-request, token-metered"
sla: "P95 < 30s total response"
ENV-03 · EXPERIMENTAL
Crucible
Ephemeral · LOW tier

Research, fine-tuning, LoRA training, model evaluation. Long-running jobs, max GPU VRAM, cost-tolerant on runtime but optimized on launch. Uses reserved RunPod pods, or Legion when Legion is idle. The Crucible runs Lorablation and evaluation harnesses.

gpu: H100, H200 (80GB+ VRAM req.)
warm_policy: "cold — provision on demand"
billing: "per-hour, reserved where beneficial"
sla: "best-effort, hours acceptable"
ENV-04 · ENTERPRISE
Production
Dedicated · SLA-bound

Customer-dedicated compute with contractual SLAs. Isolated namespaces (compute and secrets). Deployed as separate node pool partition — no resource sharing with other environments. Uptime guarantees, dedicated on-call path.

gpu: "customer-specified"
warm_policy: "dedicated — always warm"
billing: "monthly reserved + burst overage"
sla: "99.9% uptime, contractual"

Environment Lifecycle

Request Received → Tier Classified → Node Selected / Provisioned → Job Executing → Artifact Stored → Result Delivered → Node Released / Terminated
06 // Design Improvement Loops
Five Refinement Passes

Five passes through the architecture before final form. Each loop targeted a specific quality dimension. Recorded here for architectural traceability.

01

Component Completeness

Established the ten core components. Initial sketch had the router as a thin proxy and the control plane doing too much. Split the cost oracle into its own dynamic component (it changes continuously — spot prices, real-time availability — and must not be coupled to the more stable control plane contract). Added the Neuron Interface as a first-class component, not an afterthought. Recognized that API Contracts belong in the stable tier as a distinct concern from the Model Catalog.

+ Cost Oracle separated from Control Plane · + Neuron Interface promoted to Component 10
02

VBD Volatility Boundaries

Applied Volatility-Based Decomposition rigorously. The routing logic (how decisions are made) changes weekly with policy updates — Variable. The node state (which nodes are alive, their current cost) changes continuously — Dynamic. The storage schema and API contracts almost never change — Stable. Identified a violation: the original design coupled the Node Pool (variable — fleet composition) with node state (dynamic). Split these cleanly: the Pool is the fleet definition (variable), the state lives in the Control Plane's live registry (dynamic).

+ Node Pool (variable) separated from live node state in Control Plane (dynamic)
03

Harmonic Design — Friction Analysis

Walked the happy path: request arrives → tier classified → node selected → job runs → artifact stored → result returned. Found two friction points. (1) Cold-start latency is a seam between the Dynamic tier (live node state) and the Variable tier (router wants a warm node that doesn't exist). Resolution: warm-pool policy pushed into the Workload Orchestrator as a proactive pre-warm signal, driven by Observer's predicted load. (2) Model selection had an implicit dependency on Storage Layer for model weights — this creates a tight coupling during routing. Resolution: Model Catalog becomes the stable index, router only touches the catalog, never the storage layer directly.

+ Pre-warm signal from Observer → Orchestrator · + Model Catalog as stable indirection layer
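The pre-warm signal from this pass can be sketched as simple capacity arithmetic. The sketch assumes the Observer's prediction is a requests-per-second figure and that each warm node has a known throughput; both inputs and both function names are hypothetical.

```python
import math


def target_warm_nodes(predicted_rps, per_node_rps, warm_floor=1):
    """Warm nodes needed for the predicted load, never below the warm floor."""
    if per_node_rps <= 0:
        raise ValueError("per_node_rps must be positive")
    needed = math.ceil(predicted_rps / per_node_rps)
    return max(warm_floor, needed)


def prewarm_signal(current_warm, predicted_rps, per_node_rps, warm_floor=1):
    """Number of nodes the orchestrator should pre-warm (0 if already covered)."""
    target = target_warm_nodes(predicted_rps, per_node_rps, warm_floor)
    return max(0, target - current_warm)
```

The default warm_floor of 1 mirrors the "at least one warm node per inference type" rule in the Vision Codex.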
04

Operational Realism — Failure Modes

Stress-tested failure scenarios. Provider outage: the router must detect it via health check and reroute within the SLA window. Cold-start spikes: accepted as a feature of LOW tier; the SLA explicitly excludes start time. Model unavailable: fallback chain defined (capability-equivalent model on same provider → same model on different provider → next-tier model → queue). Cost oracle unavailable: the router falls back to cached pricing with a staleness flag; HIGH tier proceeds, LOW tier queues. Secrets rotation: zero-downtime rotation via ESO, with the new secret version injected without pod restart. Added an explicit idle-terminate threshold (15 min) to prevent runaway costs on abandoned sessions.

+ Fallback chain defined · + Cost oracle degraded mode · + 15min idle-terminate policy
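The degraded mode and the idle-terminate policy can be sketched together. Assumptions are marked in the comments: the CachedOracle wrapper and its fetch callable are hypothetical, and the behavior for MEDIUM tier on stale pricing is a guess because the text only covers HIGH and LOW.

```python
IDLE_TERMINATE_SECONDS = 15 * 60  # idle-terminate threshold from this pass


class CachedOracle:
    """Wraps a pricing lookup and serves the last known price when it fails."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable: node name -> price; may raise
        self._cache = {}

    def price(self, node):
        """Return (price, stale); stale=True means the value came from cache."""
        try:
            value = self._fetch(node)
        except Exception:
            if node not in self._cache:
                raise  # nothing cached: surface the failure
            return self._cache[node], True
        self._cache[node] = value
        return value, False


def degraded_decision(tier, stale):
    """HIGH proceeds on stale pricing and LOW queues, as stated in the text."""
    if not stale or tier == "HIGH":
        return "proceed"
    if tier == "LOW":
        return "queue"
    return "proceed"  # assumption: MEDIUM behavior is not specified in the source


def should_terminate(idle_seconds):
    """True once a session has been idle for the full threshold."""
    return idle_seconds >= IDLE_TERMINATE_SECONDS
```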
05

AI Operator Interface — Autonomous Management Model

Reexamined what Neuron actually needs to run Soma autonomously. Three action categories emerged: Observe (cost events, health events, anomaly alerts — all structured JSON), Decide (routing policy updates, warm-pool size, provider allocation — via Neuron Interface API), and Act (provision/terminate nodes, update model catalog, rotate secrets — through Workload Orchestrator). The key insight: Neuron should not have direct kubectl/API access to provider infrastructure. All actions go through Soma's own APIs — this creates an auditable, reversible action log and prevents runaway automation. Added the constraint: every Neuron-initiated action emits an event back to Observer, closing the loop.

+ Neuron actions bounded to Soma API · + Action→event loop closes Observer feedback · + Runaway automation prevention
07 // Neuron as Operator
Autonomous Management
NEURON
AI Operator · Soma v1
identity: "Vault service token"
auth_scope: "soma-operator"
action_log: "append-only, audited"
human_override: "always possible"
runaway_guard: "rate limits + event loop"
The Autonomous Management Model

Neuron operates Soma through a structured observe-decide-act loop. It is not given raw infrastructure access — all actions are mediated through Soma's own APIs. This is deliberate: it creates an auditable action log, enforces business rules, and allows human override at any point without needing to understand the underlying infrastructure.

OBSERVE: Read cost events, health alerts, and anomalies from the Observer's structured stream
DECIDE: Apply policy, backlog context, and historical patterns to form an action plan
ACT: Invoke Soma APIs (provision, terminate, update policy, rotate secrets)
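One pass of the loop, with the rate-limit half of the runaway guard, might look like the sketch below. The RateLimiter class, the event shapes, and the injected decide and act callables are all hypothetical; the document only states that rate limits and an event loop exist.

```python
from collections import deque


class RateLimiter:
    """Allows at most max_actions within any sliding window of window_seconds."""

    def __init__(self, max_actions, window_seconds):
        self._max = max_actions
        self._window = window_seconds
        self._times = deque()

    def allow(self, now):
        # Drop timestamps that have aged out of the window.
        while self._times and now - self._times[0] >= self._window:
            self._times.popleft()
        if len(self._times) >= self._max:
            return False
        self._times.append(now)
        return True


def operator_step(events, decide, act, limiter, now):
    """One observe-decide-act pass; returns the events emitted back to Observer."""
    emitted = []
    for action in decide(events):
        if not limiter.allow(now):
            emitted.append({"type": "action_rate_limited", "action": action})
            continue
        emitted.append(act(action))  # act returns the event for this action
    return emitted
```

Every planned action yields an event, whether it ran or was throttled, so the Observer sees the full picture.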
Neuron's Permitted Actions
Action | Via | Guard Rails
Scale node pool | Workload Orchestrator API | Provider concentration limit; cost budget
Update routing policy | Router Policy API | Dry-run first; audit trail
Promote model version | Model Catalog API | Canary 5% first; health check gate
Adjust warm pool size | Orchestrator Policy API | Minimum warm floor enforced
Terminate idle nodes | Workload Orchestrator API | SLA check before termination
Alert Will | Email/Axon event | Threshold-gated; no alert spam
What Neuron Cannot Do (By Design)
Direct kubectl commands
Raw provider API calls
Modify Vault root tokens
Delete customer data
Override SLA contracts
Spend beyond cost ceiling
Bypass action audit log

Constraints are architectural, not policy. Neuron's service token has no permissions for these actions, regardless of reasoning.
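A sketch of what "architectural, not policy" could look like in code: a gateway that only knows the permitted actions and records every attempt before acting. The action identifiers, the OperatorGateway class, and the event shape are hypothetical; in the real system the boundary is the service token's scope rather than an in-process check.

```python
PERMITTED = {
    "scale_node_pool": "Workload Orchestrator API",
    "update_routing_policy": "Router Policy API",
    "promote_model_version": "Model Catalog API",
    "adjust_warm_pool": "Orchestrator Policy API",
    "terminate_idle_nodes": "Workload Orchestrator API",
    "alert": "Email/Axon event",
}


class ActionDenied(Exception):
    """Raised when an action is outside the operator's scope."""


class OperatorGateway:
    def __init__(self):
        self._log = []  # append-only: entries are never edited or removed

    def invoke(self, action, **params):
        """Record the attempt, then execute only if the action is permitted."""
        allowed = action in PERMITTED
        self._log.append({"action": action, "params": params, "allowed": allowed})
        if not allowed:
            raise ActionDenied(action)
        # Event emitted back to the Observer, closing the loop.
        return {"event": "action_executed", "action": action, "via": PERMITTED[action]}

    @property
    def log(self):
        return tuple(self._log)  # read-only view of the audit trail
```

Denied attempts are logged as well, so an out-of-scope request is visible in the audit trail even though it never runs.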

08 // The 5-Year Play
Strategic Arc

Soma's strategic arc is provider consolidation through intelligence. The more workloads flow through Soma, the more cost and routing data accumulates. That data makes the router smarter, the cost oracle more accurate, and the pre-warm predictions more precise. It's a compounding moat built on operational intelligence — not on proprietary models or locked hardware.

Why Now

The AI compute market is fractured. Teams are individually solving the multi-provider routing problem — badly, in isolation, with no pooled learning. Soma captures that problem at the platform layer. The timing window is 18-24 months before hyperscalers close the gap with purpose-built AI cloud products.

The Moat

Routing intelligence compounds. Every job through Soma adds to the cost oracle's pricing model and the pre-warm predictor's demand signal. A competitor starting today has zero historical routing data. Soma at 12 months has a dataset no one can replicate without running the same workloads.

Five-Year Roadmap

2025 — YEAR 1

Internal Proof of Concept

Soma manages Neuron Technologies' own compute. Legion + RunPod as initial node pool. Control plane, router, and observer built and validated. Cost savings measured. Neuron operator loop closed. The platform is its own first customer — every failure is free signal.

Legion + RunPod · Internal only · Neuron as operator
2026 — YEAR 2

First External Customers

Trusted beta partners onboarded. Production environment (dedicated node pools) offered. Customer-isolated secrets and billing. The pipeline engine productized — customers bring workloads, Soma routes them. Revenue validates the routing model's cost-optimization claims.

Beta partners · Production env · Revenue signal
2027 — YEAR 3

Platform Expansion

AWS and Azure added to node pool. Multi-region routing. Spot-market optimization producing measurable savings vs. direct cloud spend. Cost oracle's historical dataset begins generating genuine alpha — routing decisions better than any human-tuned policy.

Multi-cloud · Multi-region · Oracle alpha
2028 — YEAR 4

Marketplace Integration

Soma becomes the runtime for the Neuron marketplace. Customers publish AI products; Soma executes them. The workload orchestrator handles multi-tenant isolation at scale. The routing intelligence is now a competitive differentiator that marketplace customers cite when choosing Neuron over raw cloud.

Marketplace runtime · Multi-tenant scale · Competitive moat
2029 — YEAR 5

Infrastructure as a Platform

Soma offered as a standalone product — the "AI-native cloud router" for enterprise AI teams. The cost oracle data asset is the product. Competing directly with hyperscaler AI products — not on compute price (they win there), but on cross-cloud intelligence. The moat is the 4 years of routing data and the operator model.

Standalone product · Enterprise AI · Data asset moat

Competitive Positioning

Competitor | Approach | Soma Advantage
AWS Bedrock / Azure AI | Single-cloud, lock-in model | Multi-cloud, best-of-breed per workload
Replicate / Modal | Serverless inference, no routing intelligence | Tier-aware routing + cost oracle + warm pools
Vast.ai / RunPod | Compute marketplace, no orchestration | Orchestration + pipeline + operator loop
Custom infra teams | Hand-built per company, no pooled learning | Platform-level intelligence; compounding data moat
The Irreducible Bet

Soma is a bet that compute routing intelligence is a durable differentiator — not a feature that hyperscalers will trivially replicate. The bet holds if: (1) AI workload heterogeneity persists (multi-model, multi-modality, variable SLA), (2) no single provider achieves dominant price/performance across all workload types, and (3) the operational data asset compounds faster than competitors can replicate it. All three conditions appear structurally durable for the next 5 years.