
# CCR Streaming Compressed Output (SCO) — Synthesis

**Project:** Streaming-Compatible LLM Output Compression
**Date:** 2026-04-27
**Basis:** 30 design loops, informed by RosettaEncoder.kt, CompilationEngine.kt, CcrRuntime.kt, CompiledStepPackage

---

## The Core Insight (Will's Framing, Refined)

Will described "gzip that streams." The 30-loop exploration reveals the precise mechanism: it is not gzip (which compresses after the fact), but **LLM-native output encoding via system prompt injection and pre-shared codebook**, with real-time streaming decompression on the client. The model is both content generator and encoder. The client holds the decode key before the first token arrives.

The billed unit is the token. Token cost is incurred at generation time, server-side. The only path to 90% output token reduction is for the model to generate fewer tokens while conveying the same information. This is achievable for CCR-compiled process execution steps. It is not achievable for arbitrary open-ended chat.

---

## The Four Compression Layers

### Layer 0: Schema-First Output Protocol (SFOP)
The highest-value single layer. Each CCR step's CompiledStepPackage includes a ResponseSchema. The model is prompted to respond using pipe-delimited schema fields rather than prose. The client expands fields to structured display or natural language.

```
Model output: ACTION:called_api|RESULT:success_200|NEXT:validate_response
User sees:    Action: called API. Result: success (200). Next: validate response.
```

Gain: **40–60%** on structured CCR step outputs.
Requirement: ResponseSchema in CompiledStepPackage (new field, added during compilation Stage 5).
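
The client-side expansion can be sketched roughly as follows. This is a minimal illustration, not the actual client: `SchemaField`, `parseSfop`, and `renderSfop` are hypothetical names, and the sketch assumes field values never contain `|` and that the first `:` ends the key.

```typescript
// Hypothetical schema entry: maps a wire field name to its display label.
interface SchemaField {
  name: string;  // key as emitted by the model, e.g. "ACTION"
  label: string; // label shown to the user, e.g. "Action"
}

// Split one pipe-delimited line into ordered [key, value] pairs.
// Assumes values never contain "|" and the first ":" ends the key.
function parseSfop(line: string): Array<[string, string]> {
  const pairs: Array<[string, string]> = [];
  const parts = line.split("|");
  for (let i = 0; i < parts.length; i++) {
    const sep = parts[i].indexOf(":");
    if (sep < 0) continue; // sketch only: malformed fields are skipped
    pairs.push([parts[i].slice(0, sep).trim(), parts[i].slice(sep + 1).trim()]);
  }
  return pairs;
}

// Render each field as "Label: value." with underscores shown as spaces.
function renderSfop(line: string, schema: SchemaField[]): string {
  const labels: Record<string, string> = {};
  schema.forEach((f) => { labels[f.name] = f.label; });
  return parseSfop(line)
    .map((pair) => {
      const key = pair[0];
      // Own-property check so keys like "constructor" are not treated as labels.
      const label = Object.prototype.hasOwnProperty.call(labels, key) ? labels[key] : key;
      return label + ": " + pair[1].replace(/_/g, " ") + ".";
    })
    .join(" ");
}

const demoSchema: SchemaField[] = [
  { name: "ACTION", label: "Action" },
  { name: "RESULT", label: "Result" },
  { name: "NEXT", label: "Next" },
];
console.log(renderSfop("ACTION:called_api|RESULT:success_200|NEXT:validate_response", demoSchema));
// prints: Action: called api. Result: success 200. Next: validate response.
```

The sketch's rendering is cruder than the example above (it does not restore "API" or wrap the status code in parentheses); a real renderer would need per-field formatting rules from the ResponseSchema.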

### Layer 1: Static Codebook Substitution (Rosetta-Out)
Rosetta-In inverted. A codebook is compiled from the step's expected output domain at process compilation time. The codebook uses tokenizer-verified codes — strings confirmed to tokenize as a single token in the target model's tokenizer. The model emits codes; the client expands them.

Critical implementation note from Loop 12: **Unicode symbols (Ω, →, ★) typically tokenize as 2-3 tokens in tiktoken, so substituting them for short words saves nothing**. The codebook must be built from ASCII strings pre-verified as single tokens.

Gain: **20–35%** on prose content within schema fields or standalone.
Requirement: `OutputCodebookCompiler` in Soma; tokenizer-aware code selection.

### Layer 2: Semantic Label Back-References
The model assigns labels to concepts it introduces: `«ARCH_DESC: the three-tier caching system uses L1 in-memory, L2 SQLite, and L3 cold storage»`. Later in the same response, instead of restating, it emits `[§ARCH_DESC]`. The streaming decompressor expands this from its growing label index.

Gain: **10–20%** on responses with internal repetition (common in explanatory technical writing).
Requirement: label syntax in system prompt; label index in `DecompressorState`.
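
A minimal sketch of the label index, assuming the `«NAME: text»` / `[§NAME]` syntax shown above and label names restricted to `[A-Z0-9_]` (that restriction is an assumption). `LabelState` and `expandLabels` are hypothetical names, and the sketch works on a complete string rather than a token stream.

```typescript
// Hypothetical slice of DecompressorState: label name -> defined text.
interface LabelState {
  labels: Record<string, string>;
}

// Expand label definitions and back-references in one pass.
// String.replace visits matches left to right, so a definition is
// always registered before any later reference to it is resolved.
function expandLabels(text: string, state: LabelState): string {
  const token = /«([A-Z0-9_]+):\s*([^»]*)»|\[§([A-Z0-9_]+)\]/g;
  return text.replace(
    token,
    (match: string, defName: string | undefined, defText: string | undefined, refName: string | undefined) => {
      if (defName !== undefined && defText !== undefined) {
        state.labels[defName] = defText; // definition: remember it, display only the text
        return defText;
      }
      if (refName !== undefined && Object.prototype.hasOwnProperty.call(state.labels, refName)) {
        return state.labels[refName]; // known back-reference
      }
      return match; // unknown reference: leave the raw literal visible
    }
  );
}
```

Passing the same `LabelState` across calls lets the index grow over a response; a streaming version would additionally have to buffer partial `«…»` and `[§…]` frames.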

### Layer 3: Cross-Step Delta References
For CCR process executions where later steps would repeat earlier step outputs (e.g., a summary step that collates findings), the model instead emits `[Δstep_id]`. The CCR client has the step output in its execution cache — it expands the reference instantly.

This layer has an architectural double-use: **the same delta reference mechanism serves as the generational GC's eviction back-pointer** (Loop 22). The GC does not need a separate reference scheme — `[Δstep_id]` is the pointer to evicted content.

Gain: **15–25%** in summarization-heavy processes.
Requirement: step output cache in CCR client; L2 persistence for cross-session resumption.
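
Delta expansion against the step output cache might look like the sketch below. `CachedStep`, `StepCache`, and `expandDeltas` are hypothetical names; the `untrusted` flag mirrors the trust boundary described under Security Properties, and the step-id character set and the `sanitize` callback are assumptions.

```typescript
// Hypothetical cache entry for one completed step.
interface CachedStep {
  output: string;
  untrusted: boolean; // true when the step processed user-provided content
}

type StepCache = Record<string, CachedStep>;

// Replace every [Δstep_id] with the cached output of that step.
// Unknown ids are left as raw literals; untrusted outputs pass through sanitize().
function expandDeltas(text: string, cache: StepCache, sanitize: (s: string) => string): string {
  return text.replace(/\[Δ([A-Za-z0-9_-]+)\]/g, (match: string, stepId: string) => {
    if (!Object.prototype.hasOwnProperty.call(cache, stepId)) return match;
    const step = cache[stepId];
    return step.untrusted ? sanitize(step.output) : step.output;
  });
}
```

Using a replacer function rather than a replacement string means cached text containing `$1` or `$&` is inserted literally instead of being interpreted as a substitution pattern.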

---

## Combined Compression Model

For CCR structured step execution (the target workload):

| Layers Active | Expected Gain (Prompting) | Expected Gain (Fine-Tuned) |
|---------------|--------------------------|---------------------------|
| None          | 0%                       | 0%                        |
| SFOP only     | 40–60%                   | 55–70%                    |
| SFOP + Codebook | 55–70%                 | 70–82%                    |
| All four layers | 65–80%                 | 80–90%                    |

**The 90% target is real**, but it sits at the top of the fine-tuned range and is scoped to CCR structured outputs. Without fine-tuning, roughly 80% is the realistic ceiling via prompting alone.

---

## The Streaming Guarantee

Every layer is independently streamable with zero lookahead; the only delay is a small bounded buffer while a code frame accumulates:

- **SFOP**: pipe delimiters allow field-by-field rendering as the stream arrives
- **Codebook**: a code frame spans at most 4-6 tokens, so the decompressor holds back only 2-5 tokens before it can expand the frame
- **Semantic labels**: labels are defined before they are referenced (left-to-right generation)
- **Delta references**: prior step outputs are already in the client cache before the current step streams

The user sees text appearing at normal streaming velocity. The only visual difference vs uncompressed streaming is:
1. 2-5 token pause when a code frame is being accumulated (imperceptible at typical latencies)
2. Delta reference expansion appears as a burst of text (requires fake-streaming animation from cache)
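
Field-by-field SFOP rendering can be sketched as a small incremental splitter that emits a field as soon as its closing pipe arrives, regardless of where chunk boundaries fall. `SfopFieldStream` is a hypothetical name, and routing a field with no `:` to a `NOTES` key is an assumption borrowed from the failure contract.

```typescript
// Incrementally splits a pipe-delimited stream into KEY:value fields.
class SfopFieldStream {
  private buffer = "";
  private onField: (key: string, value: string) => void;

  constructor(onField: (key: string, value: string) => void) {
    this.onField = onField;
  }

  // Feed one streamed chunk; emits every field completed by a "|".
  push(chunk: string): void {
    this.buffer += chunk;
    let cut = this.buffer.indexOf("|");
    while (cut >= 0) {
      this.emitField(this.buffer.slice(0, cut));
      this.buffer = this.buffer.slice(cut + 1);
      cut = this.buffer.indexOf("|");
    }
  }

  // Call at stream end: the last field has no trailing "|".
  end(): void {
    if (this.buffer.length > 0) this.emitField(this.buffer);
    this.buffer = "";
  }

  private emitField(field: string): void {
    const sep = field.indexOf(":");
    if (sep < 0) {
      this.onField("NOTES", field); // assumption: keyless text goes to NOTES
    } else {
      this.onField(field.slice(0, sep), field.slice(sep + 1));
    }
  }
}
```

This variant renders whole fields only. To keep token-level streaming inside a field, the value portion could be forwarded as it arrives once the key and `:` have been seen.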

---

## What Changes in the Codebase

### CompilationEngine.kt (Stage 5 — Emit)
Add `compileOutputCodebook()` and `inferResponseSchema()` alongside the existing `compileStepPackage()`. These are called once at compile time and stored in the package.

### CompiledStepPackage.kt
Add three fields:
```kotlin
val outputCodebook: Map<String, String>?,    // null = no codebook (mode 0, NONE)
val outputSchema: ResponseSchema?,           // null = no schema (modes 0 and 1: NONE, CODEBOOK)
val compressionMode: OutputCompressionMode   // NONE, CODEBOOK, HYBRID
```

### CcrRuntime.kt (render function)
Add `RenderMode.COMPRESSED_OUTPUT`. When this mode is used, the render function appends the SCO system prompt injection to the compiled step content before it is sent to Soma.

### Soma (currently empty)
Soma should be designed with SCO as a first-class feature. The SSE protocol emits three event types: `sco-init` (pre-stream, contains codebook + schema), `token` (content), `sco-end` (post-stream, contains compliance metrics). The codebook in `sco-init` is HMAC-signed to prevent tampering.
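
On the client, the three event types could be modelled as a discriminated union with a small reducer, as in the sketch below. Only the event names and their broad contents come from this document; the payload field names (`codebook`, `schema`, `hmac`, `complianceRate`), the `ClientState` shape, and signing over a JSON serialization are all assumptions, and `verify` is an injected placeholder rather than a real HMAC implementation.

```typescript
// Assumed payload shapes for the three SSE event types.
type ScoEvent =
  | { type: "sco-init"; codebook: Record<string, string>; schema: string[]; hmac: string }
  | { type: "token"; text: string }
  | { type: "sco-end"; complianceRate: number };

interface ClientState {
  mode: "PENDING" | "COMPRESSED" | "PASSTHROUGH";
  codebook: Record<string, string>;
  tokens: string[];
  complianceRate: number | null;
}

// Injected verifier: returns true when hmac matches payload under the session key.
type HmacVerifier = (payload: string, hmac: string) => boolean;

function initialState(): ClientState {
  return { mode: "PENDING", codebook: {}, tokens: [], complianceRate: null };
}

// Pure reducer: returns the next state for one incoming event.
function applyEvent(state: ClientState, ev: ScoEvent, verify: HmacVerifier): ClientState {
  switch (ev.type) {
    case "sco-init": {
      const payload = JSON.stringify({ codebook: ev.codebook, schema: ev.schema });
      const ok = verify(payload, ev.hmac);
      return {
        mode: ok ? "COMPRESSED" : "PASSTHROUGH", // tampered codebook: fall back
        codebook: ok ? ev.codebook : {},
        tokens: state.tokens,
        complianceRate: state.complianceRate,
      };
    }
    case "token":
      return {
        mode: state.mode,
        codebook: state.codebook,
        tokens: state.tokens.concat([ev.text]),
        complianceRate: state.complianceRate,
      };
    case "sco-end":
      return {
        mode: state.mode,
        codebook: state.codebook,
        tokens: state.tokens,
        complianceRate: ev.complianceRate,
      };
  }
}
```

A real verifier would compute the HMAC over the exact bytes received and compare in constant time; re-serializing with `JSON.stringify` is only a stand-in here.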

### CCR Client (neuron-agent / TypeScript)
Add `StreamingDecompressor` class. It wraps the SSE token stream, maintains `DecompressorState`, and emits expanded tokens to the display layer. Implementation is ~100-150 lines, no external dependencies.

---

## The Tokenization Problem (Do Not Skip This)

This is the most practically important finding in the 30 loops.

The RosettaEncoder currently uses Unicode symbols (Ω, Θ, Φ, →, ★) in its codebook. These are fine for *input* compression because the LLM reads and interprets them semantically regardless of their token cost. For *output* compression, the model must *generate* the symbols — and Unicode symbols typically tokenize as 2-3 tokens in modern tokenizers. A symbol that costs 2 tokens to generate, replacing a word that costs 2 tokens to generate, achieves exactly zero compression.

**The OutputCodebookCompiler must:**
1. Load the target model's tokenizer (or a pre-computed lookup table)
2. For each candidate code string, verify it tokenizes as exactly 1 token
3. Only include verified single-token codes in the codebook
4. Rank codes by expected frequency × (tokens_saved_per_occurrence - system_prompt_cost_amortized)

This is the key engineering investment that makes the other compression layers valuable. Without it, codebook compression may actively increase token cost.
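
The four steps above can be sketched as a filter-then-rank pass. This is an illustration under stated assumptions, not the OutputCodebookCompiler itself: `Candidate`, `TokenCounter`, and `selectCodes` are hypothetical names, the tokenizer is injected as a plain function rather than a specific library, and `amortizedPromptCost` is assumed to be supplied per candidate.

```typescript
interface Candidate {
  code: string;                // proposed short code, e.g. "ok"
  phrase: string;              // phrase the code stands for
  frequency: number;           // expected occurrences in the output domain
  amortizedPromptCost: number; // system prompt tokens charged per occurrence
}

// Injected tokenizer: returns the token count of a string for the target model.
type TokenCounter = (text: string) => number;

// Keep only single-token codes, score them, and return the best maxEntries.
function selectCodes(
  candidates: Candidate[],
  countTokens: TokenCounter,
  maxEntries: number
): Record<string, string> {
  const ranked = candidates
    .filter((c) => countTokens(c.code) === 1) // steps 2 and 3: verified single tokens only
    .map((c) => {
      const tokensSaved = countTokens(c.phrase) - 1; // the code itself costs 1 token
      return { candidate: c, score: c.frequency * (tokensSaved - c.amortizedPromptCost) };
    })
    .filter((s) => s.score > 0) // drop codes that would increase token cost
    .sort((a, b) => b.score - a.score) // step 4: highest expected savings first
    .slice(0, maxEntries);

  const codebook: Record<string, string> = {};
  ranked.forEach((s) => { codebook[s.candidate.code] = s.candidate.phrase; });
  return codebook;
}
```

Injecting `countTokens` keeps the selection logic independent of any one tokenizer, so the same pass can run against a live tokenizer or a pre-computed lookup table.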

---

## System Prompt Injection Budget

SCO has a cost: the system prompt instructions that teach the model to use compressed output. Break-even analysis:

| Mode | Injection Cost | Break-Even Output Size |
|------|---------------|----------------------|
| SFOP | ~30 tokens | ~60 tokens expected output |
| Codebook | ~40 tokens | ~100 tokens expected output |
| Hybrid | ~55 tokens | ~120 tokens expected output |

**Implementation rule:** CompilationEngine should store an `expectedOutputTokens` estimate in CompiledStepPackage. Soma selects compression mode based on this estimate. Steps expected to produce fewer than 100 tokens use Mode 0 (passthrough). This prevents SCO overhead from exceeding SCO gains on short-output steps.
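
The break-even arithmetic is injection cost divided by the fraction of output tokens saved. The sketch below applies that formula, plus the 100-token passthrough rule, to pick a mode. The injection costs come from the table; the gain fractions are illustrative assumptions (0.5 reproduces the SFOP row's ~60-token break-even, while the other two are rough picks from the layer and combined ranges and do not reproduce the table's break-even figures exactly). `ModeProfile`, `breakEvenTokens`, `netSavings`, and `selectMode` are hypothetical names.

```typescript
type Mode = "NONE" | "SFOP" | "CODEBOOK" | "HYBRID";

interface ModeProfile {
  mode: Mode;
  injectionCost: number; // system prompt tokens added by this mode
  gain: number;          // assumed fraction of output tokens saved, 0..1
}

// Output size at which tokens saved equal the injection cost.
function breakEvenTokens(p: ModeProfile): number {
  return p.injectionCost / p.gain;
}

// Expected tokens saved net of injection cost for one step.
function netSavings(p: ModeProfile, expectedOutputTokens: number): number {
  return expectedOutputTokens * p.gain - p.injectionCost;
}

// Passthrough below minTokens; otherwise the mode with the largest positive net savings.
function selectMode(expectedOutputTokens: number, profiles: ModeProfile[], minTokens: number = 100): Mode {
  if (expectedOutputTokens < minTokens) return "NONE";
  let best: Mode = "NONE";
  let bestSavings = 0;
  for (let i = 0; i < profiles.length; i++) {
    const savings = netSavings(profiles[i], expectedOutputTokens);
    if (savings > bestSavings) {
      best = profiles[i].mode;
      bestSavings = savings;
    }
  }
  return best;
}

// Injection costs from the table above; gains are assumptions for illustration.
const assumedProfiles: ModeProfile[] = [
  { mode: "SFOP", injectionCost: 30, gain: 0.5 },
  { mode: "CODEBOOK", injectionCost: 40, gain: 0.275 },
  { mode: "HYBRID", injectionCost: 55, gain: 0.6 },
];
```

With these assumed gains, SFOP wins for mid-sized outputs and HYBRID overtakes it once the expected output passes roughly 250 tokens; measured compliance rates from Phase 1 would replace the assumed fractions.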

---

## Security Properties

1. **Codebook integrity**: the `sco-init` event HMAC is computed server-side using the session key. Clients verify before initializing the decompressor. A tampered codebook causes verification failure → fall back to passthrough mode.

2. **Delta reference trust boundary**: step outputs from steps that process user-provided content are tagged `untrusted` in the step output cache. `[Δstep_id]` references to untrusted steps are expanded with content sanitization applied (same as standard LLM output sanitization).

3. **Buffer overflow prevention**: the decompressor enforces `MAX_CODE_LENGTH = 128`. Any code frame that reaches this length without a closing delimiter is flushed as raw text. This prevents unbounded buffer growth from malformed streams.

4. **Mode-specific bypasses**: code blocks, LaTeX math, URLs, and non-English content all cause the decompressor to enter `PASSTHROUGH` mode for the affected span. The compression mode selection in CompilationEngine is content-type-aware.

---

## Failure Mode Contract

| Failure | Decompressor Behavior | User Experience |
|---------|----------------------|-----------------|
| Incomplete code at stream end | Flush buffer as raw text | Sees raw code token (acceptable) |
| Unknown code reference | Emit raw code literal | Sees `[§UNKNOWN]` (acceptable) |
| Schema field overflow | Extra content → "NOTES" field | Reads overflow as unstructured note |
| Network interruption mid-stream | Mark step incomplete, do not cache partial | Step is re-executed on resume |
| Model non-compliance | Pass-through unrecognized tokens verbatim | Sees uncompressed natural language |

The system degrades gracefully at every failure point. No failure mode corrupts the display or causes data loss. The worst case is: the user receives slightly more expensive natural language (no compression) instead of compressed output.
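
A minimal sketch of a decompressor that follows this contract for the codebook layer is shown below. It assumes code frames are delimited by `[` and `]` (mirroring the reference syntax used elsewhere in this document) and that `MAX_CODE_LENGTH` is measured in characters; neither is specified above. It covers three rows of the table: unknown codes, incomplete frames at stream end, and the overflow flush from the security section.

```typescript
const MAX_CODE_LENGTH = 128; // value from Security Properties; unit assumed to be characters

class StreamingDecompressor {
  private frame: string | null = null; // text seen after "[" while a frame is open
  private codebook: Record<string, string>;
  private emit: (text: string) => void;

  constructor(codebook: Record<string, string>, emit: (text: string) => void) {
    this.codebook = codebook;
    this.emit = emit;
  }

  // Feed one streamed chunk; plain text is forwarded immediately.
  push(chunk: string): void {
    for (let i = 0; i < chunk.length; i++) {
      const ch = chunk.charAt(i);
      const open = this.frame;
      if (open === null) {
        if (ch === "[") this.frame = "";
        else this.emit(ch);
      } else if (ch === "]") {
        this.frame = null;
        const known = Object.prototype.hasOwnProperty.call(this.codebook, open);
        // Unknown code: emit the raw literal instead of dropping it.
        this.emit(known ? this.codebook[open] : "[" + open + "]");
      } else if (open.length + 1 >= MAX_CODE_LENGTH) {
        // Overflow: flush the unterminated frame as raw text.
        this.emit("[" + open + ch);
        this.frame = null;
      } else {
        this.frame = open + ch;
      }
    }
  }

  // Stream end: flush any incomplete frame as raw text.
  end(): void {
    if (this.frame !== null) {
      this.emit("[" + this.frame);
      this.frame = null;
    }
  }
}
```

Emitting one character at a time keeps the sketch short; a production version would batch plain text between frames to reduce display-layer calls.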

---

## Implementation Priority

**Do first (Phase 1, 2-3 weeks):**
- OutputCodebookCompiler with tokenizer-aware code selection
- CompiledStepPackage schema extension
- Soma SSE protocol with sco-init/sco-end events
- StreamingDecompressor in TypeScript (codebook mode only)
- Wire Rosetta-In into the compilation pipeline (prerequisite, already built)

This delivers 20–35% output token reduction with zero UX change. Use this phase to measure actual compliance rates and validate the architecture in production.

**Do second (Phase 2, 2 weeks):**
- SchemaInferenceEngine: automatically infer ResponseSchema from step definition
- SFOP decompressor mode in StreamingDecompressor
- Structured card UI for schema-field display (optional, can expand to prose)

This delivers 55–70% output token reduction (the SFOP + Codebook row of the combined model). This is where the big gains land.

**Do third (Phase 3, 3 weeks):**
- Semantic label protocol (`«LABEL: …»` / `[§LABEL]`)
- Delta reference protocol ([Δstep_id]) + step output cache
- Compliance monitoring dashboard
- Cross-session decompressor state persistence (L2)

Full SCO v1 spec. 65–80% output token reduction.

**Do last (Phase 4, 4-8 weeks):**
- Collect (uncompressed, compressed) training pairs from Phase 1-3 instrumentation
- Fine-tune a base model on CCR compressed outputs
- Deploy as Soma endpoint option, A/B test compliance rates

This is the path to the 90% target.

---

## Five Patent Claims

1. **Streaming-compatible codebook output compression**: LLM generates a pre-shared codebook-encoded token stream; client decompresses in real time with zero lookahead. Distinct from prior art (LLMLingua: input-side; Brotli: byte-level; DeepMind compression: requires receiver-side LLM).

2. **Compilation-time schema inference for compressed step outputs**: response schema derived automatically from process step definitions at compile time, embedded in compiled step package, injected at inference time. Distinct from OpenAI JSON mode (hand-authored schemas, no compilation-time inference).

3. **Cross-step delta compression in multi-inference agent execution**: model references prior step outputs via delta pointers in its current response; streaming decompressor resolves pointers from execution cache. Novel: delta compression across multiple inference calls within one execution context.

4. **Delta references as GC back-pointer mechanism**: the output compression delta reference scheme (`[Δstep_id]`) doubles as the generational GC's eviction pointer, enabling near-lossless context eviction without separate reference machinery.

5. **Tokenizer-aware codebook compilation**: codebook codes are selected at compile time by verifying they tokenize as single tokens in the target model's tokenizer, maximizing compression ratio per token of system prompt overhead. Novel: incorporating the tokenizer into the compilation pipeline for output optimization.

---

## What This Is, Precisely

SCO is a **session-level compression protocol** between the CCR inference server (Soma) and the CCR client, where:
- The **model is the encoder** (prompted to emit compressed output)
- The **client is the decoder** (streaming decompressor with pre-shared state)
- The **CCR compilation pipeline** builds the encoding artifacts (codebook, schema) at compile time
- The **execution layer** manages the dynamic state (label index, delta cache)

It extends the CCR's existing compilation-and-execute model in a natural direction: the compilation pipeline already produces optimized input context (Rosetta-In); SCO extends it to produce optimized output encoding instructions. The same compiled artifact (LinkedProcess → CompiledStepPackage) that governs what the model receives now also governs how it responds.

This is the JVM analogy completing its circle: not just compiling *programs* for the agent to execute, but compiling the *protocol* through which the agent communicates its results.