# CCR Streaming Compressed Output (SCO) — Synthesis

**Project:** Streaming-Compatible LLM Output Compression
**Date:** 2026-04-27
**Basis:** 30 design loops, informed by RosettaEncoder.kt, CompilationEngine.kt, CcrRuntime.kt, CompiledStepPackage

---

## The Core Insight (Will's Framing, Refined)

Will described "gzip that streams." The 30-loop exploration reveals the precise mechanism: it is not gzip (which compresses after the fact), but **LLM-native output encoding via system prompt injection and pre-shared codebook**, with real-time streaming decompression on the client. The model is both content generator and encoder. The client holds the decode key before the first token arrives.

The billed unit is the token. Token cost is incurred at generation time, server-side. The only path to 90% output token reduction is for the model to generate fewer tokens while conveying the same information. This is achievable for CCR-compiled process execution steps. It is not achievable for arbitrary open-ended chat.

---

## The Four Compression Layers

### Layer 0: Schema-First Output Protocol (SFOP)
The highest-value single layer. Each CCR step's CompiledStepPackage includes a ResponseSchema. The model is prompted to respond using pipe-delimited schema fields rather than prose. The client expands fields to structured display or natural language.

```
Model output: ACTION:called_api|RESULT:success_200|NEXT:validate_response
User sees: Action: called API. Result: success (200). Next: validate response.
```

Gain: **40–60%** on structured CCR step outputs.
Requirement: ResponseSchema in CompiledStepPackage (new field, added during compilation Stage 5).

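A minimal TypeScript sketch of how a client might render SFOP output field by field as it streams. The `KEY:value|KEY:value` syntax comes from the example above; the names `SfopStreamRenderer`, `parseSfopField`, and `humanize`, and the choice to file keyless segments under `NOTES`, are assumptions for illustration. The expansion is deliberately naive (underscores become spaces), so it does not reproduce the polished wording in the example; a real client would presumably use per-field templates derived from the ResponseSchema.

```typescript
interface SfopField {
  key: string;
  value: string;
}

// Parse one "KEY:value" segment. Only the first ':' separates key from value.
function parseSfopField(segment: string): SfopField {
  const idx = segment.indexOf(":");
  if (idx < 0) {
    // Assumed fallback: a segment without a key is treated as a free-text note.
    return { key: "NOTES", value: segment };
  }
  return { key: segment.substring(0, idx), value: segment.substring(idx + 1) };
}

// Naive expansion: "ACTION" -> "Action", "called_api" -> "called api".
function humanize(field: SfopField): string {
  const key =
    field.key.charAt(0).toUpperCase() + field.key.substring(1).toLowerCase();
  const value = field.value.replace(/_/g, " ");
  return key + ": " + value + ".";
}

// Buffers streamed text until a '|' arrives, then emits the finished field.
class SfopStreamRenderer {
  private buffer = "";

  // Feed one streamed chunk; returns the display strings it completed.
  push(chunk: string): string[] {
    this.buffer += chunk;
    const parts = this.buffer.split("|");
    // The last part may still be incomplete, so it stays in the buffer.
    this.buffer = parts.pop() as string;
    return parts.map(p => humanize(parseSfopField(p)));
  }

  // Call at stream end to flush the final field, which has no trailing '|'.
  end(): string[] {
    const rest = this.buffer;
    this.buffer = "";
    return rest.length > 0 ? [humanize(parseSfopField(rest))] : [];
  }
}
```

Feeding the chunks `ACTION:called_api|RES` and `ULT:success_200|NEXT:validate_response` yields `Action: called api.` from the first chunk and `Result: success 200.` from the second; `end()` then flushes `Next: validate response.`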
### Layer 1: Static Codebook Substitution (Rosetta-Out)
Rosetta-In inverted. A codebook is compiled from the step's expected output domain at process compilation time. The codebook uses tokenizer-verified codes — strings confirmed to tokenize as a single token in the target model's tokenizer. The model emits codes; the client expands them.

Critical implementation note from Loop 12: **Unicode symbols (Ω, →, ★) tokenize as 2-3 tokens in tiktoken — they save nothing**. The codebook must be built from ASCII strings pre-verified as single tokens.

Gain: **20–35%** on prose content within schema fields or standalone.
Requirement: `OutputCodebookCompiler` in Soma; tokenizer-aware code selection.

### Layer 2: Semantic Label Back-References
The model assigns labels to concepts it introduces: `«ARCH_DESC: the three-tier caching system uses L1 in-memory, L2 SQLite, and L3 cold storage»`. Later in the same response, instead of restating, it emits `[§ARCH_DESC]`. The streaming decompressor expands this from its growing label index.

Gain: **10–20%** on responses with internal repetition (common in explanatory technical writing).
Requirement: label syntax in system prompt; label index in `DecompressorState`.

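A rough TypeScript sketch of the label index, assuming the `«LABEL: text»` and `[§LABEL]` syntax shown above. The function name `expandLabels` and the `LabelIndex` type are hypothetical. For brevity it processes text that already contains whole frames; a streaming version would additionally hold back a partially received `«…»` or `[§…]` frame until it closes. The index is passed in so it can persist across calls.

```typescript
type LabelIndex = Record<string, string>;

// Expands label definitions and back-references, updating the index in place.
function expandLabels(text: string, index: LabelIndex): string {
  const pattern = /«([A-Z0-9_]+):\s*([^»]*)»|\[§([A-Z0-9_]+)\]/g;
  return text.replace(
    pattern,
    (
      match: string,
      defLabel: string | undefined,
      defBody: string | undefined,
      refLabel: string | undefined
    ): string => {
      if (defLabel !== undefined && defBody !== undefined) {
        // Definition: remember the body and display it without the markers.
        index[defLabel] = defBody;
        return defBody;
      }
      if (
        refLabel !== undefined &&
        Object.prototype.hasOwnProperty.call(index, refLabel)
      ) {
        // Known back-reference: expand from the index.
        return index[refLabel] as string;
      }
      // Unknown reference: emit the raw literal unchanged.
      return match;
    }
  );
}
```

Because generation is left to right, a definition is always seen before any reference to it, so a single pass is enough.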
### Layer 3: Cross-Step Delta References
For CCR process executions where later steps would repeat earlier step outputs (e.g., a summary step that collates findings), the model instead emits `[Δstep_id]`. The CCR client has the step output in its execution cache — it expands the reference instantly.

This layer has an architectural double-use: **the same delta reference mechanism serves as the generational GC's eviction back-pointer** (Loop 22). The GC does not need a separate reference scheme — `[Δstep_id]` is the pointer to evicted content.

Gain: **15–25%** in summarization-heavy processes.
Requirement: step output cache in CCR client; L2 persistence for cross-session resumption.

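A small TypeScript sketch of delta-reference expansion against a step output cache. It also applies the `untrusted` tagging described under Security Properties. The names `CachedStepOutput`, `StepCache`, and `expandDeltaRefs` are hypothetical, the set of characters allowed in a step id is an assumption, and `sanitize` stands in for whatever output sanitizer the client already uses. Unknown step ids are left as raw literals, consistent with the failure mode contract.

```typescript
interface CachedStepOutput {
  text: string;
  untrusted: boolean; // true when the step processed user-provided content
}

type StepCache = Record<string, CachedStepOutput>;

// Replaces every [Δstep_id] with the cached output of that step.
function expandDeltaRefs(
  text: string,
  cache: StepCache,
  sanitize: (s: string) => string
): string {
  return text.replace(
    /\[Δ([A-Za-z0-9_\-]+)\]/g,
    (match: string, stepId: string): string => {
      if (!Object.prototype.hasOwnProperty.call(cache, stepId)) {
        return match; // unknown step: leave the raw reference visible
      }
      const entry = cache[stepId] as CachedStepOutput;
      // Untrusted outputs go through sanitization before display.
      return entry.untrusted ? sanitize(entry.text) : entry.text;
    }
  );
}
```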

---

## Combined Compression Model

For CCR structured step execution (the target workload):

| Layers Active | Expected Gain (Prompting) | Expected Gain (Fine-Tuned) |
|---------------|--------------------------|---------------------------|
| None | 0% | 0% |
| SFOP only | 40–60% | 55–70% |
| SFOP + Codebook | 55–70% | 70–82% |
| All four layers | 65–80% | 80–90% |

**The 90% target is real**, but only at the top of the fine-tuned range and only for CCR structured outputs. With prompting alone, 75–80% is the realistic ceiling.

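The combined rows are lower than the sum of the per-layer gains because each layer can only compress what the earlier layers left behind. As a rough sanity check, assume (this is an assumption, not a result from the design loops) that the layers act independently and that layer $i$ removes a fraction $g_i$ of the tokens remaining after the previous layers. Taking the low and high ends of the four per-layer prompting ranges then gives roughly 63% and 84%, in the same neighborhood as the 65–80% row above:

$$
\begin{aligned}
G_{\text{combined}} &= 1 - \prod_{i=0}^{3} (1 - g_i) \\
G_{\text{low}} &= 1 - (0.60)(0.80)(0.90)(0.85) \approx 0.63 \\
G_{\text{high}} &= 1 - (0.40)(0.65)(0.80)(0.75) \approx 0.84
\end{aligned}
$$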

---

## The Streaming Guarantee

Every layer is independently streamable with zero lookahead:

- **SFOP**: pipe delimiters allow field-by-field rendering as the stream arrives
- **Codebook**: code frames are at most 4-6 tokens, so the decompressor buffers at most 2-5 tokens before it can expand a code
- **Semantic labels**: labels are defined before they are referenced (left-to-right generation)
- **Delta references**: prior step outputs are already in the client cache before the current step streams

The user sees text appearing at normal streaming velocity. The only visible differences from uncompressed streaming are:

1. A 2-5 token pause while a code frame accumulates (imperceptible at typical latencies)
2. Delta reference expansion arrives as a single burst of text, so the client needs a simulated-streaming animation when replaying it from cache

---

## What Changes in the Codebase

### CompilationEngine.kt (Stage 5 — Emit)
Add `compileOutputCodebook()` and `inferResponseSchema()` alongside the existing `compileStepPackage()`. These are called once at compile time and stored in the package.

### CompiledStepPackage.kt
Add three fields:
```kotlin
val outputCodebook: Map<String, String>?,   // null = no codebook (mode 0)
val outputSchema: ResponseSchema?,          // null = no schema (modes 0 and 1)
val compressionMode: OutputCompressionMode  // NONE, CODEBOOK, HYBRID
```

### CcrRuntime.kt (render function)
Add `RenderMode.COMPRESSED_OUTPUT`. When this mode is used, the render function appends the SCO system prompt injection to the compiled step content before it is sent to Soma.

### Soma (currently empty)
Soma should be designed with SCO as a first-class feature. The SSE protocol emits three event types: `sco-init` (pre-stream, contains codebook + schema), `token` (content), `sco-end` (post-stream, contains compliance metrics). The codebook in `sco-init` is HMAC-signed to prevent tampering.

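One possible client-side shape for the three SSE event types, sketched in TypeScript. Only the event names and their rough contents come from the description above; every field name (`codebook`, `schema`, `hmac`, `text`, `complianceRate`), the `SessionState` type, and the helper functions are assumptions. HMAC verification is injected as a callback rather than implemented, and a failed check falls back to passthrough mode as described under Security Properties.

```typescript
interface ScoInitEvent {
  type: "sco-init";
  codebook: Record<string, string>;
  schema: string[];
  hmac: string;
}
interface TokenEvent {
  type: "token";
  text: string;
}
interface ScoEndEvent {
  type: "sco-end";
  complianceRate: number;
}
type ScoEvent = ScoInitEvent | TokenEvent | ScoEndEvent;

interface SessionState {
  mode: "PASSTHROUGH" | "COMPRESSED";
  codebook: Record<string, string>;
  received: string[];
  complianceRate: number | null;
}

function newSession(): SessionState {
  return { mode: "PASSTHROUGH", codebook: {}, received: [], complianceRate: null };
}

// Pure reducer: returns the next session state for one incoming event.
function handleEvent(
  state: SessionState,
  event: ScoEvent,
  verifyHmac: (init: ScoInitEvent) => boolean
): SessionState {
  if (event.type === "sco-init") {
    if (verifyHmac(event)) {
      return { ...state, mode: "COMPRESSED", codebook: event.codebook };
    }
    // Tampered or unverifiable codebook: fall back to passthrough.
    return { ...state, mode: "PASSTHROUGH", codebook: {} };
  }
  if (event.type === "token") {
    return { ...state, received: state.received.concat([event.text]) };
  }
  // Remaining case is "sco-end": record the compliance metric.
  return { ...state, complianceRate: event.complianceRate };
}
```

Keeping the handler a pure function of `(state, event)` makes the fallback path easy to test without a live SSE connection.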
### CCR Client (neuron-agent / TypeScript)
Add `StreamingDecompressor` class. It wraps the SSE token stream, maintains `DecompressorState`, and emits expanded tokens to the display layer. Implementation is ~100-150 lines, no external dependencies.

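A compact sketch of what a codebook-only `StreamingDecompressor` could look like. The class and `DecompressorState` names and the `MAX_CODE_LENGTH = 128` limit come from this document; the `{` and `}` frame delimiters, the character-based length check, and the shape of the state are assumptions, since the frame syntax is not specified here. It follows the failure mode contract: unknown codes and unterminated frames are emitted as raw text.

```typescript
const MAX_CODE_LENGTH = 128;
const OPEN = "{"; // hypothetical frame delimiters
const CLOSE = "}";

interface DecompressorState {
  codebook: Record<string, string>;
  frame: string | null; // text buffered since an unclosed OPEN, or null
}

class StreamingDecompressor {
  private state: DecompressorState;

  constructor(codebook: Record<string, string>) {
    this.state = { codebook: codebook, frame: null };
  }

  // Feed one streamed chunk; returns the text that is ready to display.
  push(chunk: string): string {
    let out = "";
    for (let i = 0; i < chunk.length; i++) {
      const ch = chunk.charAt(i);
      const frame = this.state.frame;
      if (frame === null) {
        if (ch === OPEN) {
          this.state.frame = ""; // start buffering a code frame
        } else {
          out += ch; // ordinary text passes straight through
        }
      } else if (ch === CLOSE) {
        out += this.lookup(frame);
        this.state.frame = null;
      } else if (frame.length + 1 >= MAX_CODE_LENGTH) {
        // Runaway frame: flush it as raw text and leave frame mode.
        out += OPEN + frame + ch;
        this.state.frame = null;
      } else {
        this.state.frame = frame + ch;
      }
    }
    return out;
  }

  // Call at stream end: an unclosed frame is flushed as raw text.
  end(): string {
    const frame = this.state.frame;
    this.state.frame = null;
    return frame === null ? "" : OPEN + frame;
  }

  private lookup(code: string): string {
    const book = this.state.codebook;
    return Object.prototype.hasOwnProperty.call(book, code)
      ? (book[code] as string)
      : OPEN + code + CLOSE; // unknown code: show the raw literal
  }
}
```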

---

## The Tokenization Problem (Do Not Skip This)

This is the most practically important finding in the 30 loops.

The RosettaEncoder currently uses Unicode symbols (Ω, Θ, Φ, →, ★) in its codebook. These are fine for *input* compression because the LLM reads and interprets them semantically regardless of their token cost. For *output* compression, the model must *generate* the symbols — and Unicode symbols typically tokenize as 2-3 tokens in modern tokenizers. A symbol that costs 2 tokens to generate, replacing a word that costs 2 tokens to generate, achieves exactly zero compression.

**The OutputCodebookCompiler must:**
1. Load the target model's tokenizer (or a pre-computed lookup table)
2. For each candidate code string, verify it tokenizes as exactly 1 token
3. Only include verified single-token codes in the codebook
4. Rank codes by expected frequency × (`tokens_saved_per_occurrence` - `system_prompt_cost_amortized`)

This is the key engineering investment that makes the other compression layers valuable. Without it, codebook compression may actively increase token cost.

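The four steps above could be sketched as follows. This is written in TypeScript for consistency with the other examples, although the compiler itself is described as living in Soma. All names are hypothetical. `countTokens` stands in for the target model's tokenizer or a precomputed lookup table and is injected rather than implemented, and dropping codes whose score is not positive is an added assumption in the spirit of the warning that a bad codebook can increase token cost.

```typescript
interface Candidate {
  code: string; // short string the model would emit
  phrase: string; // text the client expands the code to
  expectedFrequency: number; // expected occurrences per response
}

interface RankedCode extends Candidate {
  score: number;
}

function compileOutputCodebook(
  candidates: Candidate[],
  countTokens: (s: string) => number, // step 1: tokenizer or lookup table
  amortizedPromptCost: number // system prompt tokens per code, per occurrence
): RankedCode[] {
  const ranked: RankedCode[] = [];
  for (const c of candidates) {
    // Steps 2-3: keep only codes that cost exactly one token to generate.
    if (countTokens(c.code) !== 1) {
      continue;
    }
    const tokensSaved = countTokens(c.phrase) - 1;
    // Step 4: expected frequency x (tokens saved - amortized prompt cost).
    const score = c.expectedFrequency * (tokensSaved - amortizedPromptCost);
    if (score > 0) {
      ranked.push({
        code: c.code,
        phrase: c.phrase,
        expectedFrequency: c.expectedFrequency,
        score: score
      });
    }
  }
  // Highest expected saving first.
  return ranked.sort((a, b) => b.score - a.score);
}
```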

---

## System Prompt Injection Budget

SCO has a cost: the system prompt instructions that teach the model to use compressed output. Break-even analysis:

| Mode | Injection Cost | Break-Even Output Size |
|------|---------------|----------------------|
| SFOP | ~30 tokens | ~60 tokens expected output |
| Codebook | ~40 tokens | ~100 tokens expected output |
| Hybrid | ~55 tokens | ~120 tokens expected output |

**Implementation rule:** CompilationEngine should store an `expectedOutputTokens` estimate in CompiledStepPackage. Soma selects compression mode based on this estimate. Steps expected to produce fewer than 100 tokens use Mode 0 (passthrough). This prevents SCO overhead from exceeding SCO gains on short-output steps.

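A minimal sketch of the selection rule, again in TypeScript for consistency with the other examples even though the decision is described as Soma's. The 100-token passthrough cutoff is the rule stated above and the mode names mirror `OutputCompressionMode`. The rest is one possible reading of the break-even table, namely that a step uses the richest mode whose break-even size it clears; that reading and the function name are assumptions.

```typescript
type OutputCompressionMode = "NONE" | "CODEBOOK" | "HYBRID";

// Break-even output sizes, in tokens, taken from the table above.
const CODEBOOK_BREAK_EVEN = 100;
const HYBRID_BREAK_EVEN = 120;

function selectCompressionMode(
  expectedOutputTokens: number
): OutputCompressionMode {
  if (expectedOutputTokens < CODEBOOK_BREAK_EVEN) {
    return "NONE"; // Mode 0: injection overhead would exceed the gain
  }
  if (expectedOutputTokens < HYBRID_BREAK_EVEN) {
    return "CODEBOOK";
  }
  return "HYBRID";
}
```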

---

## Security Properties

1. **Codebook integrity**: the `sco-init` event HMAC is computed server-side using the session key. Clients verify before initializing the decompressor. A tampered codebook causes verification failure → fall back to passthrough mode.

2. **Delta reference trust boundary**: step outputs from steps that process user-provided content are tagged `untrusted` in the step output cache. `[Δstep_id]` references to untrusted steps are expanded with content sanitization applied (same as standard LLM output sanitization).

3. **Buffer overflow prevention**: the decompressor enforces `MAX_CODE_LENGTH = 128`. Any code frame that reaches this length without a closing delimiter is flushed as raw text. This prevents unbounded buffer growth from malformed streams.

4. **Mode-specific bypasses**: code blocks, LaTeX math, URLs, and non-English content all cause the decompressor to enter `PASSTHROUGH` mode for the affected span. The compression mode selection in CompilationEngine is content-type-aware.

---

## Failure Mode Contract

| Failure | Decompressor Behavior | User Experience |
|---------|----------------------|-----------------|
| Incomplete code at stream end | Flush buffer as raw text | Sees raw code token (acceptable) |
| Unknown code reference | Emit raw code literal | Sees `[§UNKNOWN]` (acceptable) |
| Schema field overflow | Extra content → "NOTES" field | Reads overflow as unstructured note |
| Network interruption mid-stream | Mark step incomplete, do not cache partial | Step is re-executed on resume |
| Model non-compliance | Pass-through unrecognized tokens verbatim | Sees uncompressed natural language |

The system degrades gracefully at every failure point. No failure mode corrupts the display or causes data loss. The worst case is: the user receives slightly more expensive natural language (no compression) instead of compressed output.

---

## Implementation Priority

**Do first (Phase 1, 2-3 weeks):**
- OutputCodebookCompiler with tokenizer-aware code selection
- CompiledStepPackage schema extension
- Soma SSE protocol with sco-init/sco-end events
- StreamingDecompressor in TypeScript (codebook mode only)
- Wire Rosetta-In into compilation pipeline (prerequisite, already built)

This delivers 20–35% output token reduction with zero UX change. Use this phase to measure actual compliance rates and validate the architecture in production.

**Do second (Phase 2, 2 weeks):**
- SchemaInferenceEngine: automatically infer ResponseSchema from step definition
- SFOP decompressor mode in StreamingDecompressor
- Structured card UI for schema-field display (optional, can expand to prose)

This delivers 50–65% output token reduction. This is where the largest gains come from.

**Do third (Phase 3, 3 weeks):**
- Semantic label protocol (↦LABEL / [§LABEL])
- Delta reference protocol ([Δstep_id]) + step output cache
- Compliance monitoring dashboard
- Cross-session decompressor state persistence (L2)

This completes the SCO v1 spec and delivers 65–80% output token reduction.

**Do last (Phase 4, 4-8 weeks):**
- Collect (uncompressed, compressed) training pairs from Phase 1-3 instrumentation
- Fine-tune a base model on CCR compressed outputs
- Deploy as Soma endpoint option, A/B test compliance rates

This is the path to the 90% target.

---

## Five Patent Claims

1. **Streaming-compatible codebook output compression**: LLM generates a pre-shared codebook-encoded token stream; client decompresses in real time with zero lookahead. Distinct from prior art (LLMLingua: input-side; Brotli: byte-level; DeepMind compression: requires receiver-side LLM).

2. **Compilation-time schema inference for compressed step outputs**: response schema derived automatically from process step definitions at compile time, embedded in compiled step package, injected at inference time. Distinct from OpenAI JSON mode (hand-authored schemas, no compilation-time inference).

3. **Cross-step delta compression in multi-inference agent execution**: model references prior step outputs via delta pointers in its current response; streaming decompressor resolves pointers from execution cache. Novel: delta compression across multiple inference calls within one execution context.

4. **Delta references as GC back-pointer mechanism**: the output compression delta reference scheme (`[Δstep_id]`) doubles as the generational GC's eviction pointer, enabling near-lossless context eviction without separate reference machinery.

5. **Tokenizer-aware codebook compilation**: codebook codes are selected at compile time by verifying they tokenize as single tokens in the target model's tokenizer, maximizing compression ratio per token of system prompt overhead. Novel: incorporating the tokenizer into the compilation pipeline for output optimization.

---

## What This Is, Precisely

SCO is a **session-level compression protocol** between the CCR inference server (Soma) and the CCR client, where:
- The **model is the encoder** (prompted to emit compressed output)
- The **client is the decoder** (streaming decompressor with pre-shared state)
- The **CCR compilation pipeline** builds the encoding artifacts (codebook, schema) at compile time
- The **execution layer** manages the dynamic state (label index, delta cache)

It extends the CCR's existing compile-and-execute model in a natural direction: the compilation pipeline already produces optimized input context (Rosetta-In); SCO extends it to produce optimized output encoding instructions. The same compiled artifact (LinkedProcess → CompiledStepPackage) that governs what the model receives now also governs how it responds.

This is the JVM analogy completing its circle: not just compiling *programs* for the agent to execute, but compiling the *protocol* through which the agent communicates its results.