Streaming-Compatible Compressed Output

The model
is the encoder.

SCO is a session-level compression protocol that directs the inference model itself to emit compact encoded output. The client decompresses in real-time as tokens arrive, without any modification to inference infrastructure.

"65–80% output token reduction. Zero latency overhead. Fully backward-compatible."

Output token reduction
100 tokens emitted (vs. baseline)
Animating to ~25 tokens — equivalent output, 4× fewer tokens billed.
Centerpiece

Live Streaming Demo

Watch the compressed token stream arrive on the left and the decompressed output materialize on the right. The compression ratio updates in real time as each token is processed.

Ratio: —
Encoded stream — what the model generates Tokens: 0
Expanded output — what you see Equivalent tokens: 0
Architecture

The Four Compression Layers

SCO stacks four independent compression techniques. Each layer compounds the gains of the prior layers. Used together, they achieve 65–80% token reduction using only prompt engineering.

Layer 0

Schema-First Output Protocol

Instead of prose, the model emits pipe-delimited schema fields. ACTION:called_api|RESULT:success_200|NEXT:validate is fully parseable and expands to a readable sentence at zero streaming overhead. The schema is negotiated in the sco-init handshake.

Token gain0%
40–60% reduction
Interactive example
Model emits
ACTION:validated_schema|RESULT:pass_3of3|ISSUES:none|NEXT:deploy_stage
Client expands
Click Run to animate
Layer 1

Codebook Substitution

A pre-shared codebook maps single-token codes to common phrases. [fn]function, [ret]returns. Critically, each code must be verified as a single token in the target tokenizer — Unicode symbols silently fail this requirement.

Token gain0%
20–35% reduction
Hover words to reveal codes
Encoded
The [fn] validate([ret] cfg) uses [§ARCH] lookup.
Expanded (hover to inspect)
The function[fn] validate(returns[ret] configuration[cfg]) uses three-tier cache architecture[§ARCH] lookup.
Layer 2

Semantic Labels

The model defines a label once using the syntax ↦LABEL: full text↤, then references it as [§LABEL] thereafter. Labels are scoped per-session and accumulate across a multi-step execution. Ideal for recurring proper nouns, system names, and long noun phrases.

Token gain0%
10–20% reduction
Interactive example
Definition (emitted once)
↦ARCH: three-tier cache architecture↤
Later reference
The [§ARCH] confirmed compatibility.
↳ click Run to expand
Layer 3

Delta References

The model emits [Δstep_id] to reference a prior step's complete output from the client's execution cache — inserting its full content without re-emitting a single token. The same reference doubles as a GC eviction back-pointer for the persistent context cache.

Token gain0%
15–25% reduction
Interactive example
Model emits (1 token)
[Δstep_1]
Client expands (from cache)
click Run to burst-expand
Critical Constraint

The Tokenization Trap

Not all short strings are single tokens. Codebook codes must be verified against the actual tokenizer — Unicode symbols and many punctuation sequences silently expand to multiple tokens, negating the compression gain entirely.

Tokenizer Inspector — BPE token analysis
Ω
T
T
2 tokens  Unicode escapes split
T
T
2 tokens  Arrow chars split in BPE
T
T
T
3 tokens  Multi-byte, not in BPE vocab
[fn]
T
1 token  Bracket-word pattern in vocab
v1
T
1 token  Alphanumeric short codes work
ok
T
1 token  Common word — in vocab
[ret]
T
1 token  Short bracket codes verified safe
Rule: All codebook codes are verified at codebook compile time by running each candidate through the target tokenizer and asserting len(tokens) == 1. Codes that fail are rejected. The verified codebook is transmitted in the sco-init SSE event alongside its HMAC signature.
Decompressor Internals

State Machine

The client-side decompressor is a deterministic state machine. It processes the raw byte stream character by character, resolving SCO constructs as they arrive without buffering or lookahead.

Current token Press Animate to begin
NORMAL
CODE_FRAME
LABEL_DEFINE
PASSTHROUGH
BURST EMIT
State machine log will appear here…
Empirical Results

Compression by Layer

Token counts measured on multi-step agentic execution traces. "Prompt-only" uses system-prompt directives alone. "Fine-tuned" uses a model specifically trained to emit SCO output, achieving closer to theoretical maximum.

Prompt-only
Fine-tuned
No compression (baseline)100%
100%
100%
Baseline: uncompressed model output. Tokens billed at full rate. No structured output contract.
Layer 0 only (SFOP)
58%
45%
SFOP alone: schema fields eliminate conversational filler. Prompt-only achieves ~42% reduction; fine-tuned reaches ~55%. Break-even at ~200 output tokens.
Layers 0 + 1 (SFOP + Codebook)
42%
32%
Adding codebook substitution compounds: frequently-repeated domain terms compress well. Prompt-only ~58%; fine-tuned ~68%. Break-even at ~300 tokens.
Layers 0 + 1 + 2 (+ Labels)
33%
24%
Semantic labels shine in multi-step sessions where proper nouns recur. Prompt-only ~67%; fine-tuned ~76%.
All four layers
25%
18%
Full stack: delta references remove repeated step payloads entirely. Prompt-only ~75%; fine-tuned ~82%. Maximum practical gain on multi-step agentic workflows.
Layer 3 Deep Dive

The Dual-Purpose Delta

A single [Δstep_id] token does two jobs simultaneously: it triggers decompression expansion for the current response, and it records a GC eviction pointer for the persistent context cache.

[Δstep_1]
Single token — dual routing
↙   ↘
1Decompression Path
StreamingDecompressor receives token
Looks up step_1 in Execution Cache
Cache hit: retrieves compiled output object
Burst-emits full expanded text to display
User sees complete prior result inline
2GC Eviction Path
GC Eviction Record receives back-pointer
Marks step_1 as referenced this turn
Updates recency weight in Persistent Cache
Prevents eviction for N subsequent steps
Context window stays compact
Click either path to animate its traversal. Both paths execute simultaneously for every [Δstep_id] token received.