neuron/HANDOFF-engram-write-corruption.md
Tim Lingo 799ca3758b
Neuron Soul CI / build (pull_request) Successful in 3m15s
Fix chat.el node_type-slot bug + add engram write-corruption handoff
chat.el recorded the soul's utterance via engram_node(content, "episodic", ...),
putting a TIER into the node_type slot (nodes showed node_type="episodic"). Now uses
engram_node_full(..., "Conversation", "soul:utterance", ..., "Episodic", tags).

The core wrapper fix is in the el repo (PR #52). HANDOFF-engram-write-corruption.md
has the full root-cause analysis, coercion mechanism, caller audit, validation,
deploy runbook (elc build + restart), and the data-prune proposal (~107 corrupt
nodes, all unrecoverable genesis/binary detritus → prune; backup taken).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 16:14:20 -05:00


Handoff: Engram EL write-path field corruption + silent writes

For: Will (backend / EL soul)
From: Tim (via Claude Code)
Date: 2026-06-08
Status: Root cause confirmed; source fixes applied locally (NOT built/deployed); data analyzed; prune proposed (NOT applied).


TL;DR

The EL wrapper engram_node_full had a stale signature that didn't match the C primitive. Because el_val_t is an untyped machine word, the compiler coerced each caller arg to the wrapper's (wrong) declared type, and the wrapper then forwarded the args by position into a C function whose positions mean different things. Result: tier got ints, importance/confidence got strings, label got a float. One caller (chat.el) also put a tier into the node_type slot.

Source fixes are done but not yet compiled. You need to: review, build with elc, restart the soul, verify, and apply the prune with the daemon stopped. Details below.


1. Root cause (confirmed)

C contract (el/lang/el-compiler/runtime/el_seed.h:204):

__engram_node_full(content, node_type, label, salience, importance, confidence, tier, tags)

Old wrapper (el/lang/runtime/engram.el:15-17) — stale schema, wrong names AND types:

fn engram_node_full(content: String, nt: String, sal: Float, imp: Float,
                    source: String, lang: String, ts: Int, tags: String)

Coercion mechanism: el_val_t is uintptr_t (#define EL_STR(s) ((el_val_t)(uintptr_t)(s)), EL_INT(v) (v)). The EL compiler binds each caller arg to the wrapper's declared param type (String→Float / String→Int coercion at the boundary), then the wrapper forwards positionally. Result for a correct-order caller (content,"Memory","memory:remembered",sal,imp,conf,tier,tags):

  • label ← sal (a float)
  • importance ← source (a String)
  • confidence ← lang (a String)
  • tier ← ts (the tier String coerced to Int) → tier becomes an integer

This matches the data exactly (see §6).
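
The mis-binding can be traced position by position. The sketch below is a small Python model of that trace, not the EL compiler: it pairs the C slot order with the old wrapper's declared parameters (both copied from this section) and flags every position where the type a correct-order caller passes differs from the type the wrapper declared. The names trace_binding and coerced_slots are illustrative and do not exist in the codebase.

```python
# Model of the stale wrapper: each caller argument is bound to the wrapper's
# declared type, then forwarded by position into the C primitive's slots.

# Parameter order of the C primitive, copied from the C contract.
C_SLOTS = ["content", "node_type", "label", "salience",
           "importance", "confidence", "tier", "tags"]

# Old wrapper signature as (name, declared type), copied from engram.el:15-17.
OLD_WRAPPER = [("content", "String"), ("nt", "String"), ("sal", "Float"),
               ("imp", "Float"), ("source", "String"), ("lang", "String"),
               ("ts", "Int"), ("tags", "String")]

# Types a correct-order caller actually passes, one per C slot.
CALLER_TYPES = ["String", "String", "String", "Float",
                "Float", "Float", "String", "String"]


def trace_binding(slots, wrapper, caller_types):
    """Return (c_slot, wrapper_param, passed_type, declared_type) per position."""
    return [(slot, name, passed, declared)
            for slot, (name, declared), passed
            in zip(slots, wrapper, caller_types)]


def coerced_slots(slots, wrapper, caller_types):
    """Return the C slots whose value changed type on the way through the wrapper."""
    return [slot
            for slot, _name, passed, declared
            in trace_binding(slots, wrapper, caller_types)
            if passed != declared]


if __name__ == "__main__":
    for slot, name, passed, declared in trace_binding(
            C_SLOTS, OLD_WRAPPER, CALLER_TYPES):
        flag = "COERCED" if passed != declared else "ok"
        print(f"{slot:<11} <- {name:<7} {passed} -> {declared}  {flag}")
```

Under these assumptions the flagged slots are label, importance, confidence, and tier, which are the same four fields listed above.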


2. Fix applied — wrapper (el/lang/runtime/engram.el)

Corrected to match the C contract 1:1 (no coercion, no reorder):

fn engram_node_full(content: String, node_type: String, label: String,
                    salience: Float, importance: Float, confidence: Float,
                    tier: String, tags: String) -> String {
    // validation (see §4), then:
    return __engram_node_full(content, node_type, label, salience, importance, confidence, tier, tags)
}

3. Fix applied — caller audit

Audited every caller (chat.el, awareness.el, soul.el, memory.el, routes.el, neuron-api.el). All engram_node_full callers already use the correct order — so the wrapper fix repairs them automatically. One real caller bug fixed:

neuron/chat.el:512 was:

engram_node(clean_response, "episodic", el_from_float(0.6))   // "episodic" = a TIER in the node_type slot

Now:

engram_node_full(clean_response, "Conversation", "soul:utterance",
                 el_from_float(0.6), el_from_float(0.6), el_from_float(0.8),
                 "Episodic", utterance_tags)

4. Fix applied — validation (defense in depth, engram.el)

Added engram_valid_node_type / engram_valid_tier allowlists. Both engram_node and engram_node_full now reject invalid values with __println + return "" (fail loud, never silently write a malformed node).

  • node_type allowlist: Memory, Knowledge, Belief, Project, Tag, BacklogItem, Artifact, Conversation, ExecutionContext, InternalStateEvent, Self, Entity, Process, ConfigEntry, Concept, Imprint (union of the spec list + types actually present in the store — trim if some are illegitimate).
  • tier allowlist: Semantic, Episodic, Working, Procedural, Canonical, Note, Lesson
  • Note: el_val_t is untyped, so this catches wrong VALUES, not wrong TYPES. Type safety comes from the corrected signatures.
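
For review purposes, the intended behaviour of the allowlist check can be modelled outside EL. The following is a Python sketch, not the engram.el code: the two sets are copied from the lists above, while checked_write and its write callback are hypothetical names, and exact (case-sensitive) matching is an assumption.

```python
# Allowlists copied from this section; trim if some node types are illegitimate.
VALID_NODE_TYPES = {
    "Memory", "Knowledge", "Belief", "Project", "Tag", "BacklogItem",
    "Artifact", "Conversation", "ExecutionContext", "InternalStateEvent",
    "Self", "Entity", "Process", "ConfigEntry", "Concept", "Imprint",
}
VALID_TIERS = {"Semantic", "Episodic", "Working", "Procedural",
               "Canonical", "Note", "Lesson"}


def checked_write(node_type, tier, write):
    """Call write() only when both values are allowlisted.

    Mirrors the fail-loud contract described above: print the reason and
    return "" instead of writing a malformed node. `write` is a hypothetical
    zero-argument callback standing in for the real primitive.
    """
    if not (isinstance(node_type, str) and node_type in VALID_NODE_TYPES):
        print(f"engram: rejected node_type {node_type!r}")
        return ""
    if not (isinstance(tier, str) and tier in VALID_TIERS):
        print(f"engram: rejected tier {tier!r}")
        return ""
    return write()


if __name__ == "__main__":
    # The old chat.el call shape: a tier value in the node_type slot.
    print(repr(checked_write("episodic", "Episodic", lambda: "node-id")))
    # The corrected call shape.
    print(repr(checked_write("Conversation", "Episodic", lambda: "node-id")))
```

The isinstance guard is only there so the model also rejects the integer tiers seen in the corrupt data; as noted above, the EL check itself sees values, not types.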

All edits above are in the working tree on Tim's machine but NOT compiled/deployed and NOT compile-verified (no elc on that box).


5. DEPLOY RUNBOOK (your build env)

  1. Pull the edited files: el/lang/runtime/engram.el, neuron/chat.el.
  2. Build: elc (entry neuron/soul.el, import chain) → neuron/dist/*.c, then link as in el/lang/install.sh ($(CC) $(CFLAGS) -o dist/neuron-fresh dist/*.c .../el_runtime.c -lcurl -lpthread). Confirm engram.el recompiles into the import chain.
  3. Restart the soul. Note: on Tim's box it's run by /tmp/soul-keepalive.sh (an auto-restart loop) → stop that loop before killing neuron-fresh, or it'll respawn the old binary.
  4. Verify (prove end-to-end): write a node via the live API (POST /api/memories or the remember path) with an obvious throwaway label, then read it back and confirm node_type + tier are correct AND that it persisted (node_count increments; survives a snapshot save). There is no delete endpoint — clean up via the snapshot.
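
One way to make the verify step repeatable is to compare a snapshot taken before the test write with one taken after it. The sketch below assumes the snapshot parses to a JSON object with a top-level "nodes" list whose entries carry "label", "node_type", and "tier" keys; that layout is an assumption and should be checked against the real snapshot.json. verify_write is a hypothetical helper, not part of the soul.

```python
def verify_write(before, after, label, node_type, tier):
    """Compare two parsed snapshots around a test write.

    Returns a list of problems; an empty list means the node persisted with
    the expected node_type and tier. Assumes each snapshot is a dict with a
    "nodes" list of dicts (an unverified guess at the snapshot layout).
    """
    problems = []
    if len(after["nodes"]) <= len(before["nodes"]):
        problems.append("node count did not increase")

    matches = [n for n in after["nodes"] if n.get("label") == label]
    if len(matches) != 1:
        problems.append(
            f"expected 1 node labelled {label!r}, found {len(matches)}")
        return problems

    node = matches[0]
    for field, expected in (("node_type", node_type), ("tier", tier)):
        if node.get(field) != expected:
            problems.append(
                f"{field} is {node.get(field)!r}, expected {expected!r}")
    return problems


if __name__ == "__main__":
    # Tiny in-memory demo; real use would json.load() two copies of the
    # snapshot, one taken before the write and one after a snapshot save.
    before = {"nodes": []}
    after = {"nodes": [{"label": "zz-throwaway", "node_type": "Memory",
                        "tier": "Episodic"}]}
    print(verify_write(before, after, "zz-throwaway", "Memory", "Episodic"))
```

Because the live daemon rewrites the snapshot on its own schedule, the count check only requires an increase rather than an exact +1.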

6. Data analysis + prune proposal (NOT applied)

  • Snapshot: ~/.neuron/engram/snapshot.json. Backup made: ~/.neuron/engram/snapshot.backup-20260608.json.
  • ~107 corrupt nodes (node_type/tier not in the valid sets). node_type junk values: '', '1', '2', 'ntn-genesis', 'claude-opus-4-8', binary. tier junk: same + '/Users/timlingo'.
  • 0 are field-repairable. They're all genesis-bootstrap / binary detritus where every field (id/label/tier/tags) is corrupted together — 69× "You are ntn-genesis, a CGI.", 62× "ntn-genesis", ~70 binary garbage, plus a proxy URL + an API path that leaked into labels. No signal to reconstruct → prune, don't fabricate.
  • Proposal: ~/.neuron/engram/snapshot.pruned.json — 3,631 clean nodes (107 junk removed), edges intact (no dangling). Byte-verified: no clean node contains binary content, so re-encoding is lossless.
  • NOT applied because the live daemon is actively rewriting snapshot.json (two reads returned different counts). Applying requires stopping the soul + keepalive, swapping in the pruned snapshot, then restarting. Do this in your controlled env with the backup retained.
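
The prune rule described in this section (drop every node whose node_type or tier is outside the valid sets, then confirm no edge dangles) can be sketched as below. This is not the script that produced snapshot.pruned.json. It assumes a snapshot shaped as {"nodes": [...], "edges": [...]} with node keys "id", "node_type", "tier" and edge keys "source" and "target"; the top-level keys and the edge key names in particular are guesses to verify against the real file. It writes to a separate output path and never edits the input in place.

```python
import json
import sys

# Allowlists copied from section 4.
VALID_NODE_TYPES = {
    "Memory", "Knowledge", "Belief", "Project", "Tag", "BacklogItem",
    "Artifact", "Conversation", "ExecutionContext", "InternalStateEvent",
    "Self", "Entity", "Process", "ConfigEntry", "Concept", "Imprint",
}
VALID_TIERS = {"Semantic", "Episodic", "Working", "Procedural",
               "Canonical", "Note", "Lesson"}


def is_clean(node):
    """True when node_type and tier are both allowlisted strings."""
    node_type, tier = node.get("node_type"), node.get("tier")
    return (isinstance(node_type, str) and node_type in VALID_NODE_TYPES
            and isinstance(tier, str) and tier in VALID_TIERS)


def prune(snapshot):
    """Return (pruned_snapshot, removed_nodes, dropped_edge_count).

    The input dict is not mutated. Edge keys "source"/"target" are assumed.
    """
    keep = [n for n in snapshot["nodes"] if is_clean(n)]
    removed = [n for n in snapshot["nodes"] if not is_clean(n)]
    kept_ids = {n.get("id") for n in keep}

    all_edges = snapshot.get("edges", [])
    edges = [e for e in all_edges
             if e.get("source") in kept_ids and e.get("target") in kept_ids]

    pruned = dict(snapshot, nodes=keep, edges=edges)
    return pruned, removed, len(all_edges) - len(edges)


if __name__ == "__main__":
    # Usage: python prune.py <snapshot.json> <output.json>
    src, dst = sys.argv[1], sys.argv[2]
    with open(src, encoding="utf-8") as fh:
        snapshot = json.load(fh)
    pruned, removed, dropped = prune(snapshot)
    with open(dst, "w", encoding="utf-8") as fh:
        json.dump(pruned, fh)
    print(f"kept {len(pruned['nodes'])} nodes, removed {len(removed)}, "
          f"dropped {dropped} dangling edges")
```

A nonzero dropped-edge count would contradict the "edges intact" finding above and is worth investigating before any swap.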

7. Security heads-up (please action)

  • ANTHROPIC_API_KEY is stored in plaintext in /tmp/soul-keepalive.sh — rotate it and move to a secret store.
  • Internal infra leaked into node fields (http://localhost:7771, /api/graph/edges?limit=5000) — symptom of the same write bug; the prune removes those nodes.
  • Soul chat loop reports no tools (NONE) / NO_SHELL — it narrates curl/sqlite3 without executing. The capture REST path works, but the chat agent can't call it.
  • No PUT/DELETE on knowledge nodes (method not allowed) — needed for UI edit/delete.
  • No source-conversation edge on captured nodes — blocks "see source chat" in the UI.
  • Writes have been frozen since ~2026-04-29 (newest knowledge node) — nothing is being added in the current running state.