neuron/HANDOFF-engram-write-corruption.md
Tim Lingo 2112d2ffb3
Add Phase 0 live-runtime findings to engram write-corruption handoff
Confirms two distinct write failures (capture=wrapper bug; backlog=axon :7771 unbuilt Rust),
soul runs in file-snapshot mode (not engram :8742 live), engram :8742 CRUD works but minimal,
+ a verification plan to run after the soul rebuild.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 16:25:12 -05:00


Handoff: Engram EL write-path field corruption + silent writes

For: Will (backend / EL soul)
From: Tim (via Claude Code)
Date: 2026-06-08
Status: Root cause confirmed; source fixes applied locally (NOT built/deployed); data analyzed; prune proposed (NOT applied).


TL;DR

The EL wrapper engram_node_full had a stale signature that didn't match the C primitive. Because el_val_t is an untyped machine word, the compiler coerced caller args to the wrong declared types and forwarded them by position into a C function whose positions mean different things → tier got ints, importance/confidence got strings, label got a float, etc. One caller (chat.el) also put a tier into the node_type slot.

Source fixes are done. You need to: review, build with elc, restart the soul, verify, and apply the prune (daemon stopped). Details below.


1. Root cause (confirmed)

C contract (el/lang/el-compiler/runtime/el_seed.h:204):

__engram_node_full(content, node_type, label, salience, importance, confidence, tier, tags)

Old wrapper (el/lang/runtime/engram.el:15-17) — stale schema, wrong names AND types:

fn engram_node_full(content: String, nt: String, sal: Float, imp: Float,
                    source: String, lang: String, ts: Int, tags: String)

Coercion mechanism: el_val_t is uintptr_t (#define EL_STR(s) ((el_val_t)(uintptr_t)(s)), EL_INT(v) (v)). The EL compiler binds each caller arg to the wrapper's declared param type (String→Float / String→Int coercion at the boundary), then the wrapper forwards positionally. Result for a correct-order caller (content,"Memory","memory:remembered",sal,imp,conf,tier,tags):

  • label ← sal (a float)
  • importance ← a String
  • confidence ← a String
  • tier ← ts (the tier String coerced to Int) → tier becomes an integer

This matches the data exactly (see §6).
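The slot-by-slot mismatch can be reproduced mechanically. The following is a minimal Python sketch (an illustration, not EL and not the compiler): it lines up the C contract against the stale wrapper by position and reports every slot whose declared type differs. The C-side types are taken from the corrected wrapper in §2; the function name trace_mismatches is hypothetical.

```python
# Illustrative model only: compares declared types by position.
# It does not emulate the EL compiler or the el_val_t coercion itself.

# Parameter order of __engram_node_full (types as in the corrected wrapper, §2).
C_CONTRACT = [
    ("content", "String"), ("node_type", "String"), ("label", "String"),
    ("salience", "Float"), ("importance", "Float"), ("confidence", "Float"),
    ("tier", "String"), ("tags", "String"),
]

# Stale engram_node_full signature from engram.el:15-17.
OLD_WRAPPER = [
    ("content", "String"), ("nt", "String"), ("sal", "Float"), ("imp", "Float"),
    ("source", "String"), ("lang", "String"), ("ts", "Int"), ("tags", "String"),
]


def trace_mismatches(contract, wrapper):
    """Return (c_field, wrapper_param, wrapper_type) for each position
    where the wrapper's declared type differs from the C contract's."""
    mismatches = []
    for (c_name, c_type), (w_name, w_type) in zip(contract, wrapper):
        if c_type != w_type:
            mismatches.append((c_name, w_name, w_type))
    return mismatches


for c_name, w_name, w_type in trace_mismatches(C_CONTRACT, OLD_WRAPPER):
    print(f"{c_name} <- {w_name} (bound as {w_type})")
```

The four slots it reports (label, importance, confidence, tier) are the same four listed above; salience is unaffected because the wrapper's fourth parameter also happened to be a Float.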


2. Fix applied — wrapper (el/lang/runtime/engram.el)

Corrected to match the C contract 1:1 (no coercion, no reorder):

fn engram_node_full(content: String, node_type: String, label: String,
                    salience: Float, importance: Float, confidence: Float,
                    tier: String, tags: String) -> String {
    // validation (see §4), then:
    return __engram_node_full(content, node_type, label, salience, importance, confidence, tier, tags)
}

3. Fix applied — caller audit

Audited every caller (chat.el, awareness.el, soul.el, memory.el, routes.el, neuron-api.el). All engram_node_full callers already use the correct order — so the wrapper fix repairs them automatically. One real caller bug fixed:

neuron/chat.el:512 was:

engram_node(clean_response, "episodic", el_from_float(0.6))   // "episodic" = a TIER in the node_type slot

Now:

engram_node_full(clean_response, "Conversation", "soul:utterance",
                 el_from_float(0.6), el_from_float(0.6), el_from_float(0.8),
                 "Episodic", utterance_tags)

4. Fix applied — validation (defense in depth, engram.el)

Added engram_valid_node_type / engram_valid_tier allowlists. Both engram_node and engram_node_full now reject invalid values with __println + return "" (fail loud, never silently write a malformed node).

  • node_type allowlist: Memory, Knowledge, Belief, Project, Tag, BacklogItem, Artifact, Conversation, ExecutionContext, InternalStateEvent, Self, Entity, Process, ConfigEntry, Concept, Imprint (union of the spec list + types actually present in the store — trim if some are illegitimate).
  • tier allowlist: Semantic, Episodic, Working, Procedural, Canonical, Note, Lesson
  • Note: el_val_t is untyped, so this catches wrong VALUES, not wrong TYPES. Type safety comes from the corrected signatures.
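For reviewers who want the validation logic at a glance, here is a Python sketch of the same reject-before-write behavior. It is not the EL source: checked_node_full and the write callback are hypothetical stand-ins for the wrapper and the C primitive, and exact-match (case-sensitive) comparison is an assumption.

```python
# Sketch of the allowlist guard described above; not the actual engram.el code.
VALID_NODE_TYPES = frozenset({
    "Memory", "Knowledge", "Belief", "Project", "Tag", "BacklogItem",
    "Artifact", "Conversation", "ExecutionContext", "InternalStateEvent",
    "Self", "Entity", "Process", "ConfigEntry", "Concept", "Imprint",
})
VALID_TIERS = frozenset({
    "Semantic", "Episodic", "Working", "Procedural", "Canonical", "Note", "Lesson",
})


def checked_node_full(write, content, node_type, label,
                      salience, importance, confidence, tier, tags):
    """Forward to `write` only if node_type and tier are on the allowlists.

    `write` is a hypothetical stand-in for __engram_node_full.
    On rejection, log and return "" (fail loud, never write a malformed node).
    """
    if node_type not in VALID_NODE_TYPES:
        print(f"engram_node_full: invalid node_type {node_type!r}")
        return ""
    if tier not in VALID_TIERS:
        print(f"engram_node_full: invalid tier {tier!r}")
        return ""
    return write(content, node_type, label,
                 salience, importance, confidence, tier, tags)
```

With this guard, the old chat.el call shape (a tier such as "episodic" in the node_type slot) is rejected instead of written.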

All edits above are in the working tree on Tim's machine but NOT compiled/deployed and NOT compile-verified (no elc on that box).


5. DEPLOY RUNBOOK (your build env)

  1. Pull the edited files: el/lang/runtime/engram.el, neuron/chat.el.
  2. Build: elc (entry neuron/soul.el, import chain) → neuron/dist/*.c, then link as in el/lang/install.sh ($(CC) $(CFLAGS) -o dist/neuron-fresh dist/*.c .../el_runtime.c -lcurl -lpthread). Confirm engram.el recompiles into the import chain.
  3. Restart the soul. Note: on Tim's box it's run by /tmp/soul-keepalive.sh (an auto-restart loop) → stop that loop before killing neuron-fresh, or it'll respawn the old binary.
  4. Verify (prove end-to-end): write a node via the live API (POST /api/memories or the remember path) with an obvious throwaway label, then read it back and confirm node_type + tier are correct AND that it persisted (node_count increments; survives a snapshot save). There is no delete endpoint — clean up via the snapshot.

6. Data analysis + prune proposal (NOT applied)

  • Snapshot: ~/.neuron/engram/snapshot.json. Backup made: ~/.neuron/engram/snapshot.backup-20260608.json.
  • ~107 corrupt nodes (node_type/tier not in the valid sets). node_type junk values: '', '1', '2', 'ntn-genesis', 'claude-opus-4-8', binary. tier junk: same + '/Users/timlingo'.
  • 0 are field-repairable. They're all genesis-bootstrap / binary detritus where every field (id/label/tier/tags) is corrupted together — 69× "You are ntn-genesis, a CGI.", 62× "ntn-genesis", ~70 binary garbage, plus a proxy URL + an API path that leaked into labels. No signal to reconstruct → prune, don't fabricate.
  • Proposal: ~/.neuron/engram/snapshot.pruned.json — 3,631 clean nodes (107 junk removed), edges intact (no dangling). Byte-verified: no clean node contains binary content, so re-encoding is lossless.
  • NOT applied because the live daemon is actively rewriting snapshot.json (two reads returned different counts). Applying requires stopping the soul + keepalive, swapping in the pruned snapshot, then restarting. Do this in your controlled env with the backup retained.
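The prune and its two safety checks can be sketched as follows. This is a Python illustration under loud assumptions: the snapshot schema is not documented here, so the key names "nodes", "edges", "id", "source" and "target" are hypothetical, as are the function names. The two-read comparison is only a heuristic for spotting an active writer; stopping the soul and keepalive is still required.

```python
import time
from pathlib import Path


def stable_read(path, delay=1.0):
    """Read the file twice, `delay` seconds apart; raise if it changed.

    Heuristic only: a quiet interval does not prove the daemon is stopped.
    """
    first = Path(path).read_bytes()
    time.sleep(delay)
    second = Path(path).read_bytes()
    if first != second:
        raise RuntimeError(f"{path} changed between reads; stop the soul and keepalive first")
    return first


def is_clean(node, valid_types, valid_tiers):
    """A node is clean when both fields are strings on the allowlists."""
    node_type, tier = node.get("node_type"), node.get("tier")
    return (isinstance(node_type, str) and node_type in valid_types
            and isinstance(tier, str) and tier in valid_tiers)


def prune(snapshot, valid_types, valid_tiers):
    """Return (pruned_snapshot, removed_count, dangling_edges).

    ASSUMED schema: {"nodes": [{"id", "node_type", "tier", ...}],
                     "edges": [{"source", "target", ...}]}.
    Edges are reported, not dropped, so a non-empty list can block the swap.
    """
    nodes = snapshot["nodes"]
    kept = [n for n in nodes if is_clean(n, valid_types, valid_tiers)]
    kept_ids = {n["id"] for n in kept}
    dangling = [e for e in snapshot.get("edges", [])
                if e["source"] not in kept_ids or e["target"] not in kept_ids]
    pruned = dict(snapshot, nodes=kept)
    return pruned, len(nodes) - len(kept), dangling
```

A run that matches the proposal would report 107 removed and an empty dangling list; any other result is a reason to stop and compare against snapshot.pruned.json before swapping anything in.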

7. Security heads-up (please action)

  • ANTHROPIC_API_KEY is stored in plaintext in /tmp/soul-keepalive.sh — rotate it and move to a secret store.
  • Internal infra leaked into node fields (http://localhost:7771, /api/graph/edges?limit=5000) — symptom of the same write bug; the prune removes those nodes.
  • Soul chat loop reports no tools (NONE) / NO_SHELL — it narrates curl/sqlite3 without executing. The capture REST path works, but the chat agent can't call it.
  • No PUT/DELETE on knowledge nodes (method not allowed) — needed for UI edit/delete.
  • No source-conversation edge on captured nodes — blocks "see source chat" in the UI.
  • Writes have been frozen since ~2026-04-29 (newest knowledge node) — nothing is being added in the current running state.

ADDENDUM — Phase 0 live runtime findings (2026-06-08, verified against the running system)

Validated the write path end-to-end against neuron-fresh :7770 + engram :8742. Confirms the diagnosis and corrects two common assumptions.

Ports: engram :8742 ✓ listening (healthy: {"status":"ok","engine":"engram-runtime-native"}), neuron-fresh :7770 ✓, :7771 NOT listening.
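To re-check the port state after a rebuild, a plain TCP connect is enough. A minimal Python sketch, assuming all three services bind on localhost (the helper name is_listening is hypothetical); it only shows that something accepts connections, not that the service is healthy:

```python
import socket


def is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, etc.
        return False


if __name__ == "__main__":
    # Ports as reported above; hostnames are assumed to be localhost.
    for name, port in (("neuron-fresh", 7770), ("axon", 7771), ("engram", 8742)):
        state = "listening" if is_listening("localhost", port) else "NOT listening"
        print(f"{name} :{port} {state}")
```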

Two distinct write failures (not one):

  1. /api/neuron/knowledge/capture + memory remember — handled in-process by the soul (neuron-api.el handle_api_capture_knowledge / remember → engram_node_full(...)). Live test: POST …/knowledge/capture returned {"id":"2ccfc147…","ok":true} but that id is absent from /api/graph/nodes and snapshot.json → the node corrupted/vanished. This is exactly the engram_node_full wrapper bug this PR fixes. It is NOT a :7771 issue. → fixed by el PR #52 + soul rebuild.
  2. /api/backlog, /api/memories, /api/knowledge, /api/artifacts, /api/projects, /api/imprints: routes.el proxies these to axon via axon_get/axon_post (base SOUL_AXON or default http://localhost:7771). axon = protocols/axon, an unbuilt Rust crate, not running → "Failed to connect to localhost port 7771." → needs axon stood up (separate Rust workstream) OR routes repointed.

Architecture clarifications (so nobody chases the wrong port again):

  • The soul runs in file-snapshot mode (no ENGRAM_URL in /tmp/soul-keepalive.sh) → it uses ~/.neuron/engram/snapshot.json, not engram :8742 live. So writing to :8742 does NOT make data visible to the soul the app talks to.
  • engram :8742 is its own EL service (engram/src/server.el) with a working CRUD API: POST/GET/DELETE /api/nodes, /api/edges, /api/save, /api/load, /api/activate, /api/search. Verified create+delete ({"ok":true}). But its route_create_node only reads content/node_type/salience (no label/tier/tags/metadata), so it can't set metadata.tier_source: canonical.
  • Minor EL bug in engram/src/server.el route_create_node: if str_eq(node_type,""){ let node_type = "Memory" } shadows (new local) instead of reassigning → the default never applies; same for salience. Worth fixing while in there.

Verification plan (run after the soul rebuild lands):

  1. POST /api/neuron/knowledge/capture {content,title,tier:canonical} → capture the returned id.
  2. GET /api/neuron/knowledge/search?q=<term> → confirm the node comes back with correct node_type/metadata.tier_source.
  3. Confirm it survives a snapshot save (present in snapshot.json). Only then is the write "real."
  4. Backlog: once axon :7771 is up, repeat for POST /api/backlog.
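Step 3 is the one that caught the original failure (an id returned with "ok":true but absent from the snapshot), so it is worth scripting. A Python sketch that avoids guessing the snapshot layout: it walks the parsed JSON and looks for any object whose "id" equals the returned id. The assumption that nodes carry their id under the key "id" is mine, and the function names are hypothetical.

```python
import json


def contains_id(doc, node_id):
    """Depth-first search of parsed JSON for any object with "id" == node_id."""
    stack = [doc]
    while stack:
        cur = stack.pop()
        if isinstance(cur, dict):
            if cur.get("id") == node_id:
                return True
            stack.extend(cur.values())
        elif isinstance(cur, list):
            stack.extend(cur)
    return False


def id_in_snapshot(path, node_id):
    """Load the snapshot file and report whether node_id appears in it.

    errors="replace" keeps the load from failing on stray non-UTF-8 bytes.
    """
    with open(path, encoding="utf-8", errors="replace") as fh:
        return contains_id(json.load(fh), node_id)
```

Run it against snapshot.json only after a snapshot save has happened; a False result right after the POST may just mean the save has not run yet.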

Net: "make writes persist" needs (a) this wrapper fix built into the soul (capture) and (b) axon :7771 running (backlog/artifacts/etc.). Neither was doable on Tim's box (no elc; axon is unbuilt Rust — out of scope per the no-Rust guardrail). No live writes/restarts were performed; engram probe node was created and deleted to verify the API.