Files
neuron/HANDOFF-engram-write-corruption.md
Tim Lingo 2112d2ffb3
Neuron Soul CI / build (pull_request) Successful in 3m17s
Add Phase 0 live-runtime findings to engram write-corruption handoff
Confirms two distinct write failures (capture=wrapper bug; backlog=axon :7771 unbuilt Rust),
soul runs in file-snapshot mode (not engram :8742 live), engram :8742 CRUD works but minimal,
+ a verification plan to run after the soul rebuild.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-08 16:25:12 -05:00

127 lines
9.9 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Handoff: Engram EL write-path field corruption + silent writes
**For:** Will (backend / EL soul)
**From:** Tim (via Claude Code)
**Date:** 2026-06-08
**Status:** Root cause confirmed; source fixes applied locally (NOT built/deployed); data analyzed; prune proposed (NOT applied).
---
## TL;DR
The EL wrapper `engram_node_full` had a **stale signature** that didn't match the C primitive. Because `el_val_t` is an untyped machine word, the compiler coerced caller args to the wrong declared types and forwarded them **by position** into a C function whose positions mean different things → `tier` got ints, `importance/confidence` got strings, `label` got a float, etc. One caller (`chat.el`) also put a *tier* into the `node_type` slot.
Source fixes are done. **You need to:** review, build with `elc`, restart the soul, verify, and apply the prune (daemon stopped). Details below.
---
## 1. Root cause (confirmed)
**C contract** (`el/lang/el-compiler/runtime/el_seed.h:204`):
```
__engram_node_full(content, node_type, label, salience, importance, confidence, tier, tags)
```
**Old wrapper** (`el/lang/runtime/engram.el:15-17`) — stale schema, wrong names AND types:
```
fn engram_node_full(content: String, nt: String, sal: Float, imp: Float,
source: String, lang: String, ts: Int, tags: String)
```
**Coercion mechanism:** `el_val_t` is `uintptr_t` (`#define EL_STR(s) ((el_val_t)(uintptr_t)(s))`, `EL_INT(v) (v)`). The EL compiler binds each caller arg to the wrapper's *declared* param type (String→Float / String→Int coercion at the boundary), then the wrapper forwards **positionally**. Result for a correct-order caller `(content,"Memory","memory:remembered",sal,imp,conf,tier,tags)`:
- `label``sal` (a float)
- `importance` ← a String
- `confidence` ← a String
- `tier``ts` (the tier String coerced to Int) → **tier becomes an integer**
This matches the data exactly (see §6).
---
## 2. Fix applied — wrapper (`el/lang/runtime/engram.el`)
Corrected to match the C contract 1:1 (no coercion, no reorder):
```
fn engram_node_full(content: String, node_type: String, label: String,
salience: Float, importance: Float, confidence: Float,
tier: String, tags: String) -> String {
// validation (see §4), then:
return __engram_node_full(content, node_type, label, salience, importance, confidence, tier, tags)
}
```
## 3. Fix applied — caller audit
Audited every caller (`chat.el`, `awareness.el`, `soul.el`, `memory.el`, `routes.el`, `neuron-api.el`).
**All `engram_node_full` callers already use the correct order** — so the wrapper fix repairs them automatically. **One real caller bug** fixed:
`neuron/chat.el:512` was:
```
engram_node(clean_response, "episodic", el_from_float(0.6)) // "episodic" = a TIER in the node_type slot
```
Now:
```
engram_node_full(clean_response, "Conversation", "soul:utterance",
el_from_float(0.6), el_from_float(0.6), el_from_float(0.8),
"Episodic", utterance_tags)
```
## 4. Fix applied — validation (defense in depth, `engram.el`)
Added `engram_valid_node_type` / `engram_valid_tier` allowlists. Both `engram_node` and `engram_node_full` now **reject invalid values with `__println` + return `""`** (fail loud, never silently write a malformed node).
- node_type allowlist: Memory, Knowledge, Belief, Project, Tag, BacklogItem, Artifact, Conversation, ExecutionContext, InternalStateEvent, Self, Entity, Process, ConfigEntry, Concept, Imprint *(union of the spec list + types actually present in the store — trim if some are illegitimate).*
- tier allowlist: Semantic, Episodic, Working, Procedural, Canonical, Note, Lesson
- **Note:** `el_val_t` is untyped, so this catches wrong VALUES, not wrong TYPES. Type safety comes from the corrected signatures.
> All edits above are in the working tree on Tim's machine but **NOT compiled/deployed** and **NOT compile-verified** (no `elc` on that box).
---
## 5. DEPLOY RUNBOOK (your build env)
1. Pull the edited files: `el/lang/runtime/engram.el`, `neuron/chat.el`.
2. Build: `elc` (entry `neuron/soul.el`, import chain) → `neuron/dist/*.c`, then link as in `el/lang/install.sh` (`$(CC) $(CFLAGS) -o dist/neuron-fresh dist/*.c .../el_runtime.c -lcurl -lpthread`). Confirm `engram.el` recompiles into the import chain.
3. Restart the soul. **Note:** on Tim's box it's run by `/tmp/soul-keepalive.sh` (an auto-restart loop) → stop that loop before killing `neuron-fresh`, or it'll respawn the old binary.
4. **Verify (prove end-to-end):** write a node via the live API (POST `/api/memories` or the remember path) with an obvious throwaway label, then read it back and confirm `node_type` + `tier` are correct AND that it persisted (node_count increments; survives a snapshot save). There is **no delete endpoint** — clean up via the snapshot.
---
## 6. Data analysis + prune proposal (NOT applied)
- Snapshot: `~/.neuron/engram/snapshot.json`. **Backup made:** `~/.neuron/engram/snapshot.backup-20260608.json`.
- **~107 corrupt nodes** (node_type/tier not in the valid sets). node_type junk values: `''`, `'1'`, `'2'`, `'ntn-genesis'`, `'claude-opus-4-8'`, binary. tier junk: same + `'/Users/timlingo'`.
- **0 are field-repairable.** They're all genesis-bootstrap / binary detritus where *every* field (id/label/tier/tags) is corrupted together — 69× "You are ntn-genesis, a CGI.", 62× "ntn-genesis", ~70 binary garbage, plus a proxy URL + an API path that leaked into labels. No signal to reconstruct → **prune, don't fabricate.**
- **Proposal:** `~/.neuron/engram/snapshot.pruned.json` — 3,631 clean nodes (107 junk removed), edges intact (no dangling). Byte-verified: no *clean* node contains binary content, so re-encoding is lossless.
- **NOT applied** because the live daemon is **actively rewriting `snapshot.json`** (two reads returned different counts). Applying requires stopping the soul + keepalive, swapping in the pruned snapshot, then restarting. Do this in your controlled env with the backup retained.
---
## 7. Security heads-up (please action)
- `ANTHROPIC_API_KEY` is stored **in plaintext** in `/tmp/soul-keepalive.sh` — rotate it and move to a secret store.
- Internal infra leaked into node fields (`http://localhost:7771`, `/api/graph/edges?limit=5000`) — symptom of the same write bug; the prune removes those nodes.
## 8. Backlog of related gaps (separate from this fix)
- Soul chat loop reports **no tools** (`NONE`) / `NO_SHELL` — it narrates `curl`/`sqlite3` without executing. The capture REST path works, but the chat agent can't call it.
- **No `PUT`/`DELETE`** on knowledge nodes (`method not allowed`) — needed for UI edit/delete.
- No **source-conversation** edge on captured nodes — blocks "see source chat" in the UI.
- Writes have been **frozen since ~2026-04-29** (newest knowledge node) — nothing is being added in the current running state.
---
## ADDENDUM — Phase 0 live runtime findings (2026-06-08, verified against the running system)
Validated the write path end-to-end against `neuron-fresh :7770` + `engram :8742`. Confirms the diagnosis and corrects two common assumptions.
**Ports:** `engram :8742` ✓ listening (healthy: `{"status":"ok","engine":"engram-runtime-native"}`), `neuron-fresh :7770` ✓, **`:7771` NOT listening.**
**Two distinct write failures (not one):**
1. **`/api/neuron/knowledge/capture` + memory remember** — handled **in-process by the soul** (`neuron-api.el` `handle_api_capture_knowledge` / remember → `engram_node_full(...)`). Live test: `POST …/knowledge/capture` returned `{"id":"2ccfc147…","ok":true}` but that id is **absent from `/api/graph/nodes` and `snapshot.json`** → the node corrupted/vanished. **This is exactly the `engram_node_full` wrapper bug this PR fixes.** It is NOT a `:7771` issue. → fixed by el PR #52 + soul rebuild.
2. **`/api/backlog`, `/api/memories`, `/api/knowledge`, `/api/artifacts`, `/api/projects`, `/api/imprints`** — `routes.el` proxies these to **`axon`** via `axon_get`/`axon_post` (base `SOUL_AXON` or default **`http://localhost:7771`**). `axon` = **`protocols/axon`, an unbuilt Rust crate**, not running → "Failed to connect to localhost port 7771." → needs axon stood up (separate Rust workstream) OR routes repointed.
**Architecture clarifications (so nobody chases the wrong port again):**
- The soul runs in **file-snapshot mode** (no `ENGRAM_URL` in `/tmp/soul-keepalive.sh`) → it uses `~/.neuron/engram/snapshot.json`, **not `engram :8742` live**. So writing to `:8742` does NOT make data visible to the soul the app talks to.
- `engram :8742` is its own EL service (`engram/src/server.el`) with a **working CRUD API**: `POST/GET/DELETE /api/nodes`, `/api/edges`, `/api/save`, `/api/load`, `/api/activate`, `/api/search`. Verified create+delete (`{"ok":true}`). **But** its `route_create_node` only reads `content/node_type/salience`**no label/tier/tags/metadata** — so it can't set `metadata.tier_source: canonical`.
- Minor EL bug in `engram/src/server.el route_create_node`: `if str_eq(node_type,""){ let node_type = "Memory" }` **shadows** (new local) instead of reassigning → the default never applies; same for `salience`. Worth fixing while in there.
**Verification plan (run after the soul rebuild lands):**
1. `POST /api/neuron/knowledge/capture {content,title,tier:canonical}` → capture the returned id.
2. `GET /api/neuron/knowledge/search?q=<term>` → confirm the node comes back with correct `node_type`/`metadata.tier_source`.
3. Confirm it survives a snapshot save (present in `snapshot.json`). Only then is the write "real."
4. Backlog: once `axon :7771` is up, repeat for `POST /api/backlog`.
**Net:** "make writes persist" needs (a) **this wrapper fix built into the soul** (capture) and (b) **`axon :7771` running** (backlog/artifacts/etc.). Neither was doable on Tim's box (no `elc`; `axon` is unbuilt Rust — out of scope per the no-Rust guardrail). No live writes/restarts were performed; engram probe node was created and deleted to verify the API.