Merge PR #1: Engram write-corruption: chat.el caller fix + full handoff
Deploy Soul to GKE / deploy (push) Failing after 12m33s
Neuron Soul CI / build (push) Failing after 12m43s

2026-06-15 11:29:18 -05:00
2 changed files with 135 additions and 1 deletions
+126
@@ -0,0 +1,126 @@
# Handoff: Engram EL write-path field corruption + silent writes
**For:** Will (backend / EL soul)
**From:** Tim (via Claude Code)
**Date:** 2026-06-08
**Status:** Root cause confirmed; source fixes applied locally (NOT built/deployed); data analyzed; prune proposed (NOT applied).
---
## TL;DR
The EL wrapper `engram_node_full` had a **stale signature** that didn't match the C primitive. Because `el_val_t` is an untyped machine word, the compiler coerced caller args to the wrong declared types and forwarded them **by position** into a C function whose positions mean different things → `tier` got ints, `importance/confidence` got strings, `label` got a float, etc. One caller (`chat.el`) also put a *tier* into the `node_type` slot.
Source fixes are done. **You need to:** review, build with `elc`, restart the soul, verify, and apply the prune while the daemon is stopped. Details below.
---
## 1. Root cause (confirmed)
**C contract** (`el/lang/el-compiler/runtime/el_seed.h:204`):
```
__engram_node_full(content, node_type, label, salience, importance, confidence, tier, tags)
```
**Old wrapper** (`el/lang/runtime/engram.el:15-17`) — stale schema, wrong names AND types:
```
fn engram_node_full(content: String, nt: String, sal: Float, imp: Float,
source: String, lang: String, ts: Int, tags: String)
```
**Coercion mechanism:** `el_val_t` is `uintptr_t` (`#define EL_STR(s) ((el_val_t)(uintptr_t)(s))`, `EL_INT(v) (v)`). The EL compiler binds each caller arg to the wrapper's *declared* param type (String→Float, Float→String, and String→Int coercions at the boundary), then the wrapper forwards **positionally**. Result for a correct-order caller `(content,"Memory","memory:remembered",sal,imp,conf,tier,tags)`:
- `label` ← `sal` (a float)
- `importance` ← a String
- `confidence` ← a String
- `tier` ← `ts` (the tier String coerced to Int) → **tier becomes an integer**
This matches the data exactly (see §6).
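
The slot mapping above can be checked mechanically. The following is a minimal Python model, not the EL compiler or the C runtime: it lines up the old wrapper's declared parameters against the C contract by position and reports every slot whose declared type differs. The expected types in `C_CONTRACT` are an assumption taken from the corrected wrapper in §2, and `mismatched_slots` is a hypothetical helper name.

```python
# Illustrative model of the stale-wrapper mismatch described in section 1.
# The real mechanism lives in the EL compiler and C runtime, not in Python.

# C slot names (el_seed.h) paired with the types the corrected wrapper declares.
C_CONTRACT = [
    ("content", "String"), ("node_type", "String"), ("label", "String"),
    ("salience", "Float"), ("importance", "Float"), ("confidence", "Float"),
    ("tier", "String"), ("tags", "String"),
]

# Parameter names and declared types of the old wrapper (engram.el:15-17).
OLD_WRAPPER = [
    ("content", "String"), ("nt", "String"), ("sal", "Float"), ("imp", "Float"),
    ("source", "String"), ("lang", "String"), ("ts", "Int"), ("tags", "String"),
]


def mismatched_slots(wrapper, contract):
    """Return (c_slot, wrapper_param, wrapper_type) for each position where the
    wrapper's declared type differs from the type the C slot expects."""
    bad = []
    for (w_name, w_type), (c_name, c_type) in zip(wrapper, contract):
        if w_type != c_type:
            bad.append((c_name, w_name, w_type))
    return bad


for c_slot, w_param, w_type in mismatched_slots(OLD_WRAPPER, C_CONTRACT):
    print(f"{c_slot} <- {w_param} ({w_type})")
```

Running the same comparison against the corrected signature should report no mismatches, which is a cheap regression check if the C contract changes again.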
---
## 2. Fix applied — wrapper (`el/lang/runtime/engram.el`)
Corrected to match the C contract 1:1 (no coercion, no reorder):
```
fn engram_node_full(content: String, node_type: String, label: String,
salience: Float, importance: Float, confidence: Float,
tier: String, tags: String) -> String {
// validation (see §4), then:
return __engram_node_full(content, node_type, label, salience, importance, confidence, tier, tags)
}
```
## 3. Fix applied — caller audit
Audited every caller (`chat.el`, `awareness.el`, `soul.el`, `memory.el`, `routes.el`, `neuron-api.el`).
**All `engram_node_full` callers already use the correct order** — so the wrapper fix repairs them automatically. **One real caller bug** fixed:
`neuron/chat.el:512` was:
```
engram_node(clean_response, "episodic", el_from_float(0.6)) // "episodic" = a TIER in the node_type slot
```
Now:
```
engram_node_full(clean_response, "Conversation", "soul:utterance",
el_from_float(0.6), el_from_float(0.6), el_from_float(0.8),
"Episodic", utterance_tags)
```
## 4. Fix applied — validation (defense in depth, `engram.el`)
Added `engram_valid_node_type` / `engram_valid_tier` allowlists. Both `engram_node` and `engram_node_full` now **reject invalid values with `__println` + return `""`** (fail loud, never silently write a malformed node).
- node_type allowlist: Memory, Knowledge, Belief, Project, Tag, BacklogItem, Artifact, Conversation, ExecutionContext, InternalStateEvent, Self, Entity, Process, ConfigEntry, Concept, Imprint *(union of the spec list + types actually present in the store — trim if some are illegitimate).*
- tier allowlist: Semantic, Episodic, Working, Procedural, Canonical, Note, Lesson
- **Note:** `el_val_t` is untyped, so this catches wrong VALUES, not wrong TYPES. Type safety comes from the corrected signatures.
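
For reviewers who want to reason about the allowlist behaviour outside EL, here is a rough Python analogue. It is a sketch under assumptions: `checked_write` and its `write` callback are hypothetical names, and the real checks are the EL functions `engram_valid_node_type` / `engram_valid_tier`. The two sets are copied from the lists above.

```python
# Python analogue of the EL allowlist guard; names here are illustrative only.
VALID_NODE_TYPES = frozenset({
    "Memory", "Knowledge", "Belief", "Project", "Tag", "BacklogItem",
    "Artifact", "Conversation", "ExecutionContext", "InternalStateEvent",
    "Self", "Entity", "Process", "ConfigEntry", "Concept", "Imprint",
})
VALID_TIERS = frozenset({
    "Semantic", "Episodic", "Working", "Procedural", "Canonical", "Note", "Lesson",
})


def checked_write(node_type, tier, write):
    """Call write() and return its id only when both values are allowlisted.

    On an invalid value, print a message and return "" without writing,
    mirroring the fail-loud behaviour described for engram.el.
    """
    if node_type not in VALID_NODE_TYPES:
        print(f"engram: rejected node_type {node_type!r}")
        return ""
    if tier not in VALID_TIERS:
        print(f"engram: rejected tier {tier!r}")
        return ""
    return write()
```

Note that the comparison is case-sensitive in this sketch, so the old `chat.el` value `"episodic"` in the `node_type` slot is rejected, and so is an integer tier such as `1`.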
> All edits above are in the working tree on Tim's machine but **NOT compiled/deployed** and **NOT compile-verified** (no `elc` on that box).
---
## 5. DEPLOY RUNBOOK (your build env)
1. Pull the edited files: `el/lang/runtime/engram.el`, `neuron/chat.el`.
2. Build: `elc` (entry `neuron/soul.el`, import chain) → `neuron/dist/*.c`, then link as in `el/lang/install.sh` (`$(CC) $(CFLAGS) -o dist/neuron-fresh dist/*.c .../el_runtime.c -lcurl -lpthread`). Confirm `engram.el` recompiles into the import chain.
3. Restart the soul. **Note:** on Tim's box it's run by `/tmp/soul-keepalive.sh` (an auto-restart loop) → stop that loop before killing `neuron-fresh`, or it'll respawn the old binary.
4. **Verify (prove end-to-end):** write a node via the live API (POST `/api/memories` or the remember path) with an obvious throwaway label, then read it back and confirm `node_type` + `tier` are correct AND that it persisted (node_count increments; survives a snapshot save). There is **no delete endpoint** — clean up via the snapshot.
---
## 6. Data analysis + prune proposal (NOT applied)
- Snapshot: `~/.neuron/engram/snapshot.json`. **Backup made:** `~/.neuron/engram/snapshot.backup-20260608.json`.
- **~107 corrupt nodes** (node_type/tier not in the valid sets). node_type junk values: `''`, `'1'`, `'2'`, `'ntn-genesis'`, `'claude-opus-4-8'`, binary. tier junk: same + `'/Users/timlingo'`.
- **0 are field-repairable.** They're all genesis-bootstrap / binary detritus where *every* field (id/label/tier/tags) is corrupted together — 69× "You are ntn-genesis, a CGI.", 62× "ntn-genesis", ~70 binary garbage, plus a proxy URL + an API path that leaked into labels. No signal to reconstruct → **prune, don't fabricate.**
- **Proposal:** `~/.neuron/engram/snapshot.pruned.json` — 3,631 clean nodes (107 junk removed), edges intact (no dangling). Byte-verified: no *clean* node contains binary content, so re-encoding is lossless.
- **NOT applied** because the live daemon is **actively rewriting `snapshot.json`** (two reads returned different counts). Applying requires stopping the soul + keepalive, swapping in the pruned snapshot, then restarting. Do this in your controlled env with the backup retained.
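
The prune logic can be sketched as follows. This is not the script that produced `snapshot.pruned.json`; it is a minimal illustration that assumes a snapshot shaped like `{"nodes": [...], "edges": [...]}`, with `id`, `node_type`, and `tier` on each node and `source` / `target` on each edge. Those field names are assumptions and should be checked against the real snapshot before use.

```python
def _allowed(value, allowlist):
    # Corrupt fields may hold numbers, lists, or dicts; only strings can be valid.
    return isinstance(value, str) and value in allowlist


def prune_snapshot(snapshot, valid_types, valid_tiers):
    """Return (pruned_snapshot, junk_nodes).

    A node is kept only when both node_type and tier are allowlisted.
    Raises ValueError if removing the junk nodes would leave a dangling edge.
    """
    clean, junk = [], []
    for node in snapshot["nodes"]:
        ok = (_allowed(node.get("node_type"), valid_types)
              and _allowed(node.get("tier"), valid_tiers))
        (clean if ok else junk).append(node)

    kept_ids = {node["id"] for node in clean}
    dangling = [e for e in snapshot["edges"]
                if e["source"] not in kept_ids or e["target"] not in kept_ids]
    if dangling:
        raise ValueError(f"{len(dangling)} edge(s) would dangle after pruning")

    # Copy every other top-level key unchanged; only the node list is replaced.
    return {**snapshot, "nodes": clean}, junk
```

Because the live daemon rewrites `snapshot.json`, a function like this is only meaningful on a copy taken while the soul and keepalive loop are stopped.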
---
## 7. Security heads-up (please action)
- `ANTHROPIC_API_KEY` is stored **in plaintext** in `/tmp/soul-keepalive.sh` — rotate it and move to a secret store.
- Internal infra leaked into node fields (`http://localhost:7771`, `/api/graph/edges?limit=5000`) — symptom of the same write bug; the prune removes those nodes.
## 8. Backlog of related gaps (separate from this fix)
- Soul chat loop reports **no tools** (`NONE`) / `NO_SHELL` — it narrates `curl`/`sqlite3` without executing. The capture REST path works, but the chat agent can't call it.
- **No `PUT`/`DELETE`** on knowledge nodes (`method not allowed`) — needed for UI edit/delete.
- No **source-conversation** edge on captured nodes — blocks "see source chat" in the UI.
- Writes have been **frozen since ~2026-04-29** (newest knowledge node) — nothing is being added in the current running state.
---
## ADDENDUM — Phase 0 live runtime findings (2026-06-08, verified against the running system)
Validated the write path end-to-end against `neuron-fresh :7770` + `engram :8742`. Confirms the diagnosis and corrects two common assumptions.
**Ports:** `engram :8742` ✓ listening (healthy: `{"status":"ok","engine":"engram-runtime-native"}`), `neuron-fresh :7770` ✓, **`:7771` NOT listening.**
**Two distinct write failures (not one):**
1. **`/api/neuron/knowledge/capture` + memory remember** — handled **in-process by the soul** (`neuron-api.el` `handle_api_capture_knowledge` / remember → `engram_node_full(...)`). Live test: `POST …/knowledge/capture` returned `{"id":"2ccfc147…","ok":true}` but that id is **absent from `/api/graph/nodes` and `snapshot.json`** → the node corrupted/vanished. **This is exactly the `engram_node_full` wrapper bug this PR fixes.** It is NOT a `:7771` issue. → fixed by el PR #52 + soul rebuild.
2. **`/api/backlog`, `/api/memories`, `/api/knowledge`, `/api/artifacts`, `/api/projects`, `/api/imprints`** — `routes.el` proxies these to **`axon`** via `axon_get`/`axon_post` (base `SOUL_AXON` or default **`http://localhost:7771`**). `axon` = **`protocols/axon`, an unbuilt Rust crate**, not running → "Failed to connect to localhost port 7771." → needs axon stood up (separate Rust workstream) OR routes repointed.
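
Before debugging a proxied route, it may help to confirm whether anything is listening at the axon base URL. The sketch below resolves the base the way the text describes (`SOUL_AXON`, else `http://localhost:7771`) and attempts a TCP connection. Treating an empty `SOUL_AXON` as unset is an assumption about `routes.el`, and `axon_base` / `is_listening` are hypothetical helper names.

```python
import os
import socket
from urllib.parse import urlsplit

DEFAULT_AXON = "http://localhost:7771"


def axon_base(env=os.environ):
    """Return SOUL_AXON if set and non-empty, else the documented default."""
    return env.get("SOUL_AXON") or DEFAULT_AXON


def is_listening(base_url, timeout=1.0):
    """Return True if a TCP connection to the URL's host and port succeeds."""
    parts = urlsplit(base_url)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    try:
        with socket.create_connection((parts.hostname, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False
```

Calling `is_listening(axon_base())` on the box described here would be expected to return `False` until axon is stood up. A successful connection only shows that something accepts TCP on that port, not that it is axon.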
**Architecture clarifications (so nobody chases the wrong port again):**
- The soul runs in **file-snapshot mode** (no `ENGRAM_URL` in `/tmp/soul-keepalive.sh`) → it uses `~/.neuron/engram/snapshot.json`, **not `engram :8742` live**. So writing to `:8742` does NOT make data visible to the soul the app talks to.
- `engram :8742` is its own EL service (`engram/src/server.el`) with a **working CRUD API**: `POST/GET/DELETE /api/nodes`, `/api/edges`, `/api/save`, `/api/load`, `/api/activate`, `/api/search`. Verified create+delete (`{"ok":true}`). **But** its `route_create_node` only reads `content/node_type/salience` (**no label/tier/tags/metadata**), so it can't set `metadata.tier_source: canonical`.
- Minor EL bug in `engram/src/server.el route_create_node`: `if str_eq(node_type,""){ let node_type = "Memory" }` **shadows** (new local) instead of reassigning → the default never applies; same for `salience`. Worth fixing while in there.
**Verification plan (run after the soul rebuild lands):**
1. `POST /api/neuron/knowledge/capture {content,title,tier:canonical}` → capture the returned id.
2. `GET /api/neuron/knowledge/search?q=<term>` → confirm the node comes back with correct `node_type`/`metadata.tier_source`.
3. Confirm it survives a snapshot save (present in `snapshot.json`). Only then is the write "real."
4. Backlog: once `axon :7771` is up, repeat for `POST /api/backlog`.
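
Step 3 is the one that caught the original failure (an `ok:true` response whose id never reached the snapshot), so it is worth scripting. The sketch below assumes the snapshot has a top-level `nodes` list and that each node carries `id`, `node_type`, and a `metadata` object with `tier_source`; that layout and the helper names are assumptions, not the confirmed schema.

```python
import json


def load_snapshot(path):
    """Read a snapshot file such as ~/.neuron/engram/snapshot.json."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def persistence_problems(snapshot, node_id, node_type, tier_source):
    """Return a list of problems; an empty list means the write looks real."""
    node = next(
        (n for n in snapshot.get("nodes", []) if n.get("id") == node_id), None
    )
    if node is None:
        return [f"node {node_id} is absent from the snapshot"]

    problems = []
    if node.get("node_type") != node_type:
        problems.append(f"node_type is {node.get('node_type')!r}, expected {node_type!r}")
    actual = (node.get("metadata") or {}).get("tier_source")
    if actual != tier_source:
        problems.append(f"metadata.tier_source is {actual!r}, expected {tier_source!r}")
    return problems
```

Pass the id returned by the capture call in step 1 and run the check only after a snapshot save, since a node that exists in memory but not on disk would still fail step 3.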
**Net:** "make writes persist" needs (a) **this wrapper fix built into the soul** (capture) and (b) **`axon :7771` running** (backlog/artifacts/etc.). Neither was doable on Tim's box (no `elc`; `axon` is unbuilt Rust, out of scope per the no-Rust guardrail). No live writes/restarts were performed against the soul; the only write was a probe node on `engram :8742`, created and then deleted to verify the API.
+9 -1
@@ -520,7 +520,15 @@ fn handle_dharma_room_turn(body: String) -> String {
 // Record what the soul said, not where it was or with whom. Experience
 // accumulates in the engram through the content of what was said.
 let snap_path: String = state_get("soul_snapshot_path")
-let discard_id: String = engram_node(clean_response, "episodic", el_from_float(0.6))
+// Record what the soul said as a Conversation node with an Episodic tier. (Was:
+// engram_node(content, "episodic", ...) which wrongly put a TIER into the node_type
+// slot; that's why nodes showed node_type="episodic". Use the full, correct contract.)
+let utterance_tags: String = "[\"soul-utterance\",\"episodic\"]"
+let discard_id: String = engram_node_full(
+clean_response, "Conversation", "soul:utterance",
+el_from_float(0.6), el_from_float(0.6), el_from_float(0.8),
+"Episodic", utterance_tags
+)
 if !str_eq(snap_path, "") {
 let discard_save: String = engram_save(snap_path)
 }