neuron/MEMORY_RECALL_BUG.md
Tim Lingo 3bb17a5296 feat(soul): add safety module, expand connectors API, memory-recall bug notes
- safety.el/.elh: new safety module
- neuron-api.el, routes.el, soul.el, chat.el: connectors API expansion
- regenerated dist/ C artifacts
- MEMORY_RECALL_BUG.md: investigation notes

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-15 11:10:33 -05:00


Memory Recall Bug — Handoff for Will

Reported by: Tim (via the Neuron UI chat)
Diagnosed by: Claude (Claude Code session), 2026-06-05
Symptom: The soul can't recall anything specific. Asked, for example, "do you remember the jokes from that night with Will, Tim, and April?", it has no idea, and it correctly self-reports that either retrieval is failing or the memory was never captured.


TL;DR

The memories are almost certainly intact in the graph. The problem is the retrieval layer: engram_search_json and engram_activate_json return empty results for every query, so the chat falls back to two hardcoded pinned nodes and effectively remembers nothing. It strongly looks like the embedding / search index was never built or isn't loaded at boot.

Separately: the soul daemon on :7770 was down at the end of the investigation (it had been up earlier in the session — it died/stopped partway through). Restart needed before any of this can be re-tested.


Evidence

All commands run against the live services during the session.

Search/activate return nothing — even for guaranteed-present terms

curl "http://127.0.0.1:8742/api/search?q=MUDCraft&limit=3"  -H "X-API-Key: ntn-user-2026"   # → []
curl "http://127.0.0.1:8742/api/search?q=neuron&limit=3"    -H "X-API-Key: ntn-user-2026"   # → []
curl "http://127.0.0.1:8742/api/search?q=Will&limit=3"      -H "X-API-Key: ntn-user-2026"   # → []
curl "http://127.0.0.1:8742/api/activate?q=jokes&depth=3"   -H "X-API-Key: ntn-user-2026"   # → {"results":[]}

# soul's in-process equivalents (port 7770), also empty:
curl "http://127.0.0.1:7770/api/neuron/recall?query=neuron"            # → (empty)
curl "http://127.0.0.1:7770/api/neuron/knowledge/search?q=MUDCraft"    # → (empty)

But the raw data is present

curl "http://127.0.0.1:7770/api/graph/nodes?limit=2"
→ [{"id":"mem-30425134-...","content":"CGI ARCHITECTURE ? THREE LAYERS, MCP RETIRED ...

/api/graph/nodes is served by engram_scan_nodes_json(9999, 0) (routes.el:223-224) and returns hundreds of rich nodes. So node storage is fine — only the search/activation index is dead.

The standalone engram reports zero nodes

curl "http://127.0.0.1:8742/api/stats"   # → {"node_count":0,"edge_count":0,"layer_count":5}

Note: the standalone engram process on :8742 reports 0 nodes, while the soul's in-process engram (:7770) has the data. Worth confirming which engram instance is the source of truth and whether they've diverged. (The :8742 process was also showing up as engram --help in ps, which is suspicious — may not be a real server instance.)
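A quick way to re-check the topology after a restart is to compare the counters the two instances report. The sketch below is in Python rather than EL and is only an illustration: fetch_json, compare_counts, and check_live are hypothetical helper names, and it assumes that /api/stats on :8742 and /health on :7770 keep returning node_count and edge_count in the shapes shown in this note.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, api_key=None, timeout=5):
    """GET a URL and parse the response body as JSON."""
    headers = {"X-API-Key": api_key} if api_key else {}
    with urlopen(Request(url, headers=headers), timeout=timeout) as resp:
        return json.load(resp)


def compare_counts(standalone, in_process):
    """Return a short verdict from two dicts carrying node_count / edge_count."""
    a = (standalone.get("node_count", 0), standalone.get("edge_count", 0))
    b = (in_process.get("node_count", 0), in_process.get("edge_count", 0))
    if a == b:
        return "in sync"
    if a == (0, 0):
        return "standalone is empty"
    if b == (0, 0):
        return "in-process is empty"
    return "diverged"


def check_live(api_key):
    """Query both local services (ports taken from this note) and compare them."""
    standalone = fetch_json("http://127.0.0.1:8742/api/stats", api_key=api_key)
    in_process = fetch_json("http://127.0.0.1:7770/health")
    return compare_counts(standalone, in_process)
```

With the values recorded in this note (0 nodes on :8742, thousands on :7770), compare_counts would report "standalone is empty". That is consistent with the :8742 process not being a real server instance, but it does not prove it.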


Root cause (where it breaks in code)

neuron/chat.el → engram_compile(intent) (lines 15-53) builds the entire memory context for every chat turn from exactly two sources:

let activate_json: String = engram_activate_json(intent, 5)   // returns []
let search_json:   String = engram_search_json(intent, 15)    // returns []

When both are empty, it falls back to two hardcoded nodes by literal ID (chat.el:29-41):

// "Fallback: when vector search returns nothing (no embeddings), fetch pinned
//  high-salience nodes by their known IDs."
let family_node = engram_get_node_json("knw-35940684-abc4-42f0-b942-818f66b1f69a")
let origin_node = engram_get_node_json("knw-729fc901-8335-44c4-9f3a-b150b4aa0915")

So today the soul's entire recallable memory in a chat is exactly those two nodes. That's why it can't surface jokes, social moments, the dynamic with Tim/April, or anything else specific.

The comment ("when vector search returns nothing (no embeddings)") is the key hint: this fallback was written expecting the embedding index to sometimes be absent — and right now it's absent all the time.

Affected callers all funnel through the same two dead builtins:

  • handle_api_recall (neuron-api.el:118) — engram_search_json
  • handle_api_search_knowledge (neuron-api.el:135) — engram_search_json + engram_activate_json
  • engram_compile (chat.el:15) — both

Working callers use a different builtin (engram_scan_nodes_json / engram_scan_nodes_by_type_json), which is why graph/list views work but recall doesn't.


Fix options (Will's call)

Option 1 — Proper fix: rebuild/restore the embedding + activation index

engram_search_json and engram_activate_json are native runtime builtins. They're returning empty because (most likely) the vector/search index was never built or isn't loaded at boot, even though node storage loads fine. Investigate the engram boot path: does it build embeddings for loaded nodes? Is there an index file that's missing/stale? Fixing this restores recall everywhere at once. This is the real fix.

Option 2 — Pragmatic EL-level fallback (no native changes)

Since engram_scan_nodes_json() works, engram_compile could do a keyword scan when the vector path is empty: pull nodes, substring/token match the query against content + label, rank by overlap, return the top N. Restores basic recall even with the vector index down. ~20 lines of EL in engram_compile, but requires a soul rebuild + restart. Claude offered to write this patch for your review if you want it — say the word.

Tradeoff: keyword matching is much weaker than semantic recall (won't find "jokes" unless the node text literally contains joke-ish words), but it's strictly better than the current two-node fallback and needs no native/runtime work.
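To make the shape of Option 2 concrete, here is a minimal sketch of the ranking logic in Python. It is not the EL patch: the real version would live in engram_compile and read its nodes from engram_scan_nodes_json. The function names are hypothetical, the "label" field is assumed from the description above (the sample node output only shows "id" and "content"), and the default limit of 15 simply mirrors the existing engram_search_json(intent, 15) call.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_recall(nodes, query, limit=15):
    """Rank nodes by how many distinct query tokens appear in content + label."""
    query_tokens = tokenize(query)
    if not query_tokens:
        return []
    scored = []
    for node in nodes:
        # 'label' is an assumed field; missing fields count as empty text.
        text = f"{node.get('content', '')} {node.get('label', '')}"
        overlap = len(query_tokens & tokenize(text))
        if overlap:
            scored.append((overlap, node))
    # Python's sort is stable, so nodes with equal overlap keep scan order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [node for _, node in scored[:limit]]
```

Because this matches whole tokens, a query for "jokes" finds a node containing "jokes" but not one that only says "joke" or "funny story", which is the weakness described in the tradeoff. Stemming or substring matching would soften that at the cost of more noise.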


Also needs attention

  • Soul daemon (:7770) was down at end of session — restart and confirm it stays up.
  • Confirm the engram instance topology — :8742 standalone shows 0 nodes while the soul's in-process engram has the data. Make sure chat is reading the populated one and they haven't diverged.
  • Social memory weighting (Tim's deeper point): even once retrieval works, jokes / interpersonal moments may not be tagged or salience-weighted to surface as "important." Worth a look at how those get captured and scored — but that's secondary to getting retrieval working at all.

Daemon lifecycle — needs a supervisor (NEW, 2026-06-06)

The soul daemon crashed again the next day. It had been up earlier, then died on its own (not from any change). When it's down, the UI's Backlog / Artifacts / Knowledge / Graph / Memories tabs all go blank, because they read from :7770/api/graph/nodes. The chat also stops working. This is the second unexplained death in two days.

How it's currently run (fragile)

  • Binary: neuron/dist/neuron-fresh (compiled from the EL sources)
  • Launched manually as a bare background process (./neuron-fresh &) — no supervisor, no auto-restart, no crash logging beyond stdout. When it dies, it stays dead until a human notices the blank UI and restarts it.
  • Boot log only shows [http] listening on [::]:7770 — there's no captured stack/exit reason when it crashes, so we can't yet say why it's dying.
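The gaps listed above (no auto-restart, no recorded exit reason) could be closed with very little code. The following Python sketch is one hedged way to do it for local dev. The supervise function and its parameters are hypothetical, and it assumes the daemon can be launched as a plain child process that inherits its environment (ANTHROPIC_API_KEY, NEURON_PORT).

```python
import subprocess
import time
from datetime import datetime


def supervise(cmd, log_path, max_runs=None, backoff_seconds=2.0):
    """Run cmd repeatedly, appending its output and each exit code to log_path.

    max_runs=None restarts forever; a number stops after that many runs and
    returns the last exit code.
    """
    runs = 0
    while True:
        with open(log_path, "a") as log:
            log.write(f"{datetime.now().isoformat()} starting {cmd}\n")
            log.flush()  # keep our line ahead of the child's own output
            code = subprocess.call(cmd, stdout=log, stderr=subprocess.STDOUT)
            log.write(f"{datetime.now().isoformat()} exited with code {code}\n")
        runs += 1
        if max_runs is not None and runs >= max_runs:
            return code
        time.sleep(backoff_seconds)  # avoid a tight crash loop
```

A call such as supervise(["./neuron-fresh"], "/tmp/soul-supervised.log") from neuron/dist would keep the daemon up and leave a timestamped exit code for every death. On POSIX systems a negative code means the process was killed by that signal number, which already narrows down whether the crashes are panics, kills, or clean exits.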

How I restarted it (for reference)

# snapshot lives at ~/.neuron/engram/snapshot.json (loaded on boot, ~9.7MB)
# ALWAYS back it up first — genesis boot re-saves it:
cp ~/.neuron/engram/snapshot.json ~/.neuron/engram/snapshot.backup-$(date +%Y%m%d-%H%M%S).json

cd neuron/dist
ANTHROPIC_API_KEY='<key>' NEURON_PORT=7770 ./neuron-fresh > /tmp/soul-restart.log 2>&1 &
# verify:
curl -s http://127.0.0.1:7770/health
# → {"status":"alive","cgi_id":"ntn-genesis","boot":2,"node_count":3660,"edge_count":14207,...}

After this, data came back: 3,660 nodes / 14,207 edges; Backlog 485, Memory 493, etc.

Recommendations for Will

  1. Put it under a supervisor so it auto-restarts on crash and logs exit codes:
    • macOS dev: a launchd LaunchAgent plist (KeepAlive=true), or brew services, or even a simple while true; do ./neuron-fresh; done wrapper with timestamped logs.
    • Prod/k8s already has entrypoint.sh + restart policy — the gap is the local dev run path.
  2. Capture crash diagnostics — redirect stdout/stderr to a rotating logfile and, if the EL runtime can, dump a reason on exit. Right now we're blind to the cause.
  3. Find the root cause of the crashes — two self-deaths in two days suggests a real bug (memory? an unhandled request? a panic in a native builtin?). The supervisor stops the symptom (blank UI) but not the underlying instability.
  4. Snapshot safety — genesis boot calls engram_save(snapshot) (soul.el:240,248). A crash mid-save could corrupt the 9.7MB memory file. Consider write-to-temp + atomic rename, and/or periodic timestamped backups, so a bad save can't lose Neuron's memory.
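The write-to-temp + atomic rename idea from item 4 can be sketched as follows. This is Python for illustration only. The actual save happens inside engram_save, whose internals were not examined here, so atomic_write_json is a hypothetical stand-in that shows the pattern, not the runtime's behaviour.

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    """Write data as JSON so that path always holds either the old or new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must be in the same directory (same filesystem) as the
    # target, otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".snapshot-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap on POSIX and Windows
    except BaseException:
        # A failed or interrupted save leaves the previous snapshot untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies mid-write, the worst case is a stray temp file next to the snapshot. The snapshot itself is only replaced once the new contents are fully written, so a crash can no longer leave a half-written memory file.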

What was NOT touched

No backend EL code and no engram data were modified — the memory-recall diagnosis is read-only. The only operational action taken was restarting the already-existing neuron-fresh daemon (after backing up the snapshot) to bring the blank UI tabs back; no source or data was changed by that. All UI work this session was in neuron-ui and is unrelated to this bug.