Engram write-corruption: chat.el caller fix + full handoff #1
Reference in New Issue
Block a user
Delete Branch "fix/engram-write-corruption-handoff"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Caller fix (tier in node_type slot) + HANDOFF-engram-write-corruption.md with root cause, coercion mechanism, caller audit, validation, deploy runbook, and prune proposal. Pairs with el PR #52 (the wrapper fix). NOT yet built/deployed — needs elc build env + controlled daemon restart.
Added a Phase 0 live-runtime addendum to HANDOFF-engram-write-corruption.md (commit
2112d2f). Validated against the running system:Code Review — Engram write-corruption: chat.el caller fix + full handoff
Summary
This PR fixes a real, confirmed data-corruption bug:
chat.el:512was callingengram_node(content, "episodic", ...)which put a memory tier ("episodic") into thenode_typeslot, producing corrupted graph nodes. The fix upgrades the call to the correctengram_node_fullwith all eight fields properly populated. The HANDOFF document is thorough, accurate about what is and isn't done, and gives the next engineer (Will) everything needed to complete the deploy. Overall quality is high for a handoff-style PR.What was done well
chat.el:512-520— the fix itself is correct. The old three-argumentengram_node(content, "episodic", 0.6)call had"episodic"in thenode_typeslot; the newengram_node_fullcall uses"Conversation"fornode_typeand"Episodic"fortier, which matches the C contract documented inel_seed.h:204. The slot mapping is unambiguous and right.Complementary scope. The PR correctly identifies that this is the caller half of a two-part fix. el PR #52 corrects the wrapper signature and adds
engram_valid_node_type/engram_valid_tiervalidation; this PR corrects the call site. Neither PR alone is sufficient; together they close the write path end-to-end.HANDOFF document quality. Sections 1–5 are exceptionally clear: root cause with the coercion mechanism explained at the
el_val_t/uintptr_tlevel, a full caller audit with results, validation design rationale, a step-by-step deploy runbook including the keepalive trap (/tmp/soul-keepalive.shrespawn), and a concrete verification plan that requires end-to-end confirmation before the fix is considered done. This is the right level of detail for an unbuilt change.Data analysis (§6). Identifying ~107 corrupt nodes, confirming 0 are field-repairable, building a pruned snapshot, and explicitly NOT applying it while the daemon is live — this is responsible ops hygiene. The backup-first approach is correct.
Phase 0 addendum (§ addendum). Distinguishing the two distinct write-failure modes (in-process
engram_node_fullcorruption vs.axon :7771not running) prevents weeks of misdirected debugging. The architecture clarification about file-snapshot mode vs. HTTP mode is high-value.Security callout (§7). Flagging the plaintext
ANTHROPIC_API_KEYin/tmp/soul-keepalive.shwas the right thing to surface proactively.CI result. CI passed (
Neuron Soul CI / build— success in 3m17s). The build pipeline useselbto recompile from.elsources, so the fixedchat.elis what ends up in the binary; the staledist/chat.cis overwritten by CI and not the source of truth.Issues / Needs Work
BLOCKING —
SessionSummarynot in theengram_valid_node_typeallowlistFile:
neuron-api.el:372-376(caller) / el PR #52lang/runtime/engram.el(allowlist)neuron-api.elpasses"SessionSummary"asnode_typetoengram_node_full:The
engram_valid_node_typeallowlist added in el PR #52 does not include"SessionSummary". Once both PRs are merged and the soul is rebuilt, every call tohandle_api_consolidatethat has a non-empty summary will be silently rejected ([engram] REJECTED node write — invalid node_type 'SessionSummary') and the session summary write will be dropped. This is a regression introduced by the combination of the two PRs.Fix: Add
"SessionSummary"toengram_valid_node_typein el PR #52, or change the call inneuron-api.elto use an existing valid type such as"InternalStateEvent"and add"session-summary"to the tags. Whichever you choose, document it in the HANDOFF so the allowlist decision is recorded.Non-blocking — Static label
"soul:utterance"does not uniquify per utteranceFile:
chat.el:517The
handle_dharma_room_turnfix uses the fixed label"soul:utterance"for every utterance:If
engram_node_fulldeduplicates on label (asconv_history_persistintentionally does with"conv:history"), then only the most recent dharma-room utterance will exist in the graph at any time. This may be intentional (one live utterance node), but it differs fromauto_persistatchat.el:650-659which generates a unique"chat:" + ts_strlabel per exchange. If the intent is to accumulate a history of what the soul said across dharma-room turns, a timestamp suffix is needed. If the intent is to keep only the latest (low-value ephemeral utterances), a comment explaining this would prevent future confusion.Non-blocking —
discard_idfailure is not observable from the callerFile:
chat.el:516-520The return value of
engram_node_full(the new node id, or""on validation failure) is assigned todiscard_idand then ignored. With the new validation in el PR #52, a misconfigured call will print a[engram] REJECTEDlog line but the handler will return a success response to the client. This is acceptable for a best-effort record-keeping write, but it means write failures during the post-deploy verification (§5 of the runbook) will only be detectable by scanning daemon logs, not from the API response. The runbook should note this explicitly so whoever runs the verification step knows to tail logs, not just check the response.Non-blocking —
engram/src/server.el route_create_nodevariable shadowing (mentioned in HANDOFF, not fixed here)File:
engram/src/server.el(not in this PR)The HANDOFF correctly identifies that
if str_eq(node_type,""){ let node_type = "Memory" }creates a new local binding rather than reassigning the outer variable, so the default never applies. This is a separate code path from what this PR fixes, but it meansengram :8742's own CRUD API has a latent bug. It's out of scope here but should be a follow-up backlog item.Build result
Emacs is not available on the review environment (
emacs not found). This is a custom EL language (compiled byelcto C, then linked); standard Emacs byte-compilation is not applicable. The CI pipeline (Neuron Soul CI / build) ran against the head commit and succeeded in 3m17s, which meanselbsuccessfully compiled all.elsources including the fixedchat.elinto adist/neuronbinary. Thedist/*.cfiles checked into the repo are pre-existing snapshots and are not what gets deployed — the CI recompiles from source on every PR.The binary in
dist/chat.c:384still contains the oldengram_node(clean_response, EL_STR("episodic"), el_from_float(0.6))call (the artifact was not regenerated in this PR), but this is not a problem because the CI build step overwrites it from the corrected.elsource.Verdict: REQUEST_CHANGES
The
SessionSummaryallowlist gap is a blocking regression that will silently break session-summary writes once both PRs are merged. Everything else is high quality. Fix the allowlist in el PR #52 (or retype the call inneuron-api.el), confirm thesoul:utterancelabel design is intentional, and this is ready to merge.Reviewed — code looks good, CI verified locally. Approved to merge.
Handoff (auto, for the review pipeline) — prereq: merge #16 first so generated dist/ diffs are suppressed.
WHAT: chat.el caller fix (+9 -1) + handoff doc. Real source = 135 lines; only the 9-line fix needs review.
RISK: low — targeted fix to a known engram write-corruption path.
VERIFY: soul builds; engram write round-trips cleanly.
ORDER: merge early (small, isolated, real bug).
Pull request closed