Reconcile live runtime data-integrity fixes onto main (UAF + atomic engram_save) #58

Merged
tim.lingo merged 3 commits from fix/runtime-integrity-reconcile into stage 2026-06-17 18:33:22 +00:00

Reconciles the live runtime's data-integrity fixes onto main (HANDOFF #2). These fixes
existed only in the un-versioned el-sdk source the live macOS soul was hand-built from
(captured in the [DO NOT MERGE] chore/live-darwin-runtime snapshot). This ports them
FORWARD onto main — faithfully, minimally — so CI/elb builds a soul that has them,
WITHOUT dragging in the snapshot's deletions of main's newer engram_wm_*/
engram_load_merge/http_serve_async.

Diff is 35+/13- on one file. Three things:

  1. UAF — hallucinated/lost-saves root cause
    engram_new_id + engram_node_full now use el_strdup_persist, not el_strdup. el_strdup
    tracks into the per-request arena that el_request_end() frees when the creating HTTP
    request completes — leaving stored nodes with dangling pointers (corrupted ids,
    "saved but never listed"). Transplanted verbatim from the live runtime; el_strdup_persist
    sites 19 -> 27, matching live exactly. engram_node/engram_node_layered were already
    identical to live (no-op), so no main-only logic was touched.

  2. Atomic engram_save
    Write <path>.tmp, fflush+fsync, rename() over target (atomic on POSIX) so a booting
    soul's engram_load never reads a truncated/0-byte snapshot — that empty-window race was
    the genesis -> nodes=1 -> 63-node-clobber loop. Plus a sparse-write floor: refuse to
    overwrite a >200KB snapshot with one < 1/16 its size (a partial load can never clobber
    a healthy graph). Validated in isolation: standalone harness 11/11; rebuilt the darwin
    soul and booted it on an isolated port — round-tripped 5113 nodes, no .tmp leftover,
    no clobber, live untouched.

  3. Truncation fix — already on main (_tl_fs_read_len binary-safe length), nothing to do.
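
To make the failure mode in item 1 concrete, here is a minimal, self-contained sketch of a request-scoped allocator next to a persistent copy. It is not the el-sdk code: every `toy_*` name is hypothetical, and the real el_strdup, el_strdup_persist and el_request_end() may be implemented quite differently. It only illustrates why a string that a stored node keeps past the end of a request must not come from the request-scoped allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy per-request arena (hypothetical, for illustration only): it remembers
 * every tracked allocation so toy_request_end() can free them all at once. */
#define TOY_ARENA_MAX 64
static void *toy_arena[TOY_ARENA_MAX];
static size_t toy_arena_len = 0;

/* Copy a string into memory owned by the current request. The result is
 * only valid until toy_request_end() runs. */
static char *toy_strdup_tracked(const char *s) {
    assert(s != NULL);
    if (toy_arena_len == TOY_ARENA_MAX) return NULL;
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p == NULL) return NULL;
    memcpy(p, s, n);
    toy_arena[toy_arena_len++] = p;
    return p;
}

/* Copy a string into memory that outlives the request. The caller owns it
 * and releases it with free(). */
static char *toy_strdup_persist(const char *s) {
    assert(s != NULL);
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL) memcpy(p, s, n);
    return p;
}

/* Free everything the request allocated; returns how many blocks were freed.
 * Any pointer handed out by toy_strdup_tracked() dangles after this call. */
static size_t toy_request_end(void) {
    size_t freed = toy_arena_len;
    for (size_t i = 0; i < toy_arena_len; i++) free(toy_arena[i]);
    toy_arena_len = 0;
    return freed;
}

/* A long-lived graph node: its id must survive the creating request. */
struct toy_node { char *id; };

/* Safe pattern: the stored id is a persistent copy.
 * Returns 0 on success, -1 if the copy could not be allocated. */
static int toy_node_init(struct toy_node *node, const char *id) {
    assert(node != NULL);
    node->id = toy_strdup_persist(id);
    return node->id != NULL ? 0 : -1;
}
```

In this sketch, a node initialised with toy_node_init() keeps a valid id after toy_request_end(); had it used toy_strdup_tracked() instead, the id would point at freed memory as soon as the request finished, which is the shape of the bug described in item 1.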

Compiles clean to object. I can't full-link the macOS soul here (neuron.c amalgamation
needs GNU ld --allow-multiple-definition, which ld64 lacks) — this is for your CI/elb to
build and deploy. Once it's in an official build we can blue/green it onto live :7770 and
lift the read-only engram stopgap.
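
For reviewers who want the shape of item 2 without opening the diff, the write-then-rename pattern and the sparse-write floor can be sketched roughly as below. This is an illustrative sketch, not the engram_save from this PR: the `toy_*` names, the return codes, and the reading of ">200KB" as 200 * 1024 bytes are all assumptions. Only the sequence (write `<path>.tmp`, fflush, fsync, rename() over the target) and the two thresholds come from the description above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

#define TOY_FLOOR_MIN_BYTES (200L * 1024L) /* assumed meaning of ">200KB" */
#define TOY_FLOOR_RATIO 16L                /* refuse below 1/16 of old size */

/* Returns 1 when replacing old_size bytes with new_size bytes should be
 * refused: the existing snapshot is large and the new one is far smaller. */
static int toy_sparse_write_refused(long old_size, long new_size) {
    return old_size > TOY_FLOOR_MIN_BYTES &&
           new_size < old_size / TOY_FLOOR_RATIO;
}

/* Hypothetical atomic save. Returns 0 on success, -1 on an I/O error,
 * -2 when the sparse-write floor refuses the overwrite. */
static int toy_save_atomic(const char *path, const void *buf, size_t len) {
    assert(path != NULL && (buf != NULL || len == 0));

    /* Sparse-write floor: compare against the snapshot already on disk. */
    struct stat st;
    if (stat(path, &st) == 0 &&
        toy_sparse_write_refused((long)st.st_size, (long)len))
        return -2;

    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    FILE *f = fopen(tmp, "wb");
    if (f == NULL) return -1;

    /* Get the bytes out of stdio buffers and onto disk before renaming. */
    int ok = fwrite(buf, 1, len, f) == len;
    ok = ok && fflush(f) == 0;
    ok = ok && fsync(fileno(f)) == 0;
    ok = (fclose(f) == 0) && ok;

    /* rename() replaces the target in one step on POSIX, so a concurrent
     * reader sees either the old snapshot or the complete new one. */
    if (!ok || rename(tmp, path) != 0) {
        remove(tmp); /* do not leave a stale .tmp behind on failure */
        return -1;
    }
    return 0;
}
```

A caller would treat -2 as "keep the existing snapshot" rather than as a failure to retry. A stricter variant might also fsync the containing directory after the rename; that step is not mentioned in this PR and is left out here.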

Opened by Neuron on Tim's machine. Supersedes the standalone el PR #57 (atomic-save only);
this one is the proper main-based reconciliation.

tim.lingo changed target branch from main to stage 2026-06-17 18:32:25 +00:00
tim.lingo added 3 commits 2026-06-17 18:32:25 +00:00
Merge stage into main: corruption fix, model passthrough, UTF-8 escaping
El SDK Release / build-and-release (push) Successful in 11m22s
5c94b8680d
fix(runtime): reconcile live data-integrity fixes onto main (UAF + atomic engram_save)
El SDK Release / build-and-release (pull_request) Failing after 17s
2dec76c87a
Ports the fixes that until now lived only in the un-versioned el-sdk source the live
macOS soul was hand-built from (captured in the [DO NOT MERGE] live-darwin-runtime
snapshot) FORWARD onto main, faithfully and minimally — without dragging in the
snapshot's deletions of main's newer engram_wm_*/engram_load_merge/http_serve_async.

1. UAF (hallucinated/lost-saves root cause): engram_new_id + engram_node_full now use
   el_strdup_persist, NOT el_strdup. el_strdup tracks into the per-request arena that
   el_request_end() frees when the creating HTTP request completes — leaving stored
   nodes with dangling pointers (corrupted ids, 'saved but never listed'). Transplanted
   verbatim from the live runtime; el_strdup_persist sites 19->27, matching live.

2. Atomic engram_save: write <path>.tmp, fflush+fsync, rename() over target (atomic on
   POSIX) so a booting soul's engram_load never reads a truncated/0-byte snapshot — the
   genesis -> nodes=1 -> 63-node-clobber loop. Plus a sparse-write floor: refuse to
   overwrite a >200KB snapshot with one < 1/16 its size. (Validated in isolation:
   harness 11/11; rebuilt+booted the darwin soul, round-tripped 5113 nodes, no clobber.)

The response-truncation fix is already on main (_tl_fs_read_len binary-safe length).
Compiles clean. For Will to build through CI/elb and deploy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
tim.lingo merged commit b6187501fd into stage 2026-06-17 18:33:22 +00:00
Reference: neuron-technologies/el#58