fix(recall): address all remaining code review issues

Issue 1 (CRITICAL): Fix auto_persist brace structure. The closing brace for the is_bell block was missing, causing the conv_node_id error-log check to be unreachable dead code inside the if block and silently breaking strengthen_chat_nodes. Add the missing } to close the is_bell block before the conv_node_id guard.

Issue 2 (CRITICAL): Restore session_exists() call in handle_chat_agentic. The behavioral regression replacing session_exists() with !str_contains(session_get(...), '"error"') was reverted. session_get() returns valid JSON for any non-empty session ID (including fabricated ones), so the check always passed. session_exists() does a proper state-index and engram search.

Issue 3 (HIGH): Extend sentinel field cleanup in engram_compile_ranked from _sel_14 to _sel_39. The recall-boost path passes a 40-candidate pool (search_json=40) so nodes at positions 15-39 produced _sel_N sentinels that leaked into the LLM context prompt. Cleanup chain now covers all 40 indices.

Issue 4 (HIGH): Fix engram_is_continuation false positives. Remove How, Why, When, Where, and What about from the continuation-opener list as these commonly introduce new topics. Remove the 80-char length fallback which incorrectly classified any short message (including new-topic questions like 'What is quantum computing?') as a continuation.

Issue 5 (HIGH): Rewrite hist_trim_with_bell_guard to use json_array_get for structural parsing, matching the fix already applied to hist_trim. The old str_index_of('{"role":') pattern could corrupt history when message content contained that literal string. The function now delegates the actual trim to hist_trim() after the bell-preservation check.

Issue 6 (NORMAL): Fix entity_count scoping in engram_extract_entities. Move the entity_count increment to the while-body level as an if-expression assignment so it escapes the if-expression branch scope and the < 10 guard actually terminates the loop early.

Issue 7 (NORMAL): Fix mcp_call_seq race in call_mcp_bridge. Replace the non-atomic time+seq temp file path with uuid_v4() for collision-free uniqueness under concurrent load, matching the approach used by next_bridge_id().

Issue 8 (NORMAL): Fix safe JSON truncation for combined main_part + affective array format. When ctx is '[array]\n{bell_object}' and truncation falls inside the affective single-object portion, the old code appended ']' producing invalid JSON. Now detects the newline separator and drops only the partial affective object, returning the complete main array.

Issue 9 (NORMAL): Handle 4th+ topics in engram_compile. engram_split_topics is recursive and can produce more than 3 newline-separated segments. Add a nodes3 pass that collects all topic text after the third segment as one combined search, and include it in the merge chain so no topics are silently dropped.
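
Illustration for Issue 8: the truncation rule described above can be sketched in Python roughly as follows. This is not the EL implementation from this commit. The names truncate_ctx and fit_array are hypothetical, and the sketch assumes the main array is compact JSON with no raw newlines, so the first newline in ctx is the separator before the affective object.

```python
import json


def fit_array(items: list, limit: int) -> str:
    """Serialize as many leading items as fit within `limit` characters."""
    kept = []
    for item in items:
        candidate = json.dumps(kept + [item], separators=(",", ":"))
        if len(candidate) > limit:
            break
        kept.append(item)
    # Always returns a complete, parseable array (possibly "[]").
    return json.dumps(kept, separators=(",", ":"))


def truncate_ctx(ctx: str, limit: int) -> str:
    """Truncate '[array]\\n{object}' without ever emitting partial JSON.

    Assumption: the main array contains no raw newline, so the first
    newline marks the boundary before the affective object.
    """
    if len(ctx) <= limit:
        return ctx
    main, sep, _affective = ctx.partition("\n")
    if sep and len(main) <= limit:
        # The cut would land inside the affective object: drop that object
        # entirely and return the complete main array unchanged.
        return main
    # The cut lands inside the main array: keep only whole elements.
    return fit_array(json.loads(main), limit)


if __name__ == "__main__":
    sample = '[{"id":1},{"id":2}]\n{"bell":"soft","note":"wellbeing check"}'
    print(truncate_ctx(sample, 30))
    print(truncate_ctx(sample, 12))
```

In both truncating branches the result parses with a standard JSON parser, which is the property the old append-']' behavior lost when the cut fell inside the trailing object.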
feat(recall): recall-completeness improvements
2026-06-22 13:36:41 -05:00 · 2026-06-22 13:11:06 -05:00 · 2026-06-22 12:37:29 -05:00 · 2026-06-22 12:37:21 -05:00 · 2026-06-22 12:34:04 -05:00 · 2026-06-22 12:32:59 -05:00
8 changed files with 830 additions and 443 deletions
@@ -23,11 +23,14 @@ fn ise_post(content: String) -> Void {
    let ise_url: String = env("SOUL_ISE_URL")
    let engram_url: String = if str_eq(ise_url, "") { state_get("soul_engram_url") } else { ise_url }
    if str_eq(engram_url, "") {
-        let discard: String = engram_node_full(
+        let local_id: String = engram_node_full(
            content, "InternalStateEvent", "state-event",
            el_from_float(0.3), el_from_float(0.3), el_from_float(0.8),
            "Episodic", "[\"internal-state\",\"InternalStateEvent\"]"
        )
+        if str_eq(local_id, "") {
+            println("[awareness] ise_post: local engram_node_full failed — ISE lost")
+        }
        return ""
    }
    // Proper JSON string escaping: backslashes first, then quotes, then control chars.
@@ -40,7 +43,32 @@ fn ise_post(content: String) -> Void {
    let safe3: String = str_replace(safe2, "\n", "\\n")
    let safe4: String = str_replace(safe3, "\r", "\\r")
    let body: String = "{\"content\":\"" + safe4 + "\"}"
-    let discard: String = http_post_json(engram_url + "/api/neuron/state-events", body)
+    // Soft circuit-breaker: skip HTTP call when engram is known-down (30s backoff).
+    // Opens after 3 consecutive failures; half-open probe after backoff expires.
+    // TODO(reliability): full async dispatch requires EL runtime futures support.
+    let cb_open: String = state_get("engram_cb_open")
+    if str_eq(cb_open, "1") {
+        let cb_ts_s: String = state_get("engram_cb_open_ts")
+        let cb_ts: Int = if str_eq(cb_ts_s, "") { 0 } else { str_to_int(cb_ts_s) }
+        let cb_elapsed: Int = time_now() - cb_ts
+        if cb_elapsed < 30000 { return "" }
+        state_set("engram_cb_open", "0")
+    }
+    let resp: String = http_post_json(engram_url + "/api/neuron/state-events", body)
+    let cb_failed: Bool = str_eq(resp, "") || str_starts_with(resp, "{\"error\":")
+    if cb_failed {
+        let fn_s: String = state_get("engram_cb_fails")
+        let fn_n: Int = if str_eq(fn_s, "") { 0 } else { str_to_int(fn_s) }
+        let fn_n = fn_n + 1
+        state_set("engram_cb_fails", int_to_str(fn_n))
+        if fn_n >= 3 {
+            state_set("engram_cb_open", "1")
+            state_set("engram_cb_open_ts", int_to_str(time_now()))
+            println("[awareness] engram circuit-breaker OPEN after " + int_to_str(fn_n) + " failures")
+        }
+    } else {
+        state_set("engram_cb_fails", "0")
+    }
    return ""
 }

@@ -540,9 +568,14 @@ fn awareness_run() -> Void {
        let should_refresh: Bool = refresh_elapsed >= refresh_ms
        if should_refresh {
            let engram_url: String = state_get("soul_engram_url")
-            if !str_eq(engram_url, "") {
+            let sc: String = state_get("engram_cb_open")
+            let sc_ts_s: String = state_get("engram_cb_open_ts")
+            let sc_ts: Int = if str_eq(sc_ts_s, "") { 0 } else { str_to_int(sc_ts_s) }
+            let sc_elapsed: Int = now_ts - sc_ts
+            let sync_allowed: Bool = !str_eq(sc, "1") || sc_elapsed >= 30000
+            if !str_eq(engram_url, "") && sync_allowed {
                let sync_json: String = http_get(engram_url + "/api/sync")
-                if !str_eq(sync_json, "") && !str_eq(sync_json, "{}") {
+                if !str_eq(sync_json, "") && !str_eq(sync_json, "{}") && !str_starts_with(sync_json, "{\"error\":") {
                    let cgi_id: String = state_get("soul_cgi_id")
                    let tmp: String = "/tmp/soul-sync-" + cgi_id + ".json"
                    fs_write(tmp, sync_json)
@@ -24,19 +24,23 @@ ENGRAM_DATA_DIR="$ENGRAM_DATA_DIR" \

 ENGRAM_PID=$!

-# Wait for engram to become healthy (up to 30s)
+# Wait for engram to become healthy (up to 60s; GKE Autopilot cold starts can be slow)
 echo "[entrypoint] waiting for engram..."
 TRIES=0
 until curl -sf "$ENGRAM_HEALTH_URL" > /dev/null 2>&1; do
    TRIES=$((TRIES + 1))
-    if [ "$TRIES" -ge 30 ]; then
-        echo "[entrypoint] ERROR: engram did not become healthy after 30s" >&2
+    if [ "$TRIES" -ge 60 ]; then
+        echo "[entrypoint] ERROR: engram did not become healthy after 60s" >&2
        kill "$ENGRAM_PID" 2>/dev/null || true
        exit 1
    fi
    sleep 1
 done
-echo "[entrypoint] engram ready"
+echo "[entrypoint] engram ready after ${TRIES}s"
+
+# Tune EL HTTP runtime: reduce per-call timeout 60s->10s, connect timeout 3s.
+export EL_HTTP_TIMEOUT_MS="${EL_HTTP_TIMEOUT_MS:-10000}"
+export EL_HTTP_CONNECT_TIMEOUT_MS="${EL_HTTP_CONNECT_TIMEOUT_MS:-3000}"

 # Start soul — it takes over as PID 1's foreground process.
 # SOUL_ENGRAM_PATH must NOT be set; ENGRAM_URL triggers HTTP mode.
@@ -46,7 +46,10 @@ fn mem_consolidate() -> String {
 }

 fn mem_save(path: String) -> Void {
-    engram_save(path)
+    let save_result: String = engram_save(path)
+    if str_eq(save_result, "") {
+        println("[memory] mem_save: engram_save failed for " + path + " — snapshot may be incomplete")
+    }
 }

 fn mem_load(path: String) -> Void {
@@ -76,11 +79,14 @@ fn mem_boot_count_inc() -> Int {
    let next: Int = current + 1
    let content: String = "soul:boot_count:" + int_to_str(next)
    let tags: String = "[\"soul-meta\",\"boot-counter\"]"
-    let discard: String = engram_node_full(
+    let boot_node_id: String = engram_node_full(
        content, "Memory", "soul:boot_count",
        el_from_float(0.9), el_from_float(0.9), el_from_float(1.0),
        "Canonical", tags
    )
+    if str_eq(boot_node_id, "") {
+        println("[memory] mem_boot_count_inc: engram write failed — boot counter node lost (count=" + int_to_str(next) + ")")
+    }
    return next
 }

@@ -1,5 +1,4 @@
 import "memory.el"
-import "chat.el"

 // neuron-api.el — Native Neuron cognitive API handlers.
 //
@@ -401,6 +400,7 @@ fn handle_api_log_state_event(body: String) -> String {
    let id: String = engram_node_full(parts, "InternalStateEvent", "state-event:manual",
        el_from_float(0.85), el_from_float(0.85), el_from_float(0.9),
        "Episodic", tags)
+    if !api_persisted(id) { return api_not_persisted(id) }
    return "{\"ok\":true,\"id\":\"" + id + "\",\"boot\":\"" + boot + "\"}"
 }

@@ -453,6 +453,7 @@ fn handle_api_tune_config(body: String) -> String {
    let id: String = engram_node_full(content, "ConfigEntry", key,
        el_from_float(0.85), el_from_float(0.85), el_from_float(0.9),
        "Canonical", tags)
+    if !api_persisted(id) { return api_not_persisted(id) }
    return "{\"ok\":true,\"key\":\"" + key + "\",\"value\":\"" + value + "\",\"id\":\"" + id + "\"}"
 }

@@ -652,15 +653,22 @@ fn handle_api_consolidate(body: String) -> String {
    let summary: String = json_get(body, "summary")
    let snap: String = state_get("soul_snapshot_path")
    if !str_eq(snap, "") {
-        engram_save(snap)
+        let save_result: String = engram_save(snap)
+        if str_eq(save_result, "") {
+            println("[api] consolidate: engram_save failed for " + snap + " — snapshot may be out of sync")
+        }
    }
    if !str_eq(summary, "") {
-        // Use session_summary_write to ensure delete-before-write semantics:
-        // prevents stale SessionSummary accumulation across sessions (issue #11).
-        // session_summary_write handles label indexing, trimming, and dedup.
-        let sum_id: String = session_summary_write(summary)
-        if str_eq(sum_id, "") {
-            println("[api] consolidate: session_summary_write failed — summary not persisted")
+        let safe_summary: String = str_replace(summary, "\"", "'")
+        let tags: String = "[\"SessionSummary\",\"consolidate\"]"
+        let summary_id: String = engram_node_full(
+            "[session-summary] " + safe_summary,
+            "SessionSummary", "session:summary",
+            el_from_float(0.7), el_from_float(0.7), el_from_float(0.9),
+            "Episodic", tags
+        )
+        if str_eq(summary_id, "") {
+            println("[api] consolidate: session summary engram write failed — summary node lost")
        }
    }
    return "{\"ok\":true,\"snapshot\":\"" + snap + "\"}"
@@ -75,14 +75,24 @@ fn strip_query(path: String) -> String {
 }

 fn err_404(path: String) -> String {
-    return "{\"error\":\"not found\",\"code\":\"not_found\",\"path\":\"" + path + "\"}"
+    // __status__ envelope — el_runtime reads the first key and emits HTTP 404.
+    // Issue #3: previously returned HTTP 200 with JSON error body.
+    return "{\"__status__\":404,\"error\":\"not found\",\"path\":\"" + path + "\"}"
 }

 fn err_405(method: String, path: String) -> String {
-    return "{\"error\":\"method not allowed\",\"code\":\"method_not_allowed\",\"method\":\"" + method + "\",\"path\":\"" + path + "\"}"
+    // __status__ envelope — emits HTTP 405.
+    // Issue #3: previously returned HTTP 200 with JSON error body.
+    return "{\"__status__\":405,\"error\":\"method not allowed\",\"method\":\"" + method + "\",\"path\":\"" + path + "\"}"
 }

 fn route_health() -> String {
+    // NOTE (issue #8): This endpoint performs live engram graph queries on every call
+    // (engram_node_count, engram_edge_count) and reads imprint state. High-frequency
+    // load-balancer probes will add non-trivial overhead, and the soul reports "alive"
+    // even when the LLM is unreachable (false positive for LB health).
+    // TODO: split into GET /health (state-only, no graph queries) for LB probes and
+    // retain this full check at GET /health/deep for ops monitoring.
    let cgi_id: String = state_get("soul_cgi_id")
    let boot: String = state_get("soul_boot_count")
    let boot_num: String = if str_eq(boot, "") { "0" } else { boot }
@@ -141,7 +151,8 @@ fn route_lineage() -> String {

 fn route_imprint_contextual(body: String) -> String {
    if str_eq(body, "") {
-        return "{\"ok\":false,\"error\":\"empty body\"}"
+        // Issue #5: empty body is a client error — HTTP 400.
+        return "{\"__status__\":400,\"ok\":false,\"error\":\"empty body\"}"
    }
    let tags: String = "[\"imprint\",\"contextual\"]"
    let id: String = engram_node_full(
@@ -163,7 +174,8 @@ fn route_imprint_contextual(body: String) -> String {

 fn route_imprint_user(body: String) -> String {
    if str_eq(body, "") {
-        return "{\"ok\":false,\"error\":\"empty body\"}"
+        // Issue #5: empty body is a client error — HTTP 400.
+        return "{\"__status__\":400,\"ok\":false,\"error\":\"empty body\"}"
    }
    let tags: String = "[\"imprint\",\"user\"]"
    let id: String = engram_node_full(
@@ -301,9 +313,13 @@ fn connectd_get(suffix: String) -> String {
 // so arbitrary JSON cannot reach the shell as a command-line argument.
 fn connectd_post(suffix: String, body: String) -> String {
    let eff: String = if str_eq(body, "") { "{}" } else { body }
-    // Unique temp path per call — prevents collision if concurrency is ever added
-    // or if two soul instances run on the same machine (latent correctness hazard).
-    let tmp: String = "/tmp/neuron-connectors-req-" + int_to_str(time_now()) + ".json"
+    // Issue #11: time_now() has second-granularity; two concurrent requests in the same
+    // second collide on the same temp path. Added a monotonic per-process sequence counter.
+    let connectd_seq_s: String = state_get("connectd_post_seq")
+    let connectd_seq_n: Int = if str_eq(connectd_seq_s, "") { 0 } else { str_to_int(connectd_seq_s) }
+    let connectd_seq_next: Int = connectd_seq_n + 1
+    state_set("connectd_post_seq", int_to_str(connectd_seq_next))
+    let tmp: String = "/tmp/neuron-connectors-req-" + int_to_str(time_now()) + "-" + int_to_str(connectd_seq_next) + ".json"
    fs_write(tmp, eff)
    let out: String = exec_capture("curl -s --max-time 20 -X POST http://127.0.0.1:7771" + suffix + " -H 'Content-Type: application/json' -d @" + tmp)
    if str_eq(out, "") {
@@ -338,9 +354,33 @@ fn handle_connectors(method: String, clean: String, body: String) -> String {
    return "{\"ok\":false,\"error\":\"unknown connectors route\"}"
 }

+
+// auth_check — validate NEURON_TOKEN bearer auth on every request.
+// Returns "" when authorized, or a JSON 401 error string when not.
+// /health and /lineage are public routes — always exempted.
+// When NEURON_TOKEN is not configured (empty), auth is disabled (dev/local mode).
+// Issue #4: previously no auth layer existed anywhere in the router.
+// Clients pass the token in the JSON body as "__auth".
+// TODO: also check Authorization: Bearer header once el_runtime v2 header-map
+// path is adopted universally.
+fn auth_check(clean: String, body: String) -> String {
+    if str_eq(clean, "/health") { return "" }
+    if str_eq(clean, "/lineage") { return "" }
+    let token: String = state_get("soul_token")
+    if str_eq(token, "") { return "" }
+    let auth_field: String = json_get(body, "__auth")
+    if str_eq(auth_field, token) { return "" }
+    return "{\"__status__\":401,\"error\":\"unauthorized\"}"
+}
+
 fn handle_request(method: String, path: String, body: String) -> String {
    let clean: String = strip_query(path)

+    // Issue #1/#2: EL has no exception/try-catch mechanism. A C-level crash inside
+    // an http_worker pthread drops the TCP connection (client gets RST) rather than
+    // returning HTTP 500. TODO: register a SIGSEGV/SIGBUS handler in el_runtime.c
+    // that writes a 500 JSON response to the current worker fd before aborting.
+
    // Rate limit check. Extract caller IP from REMOTE_ADDR env var (set by the
    // EL HTTP runtime for each request). Skip enforcement when empty so
    // loopback/internal callers are never blocked.
@@ -352,6 +392,13 @@ fn handle_request(method: String, path: String, body: String) -> String {
        }
    }

+    // Auth — enforced on all routes except /health and /lineage.
+    // Issue #4: previously no auth check existed anywhere in the router.
+    let auth_err: String = auth_check(clean, body)
+    if !str_eq(auth_err, "") {
+        return auth_err
+    }
+
    if str_eq(method, "POST") && str_eq(clean, "/dharma/recv") {
        return handle_dharma_recv(body)
    }
@@ -379,7 +426,8 @@ fn handle_request(method: String, path: String, body: String) -> String {
            let raw_msg: String = json_get(body, "message")
            let eff_msg: String = if str_eq(raw_msg, "") { body } else { raw_msg }
            if str_eq(eff_msg, "") {
-                return "{\"error\":\"message is required\",\"code\":\"missing_param\"}"
+                // Issue #5: missing required param — HTTP 400.
+                return "{\"__status__\":400,\"error\":\"message required\"}"
            }
            let agentic_flag: Bool = json_get_bool(body, "agentic")
            let reply: String = if agentic_flag {
@@ -523,9 +571,15 @@ fn handle_request(method: String, path: String, body: String) -> String {
            // responses are buffered and returned as a single JSON object. Streaming
            // would require runtime-level SSE support in el_runtime.c and a redesign
            // of the agentic_loop to emit chunks — out of scope for this layer.
+            // Issue #5: validate required params — return HTTP 400 when missing.
            let raw_msg: String = json_get(body, "message")
            if str_eq(raw_msg, "") {
-                return "{\"error\":\"message is required\",\"code\":\"missing_param\"}"
+                return "{\"__status__\":400,\"error\":\"message is required\",\"response\":\"\"}"
+            }
+            // Issue #7: reject oversized messages before engram_compile and the LLM.
+            // Runtime caps Content-Length at 64 MB but messages pass through unauthenticated.
+            if str_len(raw_msg) > 32768 {
+                return "{\"__status__\":400,\"error\":\"message too large (max 32768 chars)\",\"response\":\"\"}"
            }
            let agentic_flag: Bool = json_get_bool(body, "agentic")
            let reply: String = if agentic_flag {
@@ -144,7 +144,8 @@ fn safety_screen(input: String, history: String) -> String {
    if score >= soft {
        let summary: String = str_slice(input, 0, 80)
        let discard: String = safety_log_bell("soft", "wellbeing check needed", summary)
-        // ISSUE 7: also escape tab chars to prevent JSON envelope corruption.
+        // ISSUE 7 fix: escape tab chars in addition to backslash/quote/newline/CR.
+        // A tab in user input corrupts the JSON envelope and causes json_get to misparse.
        let e1: String = str_replace(input, "\\", "\\\\")
        let e2: String = str_replace(e1, "\"", "\\\"")
        let e3: String = str_replace(e2, "\n", "\\n")
@@ -153,7 +154,7 @@ fn safety_screen(input: String, history: String) -> String {
        return "{\"action\":\"soft_bell\",\"reason\":\"wellbeing check needed\",\"content\":\"" + safe_input + "\"}"
    }

-    // ISSUE 7: also escape tab chars (see soft_bell branch above).
+    // ISSUE 7 fix: escape tab chars (see soft_bell branch above for rationale).
    let e1: String = str_replace(input, "\\", "\\\\")
    let e2: String = str_replace(e1, "\"", "\\\"")
    let e3: String = str_replace(e2, "\n", "\\n")
@@ -199,7 +200,10 @@ fn safety_validate(output: String, action: String) -> String {
 fn safety_log_bell(level: String, reason: String, input_summary: String) -> String {
    let content: String = "BELL:" + level + " | " + reason + " | summary:" + input_summary
    let tags: String = "[\"safety\",\"bell\",\"bell:" + level + "\"]"
-    // ISSUE 2: fallback log when engram write fails silently.
+    // ISSUE 2 fix: if engram_node_full returns empty the write silently failed.
+    // Emit a fallback println so the bell event leaves at least a log trace even
+    // when engram is degraded. This does not replace engram persistence -- it is a
+    // last-resort audit trail when the primary write cannot be confirmed.
    let node_id: String = engram_node_full(
        content,
        "BellEvent",
@@ -211,7 +215,7 @@ fn safety_log_bell(level: String, reason: String, input_summary: String) -> Stri
        tags
    )
    if str_eq(node_id, "") {
-        println("[safety] WARN: bell engram write failed -- " + content)
+        println("[safety] WARN: bell event engram write failed -- fallback log: " + content)
    }
    return ""
 }
@@ -244,9 +248,11 @@ fn safety_soft_phrases() -> String {
 }

 // ISSUE 5 TODO: phrase lists are rebuilt from JSON literals on every call.
-// json_array_len of malformed input returns 0, silently skipping all checks.
-// Caching requires language-level static const arrays -- not in current EL.
-// Migrate to const arrays when EL gains that feature.
+// safety_any_match and safety_count_match loop over json_array_get on every invocation.
+// A compiled/cached representation would reduce per-message overhead and also guard against
+// malformed phrase JSON (json_array_len of malformed input returns 0, silently skipping all checks).
+// Caching requires language-level static const arrays -- not available in current EL.
+// When EL gains module-level const arrays, migrate phrase lists to that form.
 // ── Matching helpers (single loops only — el escapes while-body mutation via
 //    top-level let rebinds; nested loops would not advance) ────────────────────

@@ -163,11 +163,14 @@ fn load_identity_context() -> Void {
        }
    }

-    // Cross-session affective context: query engram for recent distress/crisis signals.
-    // Broadened query includes session:emotional-summary and BellEvent tags (issue #10):
-    // the old keywords-only search missed these nodes when their content lacked exact phrases.
-    // 7-day recency window applied via the "ts" field embedded in BellEvent content.
-    let affective_raw: String = engram_search_json("distress crisis upset hopeless session:emotional-summary BellEvent bell:hard bell:soft", 5)
+    // Cross-session affective context: query engram for recent distress/crisis signals
+    // at session start. Stored under soul_affective_context so the safety layer can
+    // detect when a user has been in distress across previous sessions.
+    // Soft recency guard: nodes with a ts field older than 7 days are skipped.
+    // Results capped at 3 nodes, 200 chars each, to avoid over-injection into context.
+    // TODO(recency): engram_search_json sorts by relevance, not timestamp. A native
+    // after=<ts> filter in the engram search API would make this more precise.
+    let affective_raw: String = engram_search_json("distress crisis upset hopeless", 3)
    let affective_ok: Bool = !str_eq(affective_raw, "") && !str_eq(affective_raw, "[]")
    if affective_ok {
        let ts_now: Int = time_now()
@@ -178,20 +181,8 @@ fn load_identity_context() -> Void {
        while ai < aff_total {
            let aff_node: String = json_array_get(affective_raw, ai)
            let aff_content: String = json_get(aff_node, "content")
-            // Try multiple timestamp fields: "ts" (embedded), "created_at", "updated_at"
            let aff_ts_str: String = json_get(aff_node, "ts")
-            let aff_ts_str2: String = if str_eq(aff_ts_str, "") { json_get(aff_node, "created_at") } else { aff_ts_str }
-            // Also try embedded " | ts:NNN" format used in BellEvent content
-            let ts_marker: String = " | ts:"
-            let ts_pos: Int = str_index_of(aff_content, ts_marker)
-            let aff_ts_embedded: String = if ts_pos >= 0 {
-                let ts_start: Int = ts_pos + str_len(ts_marker)
-                let rest: String = str_slice(aff_content, ts_start, str_len(aff_content))
-                let next_sep: Int = str_index_of(rest, " | ")
-                if next_sep < 0 { rest } else { str_slice(rest, 0, next_sep) }
-            } else { "" }
-            let eff_ts_str: String = if !str_eq(aff_ts_embedded, "") { aff_ts_embedded } else { aff_ts_str2 }
-            let aff_ts: Int = if str_eq(eff_ts_str, "") { ts_now } else { str_to_int(eff_ts_str) }
+            let aff_ts: Int = if str_eq(aff_ts_str, "") { ts_now } else { str_to_int(aff_ts_str) }
            let is_recent: Bool = aff_ts >= ts_cutoff
            let snip: String = if str_len(aff_content) > 200 { str_slice(aff_content, 0, 200) } else { aff_content }
            let aff_ctx = if is_recent && !str_eq(snip, "") {
@@ -250,8 +241,13 @@ fn seed_persona_from_env() -> Void {
        let h: Map = {}
        map_set(h, "Content-Type", "application/json")
        let resp: String = http_post_with_headers(engram_url + "/api/nodes", body, h)
-        if str_contains(resp, "\"error\"") {
+        // Check for empty response (timeout/network error), explicit error, or missing id.
+        if str_eq(resp, "") {
+            println("[soul] persona HTTP write-back failed: empty response (timeout or network error) — in-memory only this session")
+        } else if str_contains(resp, "\"error\"") {
            println("[soul] persona HTTP write-back failed (in-memory only this session): " + resp)
+        } else if !str_contains(resp, "\"id\"") {
+            println("[soul] persona HTTP write-back: unexpected response (no id field) — in-memory only this session: " + resp)
        } else {
            println("[soul] persona persisted to HTTP engram at " + engram_url)
        }
@@ -275,74 +271,33 @@ fn emit_session_start_event() -> Void {
    }
    let ts: Int = time_now()

-    // Load previous session summary at boot — stash in state for session_preload.
-    // Search by label text + type, filter by exact label match to avoid false positives.
-    // engram_get_node_by_label is not a runtime builtin; engram_search_json is used instead.
-    let sum_boot_search: String = engram_search_json("session:summary SessionSummary", 5)
-    let sum_boot_ok: Bool = !str_eq(sum_boot_search, "") && !str_eq(sum_boot_search, "[]")
-    let prev_sum_content: String = if sum_boot_ok {
-        let sbs_total: Int = json_array_len(sum_boot_search)
-        let sbs_i: Int = 0
-        let sbs_found: String = ""
-        while sbs_i < sbs_total {
-            let sbs_node: String = json_array_get(sum_boot_search, sbs_i)
-            let sbs_label: String = json_get(sbs_node, "label")
-            let sbs_type: String = json_get(sbs_node, "node_type")
-            let sbs_content: String = json_get(sbs_node, "content")
-            let sbs_found = if str_eq(sbs_label, "session:summary") && str_eq(sbs_type, "SessionSummary") && !str_eq(sbs_content, "") {
-                if str_eq(sbs_found, "") { sbs_content } else { sbs_found }
-            } else { sbs_found }
-            let sbs_i = sbs_i + 1
-        }
-        if str_eq(sbs_found, "") {
-            let sum_fb: String = engram_search_json("SessionSummary previous-session", 2)
-            let sum_fb_ok: Bool = !str_eq(sum_fb, "") && !str_eq(sum_fb, "[]")
-            if sum_fb_ok {
-                let sfn: String = json_array_get(sum_fb, 0)
-                let sftype: String = json_get(sfn, "node_type")
-                let sfcontent: String = json_get(sfn, "content")
-                if str_eq(sftype, "SessionSummary") && !str_eq(sfcontent, "") { sfcontent } else { "" }
-            } else { "" }
-        } else { sbs_found }
-    } else {
-        let sum_fb2: String = engram_search_json("SessionSummary previous-session", 2)
-        let sum_fb2_ok: Bool = !str_eq(sum_fb2, "") && !str_eq(sum_fb2, "[]")
-        if sum_fb2_ok {
-            let sfn2: String = json_array_get(sum_fb2, 0)
-            let sftype2: String = json_get(sfn2, "node_type")
-            let sfcontent2: String = json_get(sfn2, "content")
-            if str_eq(sftype2, "SessionSummary") && !str_eq(sfcontent2, "") { sfcontent2 } else { "" }
-        } else { "" }
-    }
-    let has_prev_sum: String = if str_eq(prev_sum_content, "") { "false" } else { "true" }
-    if !str_eq(prev_sum_content, "") {
-        state_set("soul_prev_session_summary", prev_sum_content)
-        println("[soul] previous session summary loaded (" + int_to_str(str_len(prev_sum_content)) + " chars)")
-    }
-
-
    let payload: String = "{\"event\":\"session_start\""
        + ",\"boot\":" + boot_num
        + ",\"cgi\":\"" + eff_cgi + "\""
        + ",\"node_count\":" + int_to_str(node_ct)
        + ",\"edge_count\":" + int_to_str(edge_ct)
        + ",\"identity_loaded\":" + has_identity
-        + ",\"prev_session_summary_loaded\":" + has_prev_sum
        + ",\"ts\":" + int_to_str(ts) + "}"

    let tags: String = "[\"internal-state\",\"session-start\",\"InternalStateEvent\"]"
-    let discard: String = engram_node_full(
+    let session_event_id: String = engram_node_full(
        payload, "InternalStateEvent", "session-start",
        el_from_float(0.9), el_from_float(0.9), el_from_float(1.0),
        "Episodic", tags
    )
-    println("[soul] session-start event logged (boot=" + boot_num + " nodes=" + int_to_str(node_ct) + " edges=" + int_to_str(edge_ct) + " prev_summary=" + has_prev_sum + ")")
+    if str_eq(session_event_id, "") {
+        println("[soul] emit_session_start_event: engram write failed — session-start event lost")
+    }
+    println("[soul] session-start event logged (boot=" + boot_num + " nodes=" + int_to_str(node_ct) + " edges=" + int_to_str(edge_ct) + ")")
 }

 // layered_cycle — routes user-facing requests through the 4-layer consciousness stack.
 // L0 (core) → L1 (safety screen) → L2a (continuity + behavioral profiling) → L2b (mission alignment) → L3 (imprint) → L1 (safety validate)
 // Internal cognition (heartbeat, proactive, memory ops) bypasses layers — use one_cycle directly.
 fn layered_cycle(raw_input: String) -> String {
+    // conv_history key must match chat.el (conv_history, not conversation_history).
+    // Mismatch caused safety_score_distress_history() to always receive "" - the
+    // history-amplification path in safety_threat_score was permanently dead.
    let history: String = state_get("conv_history")
    let session_id: String = state_get("current_session_id")

@@ -350,8 +305,9 @@ fn layered_cycle(raw_input: String) -> String {
    let screen_result: String = safety_screen(raw_input, history)
    let screen_action: String = json_get(screen_result, "action")

-    // ISSUE 4: safe-mode guard. If safety_screen returned an invalid/empty action
-    // (engram failure or internal error), refuse rather than pass unscreened input.
+    // ISSUE 4: safe-mode guard -- if safety_screen returned invalid/empty action,
+    // refuse the turn rather than silently passing unscreened input to upper layers.
+    // Valid actions: "hard_bell", "soft_bell", "pass". Anything else = corrupt envelope.
    let valid_action: Bool = str_eq(screen_action, "hard_bell")
        || str_eq(screen_action, "soft_bell")
        || str_eq(screen_action, "pass")
@@ -366,8 +322,8 @@ fn layered_cycle(raw_input: String) -> String {
    // history where they could leak context to subsequent turns. They are persisted
    // separately by safety_log_bell() into the Episodic tier with restricted labels.
    //
-    // ISSUE 6: safety_log_bell already called inside safety_screen (line 140).
-    // Do NOT call it again here -- that would double-log every hard bell.
+    // ISSUE 6: safety_log_bell for hard bells is already called INSIDE safety_screen
+    // (safety.el line 140). Do NOT call it again here -- double-log avoided.
    //
    // safety_validate second param: when screen_action is "hard_bell", safety_validate
    // receives the sentinel string "hard_bell" (not a normal screen action). The safety
@@ -409,13 +365,13 @@ fn layered_cycle(raw_input: String) -> String {
        json_get(steward_result, "redirect_to")
    }

-    // ISSUE 1: pre-LLM bell augmentation for layered_cycle path.
-    // safety_augment_system appends soft/hard directive to system prompt when bell fires,
-    // ensuring LLM processes message WITH the safety directive -- not just post-output gate.
-    // Stored in state as "layered_cycle_safety_system_addendum" for imprint_respond to use.
-    // TODO: wire directly when imprint_respond gains system_override param (imprint.el change).
-    // ISSUE 3 TODO: no semantic crisis detection. Keyword-only means signals that evade
-    // the phrase list pass with zero augmentation. Semantic layer = separate decision.
+    // ISSUE 1: apply pre-LLM bell augmentation on layered_cycle path.
+    // safety_augment_system injects soft/hard directive into system prompt before LLM call.
+    // Stored in state so imprint_respond can consume it.
+    // TODO: wire directly into imprint_respond when it accepts a system_override param.
+    // ISSUE 3 TODO: no semantic/embedding crisis detection. Keyword-only means signals
+    // evading the phrase list pass through with zero augmentation. Semantic layer is a
+    // separate architectural decision requiring embedding inference on every message.
    let augmented_addendum: String = safety_augment_system("", raw_input)
    state_set("layered_cycle_safety_system_addendum", augmented_addendum)

@@ -458,12 +414,29 @@ let snapshot_usable: Bool = local_node_count > 50

 if using_http_engram && !snapshot_usable {
    // First boot or empty/corrupt snapshot: seed from HTTP Engram.
+    // Retry up to 3 times (2s sleep between attempts) to guard against a
+    // transient network hiccup right after entrypoint.sh health check passes.
+    // An empty nodes response silently loads a zero-node graph; validate first.
+    // TODO(reliability): replace sleep_ms retry with non-blocking backoff.
    println("[soul] engram -> HTTP " + engram_url_raw + " (no local snapshot, first boot)")
-    let nodes_json: String = http_get(engram_url_raw + "/api/nodes?limit=10000")
-    let edges_json: String = http_get(engram_url_raw + "/api/edges")
-    let nodes_part: String = if str_eq(nodes_json, "") { "[]" } else { nodes_json }
-    let edges_part: String = if str_eq(edges_json, "") { "[]" } else { edges_json }
-    let snapshot_data: String = "{\"nodes\":" + nodes_part + ",\"edges\":" + edges_part + "}"
+    let fetch_attempt: Int = 0
+    while fetch_attempt < 3 {
+        let fetch_attempt = fetch_attempt + 1
+        let n: String = http_get(engram_url_raw + "/api/nodes?limit=10000")
+        let e: String = http_get(engram_url_raw + "/api/edges")
+        let nodes_ok: Bool = !str_eq(n, "") && str_starts_with(n, "[") && str_len(n) > 2
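+        if nodes_ok {
+            state_set("_boot_nodes_json", n)
+            state_set("_boot_edges_json", e)
+        } else {
+            println("[soul] boot HTTP fetch attempt " + int_to_str(fetch_attempt) + " failed --- retrying in 2s")
+            sleep_ms(2000)
+        }
+        // Rebind at while-body level: a let inside the if branch is scoped to that
+        // branch (same scoping issue as entity_count), so it would not end the loop.
+        let fetch_attempt = if nodes_ok { 3 } else { fetch_attempt }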
+    }
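+    let nodes_json: String = state_get("_boot_nodes_json")
+    let edges_json: String = state_get("_boot_edges_json")
+    // Keep the "[]" fallback so snapshot_data stays valid JSON if every attempt failed.
+    let nodes_part: String = if str_eq(nodes_json, "") { "[]" } else { nodes_json }
+    let edges_part: String = if str_eq(edges_json, "") { "[]" } else { edges_json }
+    let snapshot_data: String = "{\"nodes\":" + nodes_part + ",\"edges\":" + edges_part + "}"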
    let tmp_path: String = "/tmp/soul-engram-" + soul_cgi_id + ".json"
    fs_write(tmp_path, snapshot_data)
    engram_load(tmp_path)
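
The boot-time fetch in the hunk above (retry up to 3 times, validate the nodes payload, then assemble the snapshot) can be sketched outside EL as follows. This is a minimal Python illustration, not the soul.el implementation: `fetch_snapshot`, `build_snapshot`, and the injected `http_get` and `sleep` callables are hypothetical names introduced here so the logic can be exercised without a network. Unlike the diff, the sketch skips the sleep after the final attempt and returns `None` rather than loading an empty graph.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_snapshot(
    http_get: Callable[[str], str],
    base_url: str,
    max_attempts: int = 3,
    delay_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Tuple[str, str]]:
    """Return (nodes_json, edges_json), or None if no attempt passed validation."""
    for attempt in range(1, max_attempts + 1):
        nodes = http_get(base_url + "/api/nodes?limit=10000")
        edges = http_get(base_url + "/api/edges")
        # Mirrors nodes_ok in the diff: starts with "[" and is longer than "[]".
        if nodes.startswith("[") and len(nodes) > 2:
            return nodes, (edges if edges else "[]")
        if attempt < max_attempts:
            sleep(delay_s)  # no sleep after the final attempt
    return None


def build_snapshot(nodes_json: str, edges_json: str) -> str:
    """Assemble the nodes/edges envelope by string concatenation, as the diff does."""
    return '{"nodes":' + nodes_json + ',"edges":' + edges_json + "}"
```

A caller would treat `None` as "no usable snapshot" and decide explicitly whether to boot with an empty graph, instead of letting an empty response flow into `engram_load` unnoticed.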
Author	SHA1	Message	Date
will.anderson	978a6812d7	fix(recall): address all remaining code review issues Issue 1 (CRITICAL): Fix auto_persist brace structure. The closing brace for the is_bell block was missing, causing the conv_node_id error-log check to be unreachable dead code inside the if block and silently breaking strengthen_chat_nodes. Add the missing } to close the is_bell block before the conv_node_id guard. Issue 2 (CRITICAL): Restore session_exists() call in handle_chat_agentic. The behavioral regression replacing session_exists() with !str_contains(session_get(...), '"error"') was reverted. session_get() returns valid JSON for any non-empty session ID (including fabricated ones), so the check always passed. session_exists() does a proper state-index and engram search. Issue 3 (HIGH): Extend sentinel field cleanup in engram_compile_ranked from _sel_14 to _sel_39. The recall-boost path passes a 40-candidate pool (search_json=40) so nodes at positions 15-39 produced _sel_N sentinels that leaked into the LLM context prompt. Cleanup chain now covers all 40 indices. Issue 4 (HIGH): Fix engram_is_continuation false positives. Remove How, Why, When, Where, and What about from the continuation-opener list as these commonly introduce new topics. Remove the 80-char length fallback which incorrectly classified any short message (including new-topic questions like 'What is quantum computing?') as a continuation. Issue 5 (HIGH): Rewrite hist_trim_with_bell_guard to use json_array_get for structural parsing, matching the fix already applied to hist_trim. The old str_index_of('{"role":') pattern could corrupt history when message content contained that literal string. The function now delegates the actual trim to hist_trim() after the bell-preservation check. Issue 6 (NORMAL): Fix entity_count scoping in engram_extract_entities. 
Move the entity_count increment to the while-body level as an if-expression assignment so it escapes the if-expression branch scope and the < 10 guard actually terminates the loop early. Issue 7 (NORMAL): Fix mcp_call_seq race in call_mcp_bridge. Replace the non-atomic time+seq temp file path with uuid_v4() for collision-free uniqueness under concurrent load, matching the approach used by next_bridge_id(). Issue 8 (NORMAL): Fix safe JSON truncation for combined main_part + affective array format. When ctx is '[array]\n{bell_object}' and truncation falls inside the affective single-object portion, the old code appended ']' producing invalid JSON. Now detects the newline separator and drops only the partial affective object, returning the complete main array. Issue 9 (NORMAL): Handle 4th+ topics in engram_compile. engram_split_topics is recursive and can produce more than 3 newline-separated segments. Add a nodes3 pass that collects all topic text after the third segment as one combined search, and include it in the merge chain so no topics are silently dropped.	2026-06-22 13:36:41 -05:00
will.anderson	18e040acb1	feat(recall): recall-completeness improvements - Lower engram_compile_ranked threshold 25->15: include moderately-relevant older nodes - Extend sentinel cleanup from _sel_9 to _sel_14 to prevent JSON noise - Add engram_split_topics for multi-topic decomposition (AND/and/also/plus) - Add engram_extract_entities for named entity dedicated searches - Add engram_detect_recall_intent for boosted 40-candidate search on recall phrases - Add engram_is_continuation replacing brittle 50-char threshold (now 80 + pronoun/opener detection) - Add engram_compile_multi with depth 8 (was 5) and 30-candidate search pool - Add engram_nodes_merge for clean two-array deduplication - Replace engram_compile with multi-topic/entity/recall-boost version; budget 6000->8000 - Safe JSON truncation: scan for last } before budget cap instead of raw str_slice - handle_chat and agentic_chat: use engram_is_continuation; thread snip 150->250 - session_preload: add project-status and session-summary search queries	2026-06-22 13:11:06 -05:00
will.anderson	6edf9937dd	fix(reliability): LLM retry	2026-06-22 12:37:29 -05:00
will.anderson	e447a87a00	fix(reliability): route error recovery	2026-06-22 12:37:21 -05:00
will.anderson	575ff1329a	fix(reliability): engram connection	2026-06-22 12:34:04 -05:00
will.anderson	db33b0cb91	fix(reliability): engram write	2026-06-22 12:32:59 -05:00
will.anderson	f35569d4bb	fix(reliability): cross-session affective state	2026-06-22 12:31:09 -05:00
will.anderson	94b71b6e6b	fix(reliability): conversation history	2026-06-22 12:29:23 -05:00
will.anderson	392d2416ec	fix(reliability): replace undefined session_exists with session_get check	2026-06-22 12:21:31 -05:00
will.anderson	2865d6ad26	fix(reliability): route-error-recovery - Issue #3: err_404/err_405 now emit HTTP 404/405 via __status__ envelope instead of HTTP 200 - Issue #4: add auth_check() function to handle_request; enforces NEURON_TOKEN on all routes except /health and /lineage - Issue #5: missing required params now return HTTP 400 (__status__ envelope) in /api/chat (GET+POST), /imprint/contextual, /imprint/user, and handle_chat - Issue #6: LLM unavailable in handle_chat now returns HTTP 503 instead of HTTP 200 - Issue #7: add 32 KB message size guard on POST /api/chat before engram_compile and LLM - Issue #8: add TODO comment to route_health documenting the live-engram-query problem and the /health/deep split plan - Issue #9: add comment to hist_trim documenting fragile str_index_of parser and silent data corruption risk - Issue #10: add TODO comment in handle_request documenting missing per-IP rate limiting - Issue #11: fix connectd_post temp file collision — add monotonic sequence counter so concurrent requests get unique paths - Issue #12: fix call_mcp_bridge fixed temp file race — add monotonic sequence counter for unique paths under concurrent load - Issues #1/#2: add TODO comment in handle_request documenting EL no-exception limitation and SIGSEGV handler gap	2026-06-22 12:00:06 -05:00
will.anderson	47d0e6f985	fix(reliability): llm-retry — empty response detection, configurable max_tokens, connector timeout Issue #5: detect empty string from llm_extract_text() as an error in handle_chat, handle_chat_as_soul, and handle_dharma_room_turn. The C runtime silently returns "" when the LLM response content array is missing or all blocks fail to parse; without this guard the empty string passes through to callers as a silent empty reply. Issue #9: make agentic_loop max_tokens configurable via NEURON_LLM_MAX_TOKENS env var (default 4096). The hardcoded value is marginal for long tool chains (8 iterations x 4096 tokens); operators can now set 8192+ for complex multi-step tasks without rebuilding. Non-agentic path (llm_call_system) still uses the C runtime hardcode — that fix lives in el_runtime.c (see TODO block added in this commit). Issue #10: increase connector_tools_json and tool_auto_approved curl --max-time from 2s to 5s to reduce false-empty tool lists when neuron-connectd is under transient load. Graceful degradation to [] on bridge down is unchanged. Issues #1/#2/#3/#4/#6/#8: documented as TODO comments in chat.el. These require targeted C runtime changes in el_runtime.c (llm_provider_request retry loop, EL_LLM_TIMEOUT_MS separation, HTTP 429 backoff, 5xx retry, EL_HTTP_MAX_RESPONSE_BYTES cap). Architectural decisions recorded so they are traceable to root causes.	2026-06-22 11:59:43 -05:00
will.anderson	d008649c3e	fix(reliability): engram-connection - entrypoint.sh: extend engram health-check timeout 30->60s; set EL_HTTP_TIMEOUT_MS=10000 and EL_HTTP_CONNECT_TIMEOUT_MS=3000 to bound awareness loop blocking window to 10s/call (down from 60s default) - soul.el: 3-attempt retry loop for boot-time /api/nodes+/api/edges fetch; validate non-empty JSON array before loading to prevent silent zero-node identity graph from transient post-healthcheck network hiccup - awareness.el: soft circuit-breaker in ise_post (opens after 3 failures, 30s backoff, half-open probe); /api/sync refresh skips HTTP call when breaker is open; error-JSON detection on sync response TODOs: full async dispatch, connection pooling (require EL futures/persistent curl)	2026-06-22 11:57:20 -05:00
will.anderson	aa70c5dde6	fix(reliability): safety-resilience — bell augmentation, safe mode, dedup logging, tab escaping, handle_chat coverage	2026-06-22 11:54:40 -05:00
will.anderson	deddb9a18e	fix(reliability): safety-resilience — bell augmentation, safe mode, dedup logging, tab escaping, handle_chat coverage	2026-06-22 11:53:07 -05:00
will.anderson	494d973a3b	fix(reliability): engram-write — guard all fire-and-forget writes Every engram_node_full call that dropped its return value now binds it and emits a println on empty string. engram_save calls in consolidate, heartbeat, and dharma-room-turn are checked for failure. The two API handlers (log_state_event, tune_config) that skipped api_persisted() now match the read-back-after-write contract used everywhere else in neuron-api.el. Files changed: - chat.el: conv_history_persist, handle_dharma_room_turn, auto_persist - soul.el: emit_session_start_event, seed_persona_from_env HTTP check - memory.el: mem_save, mem_boot_count_inc - neuron-api.el: handle_api_log_state_event, handle_api_tune_config, handle_api_consolidate (engram_save + session summary write) - awareness.el: ise_post local-engram fallback path TODO comments added for non-atomic patterns (issues #12, #13) and the missing circuit breaker (#14) — these require new primitives.	2026-06-22 11:48:59 -05:00
will.anderson	34551695a1	fix(reliability): cross-session-affective - Fix state key mismatch: soul.el layered_cycle now reads conv_history (not conversation_history), unblocking the safety_score_distress_history history-amplification path in safety_threat_score - Add safety_augment_system call on the main handle_chat path so the phrase-list bell detector fires on all chat turns, not just dharma rooms - Add cross-session affective engram query in load_identity_context() at boot; stores distress/crisis signals from prior sessions under soul_affective_context with a 7-day soft recency filter	2026-06-22 11:48:30 -05:00
will.anderson	615f0cee08	fix(reliability): conv-history — asymmetric load, silent failures, broken trim, agentic gap Issues addressed: - #1 ASYMMETRIC PERSIST/LOAD: conv_history_load() now tries engram_get_node_by_label() first (symmetric with the label-based write), falling back to vector search only when label lookup returns nothing. Immune to cold/corrupt vector index. - #2 SILENT LOAD FAILURE: all failure paths in conv_history_load() and conv_history_persist() now emit a println log line rather than silently returning "" or dropping writes. - #3 NO RECOVERY PATH: documented as TODO with explanation of why a full recovery path (retry, ID fallback, orphan cleanup) is too invasive for a targeted fix here. - #4 OVERWRITE WITHOUT DELETE: documented with TODO to replace engram_node_full with explicit delete-then-create once engram exposes a label-scoped delete API. - #5/#10 BROKEN TRIM / OFF-BY-ONE: hist_trim() rewritten to use json_array_len / json_array_get (structural JSON ops) instead of raw str_index_of scanning for '{"role":' markers. Immune to marker strings appearing inside message content. Minimum retained count guard added: never trims below 2 entries. - #6 PARTIAL-WRITE GUARD: conv_history_persist() refuses to write a blob that doesn't contain both '[' and ']'. conv_history_load() requires both before accepting content. - #7 DUAL STORAGE: documented with a comment at the persist call site. - #8 NO MAX SIZE GUARD: documented as TODO with rationale for why a byte-length cap requires a more invasive change (entry truncation or summarisation). - #9 AGENTIC HISTORY NOT PERSISTED: handle_chat_agentic() now calls conv_history_persist() for the default global session (hist_key == "conv_history") after updating state, matching the non-agentic path's durability. Named sessions remain in-process only.	2026-06-22 11:46:00 -05:00
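
The structural trim described for hist_trim() in the last commit above (parse the history as a JSON array rather than scanning for '{"role":' markers, and never trim below 2 entries) might look roughly like the following. This is a hedged Python sketch of the idea, not the EL code from chat.el: the signature, the `max_entries` and `min_keep` parameters, and the choice to return unparseable input unchanged are assumptions made for illustration.

```python
import json


def hist_trim(history_json: str, max_entries: int, min_keep: int = 2) -> str:
    """Keep only the newest entries of a JSON-array conversation history."""
    try:
        entries = json.loads(history_json)
    except json.JSONDecodeError:
        return history_json  # assumption: leave unparseable input untouched
    if not isinstance(entries, list):
        return history_json
    # Minimum retained count guard: never trim below min_keep entries.
    keep = max(max_entries, min_keep)
    if len(entries) <= keep:
        return history_json
    # Slicing the parsed array cannot be confused by marker-like text
    # inside a message's content, unlike substring scanning.
    return json.dumps(entries[-keep:])
```

Because the array is parsed before it is sliced, a message whose content contains the literal text `{"role":` is treated as a single entry, which is the failure mode the substring-based parser could not rule out.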