perf: 81% RSS reduction in elc compiler #4

2026-05-06T01:40:01Z

will.anderson commented

2026-05-06 01:40:01 +00:00

## Summary

Seven rounds of swarm optimization on the El compiler, applied as a clean branch from `add-linux-binaries` (eb52be4).

- **Flat stride-2 token list**: lexer emits `[kind0, val0, kind1, val1, ...]` instead of `[{kind, val}, ...]`, eliminating per-token ElMap allocation (~112B × N tokens)
- **str_char_code hot loop**: character classification via Int codes in lexer, no strdup per character
- **Batch c_escape**: `str_slice` clean ASCII runs instead of per-byte `str_char_at`; only special bytes go through the append path
- **Systematic el_release()**: eagerly frees intermediate parse result maps throughout parser.el after all fields extracted; containers freed as soon as consumed
- **Streaming codegen pipeline**: `codegen_streaming()` parses one declaration at a time, emits C, discards AST; peak memory is O(one function) instead of O(whole program)
- **Per-function and per-statement arena scoping**: `el_arena_push/pop` around each compile unit and each cg_stmt(); intermediate codegen strings freed at boundaries
- **HAVE_CURL guard**: curl-dependent HTTP/LLM/OTLP/Dharma functions behind `#ifdef HAVE_CURL`; elc CLI links without -lcurl, eliminating libcurl SSL/TLS init overhead (~27MB saved)
- **HTML codegen parts-list**: O(n) instead of O(n²) string growth for `cg_html_parts`, `cg_html_attrs_str`, `cg_html_element_str`
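The arena-scoping item is easier to picture with a small sketch. The following is a minimal bump arena with push/pop marks, written for illustration only: `Arena`, `arena_push`, `arena_pop`, `arena_alloc`, and `emit_statements` are hypothetical names and are not the El runtime's `el_arena_push/pop` implementation, whose internals this PR does not show.

```c
#include <assert.h>
#include <stddef.h>

enum { ARENA_BYTES = 1 << 16, ARENA_MAX_DEPTH = 32 };

/* Hypothetical bump arena: a fixed buffer plus a stack of saved offsets. */
typedef struct {
    unsigned char mem[ARENA_BYTES];
    size_t used;
    size_t marks[ARENA_MAX_DEPTH];
    size_t depth;
} Arena;

/* Remember the current offset so everything allocated later can be dropped. */
static void arena_push(Arena *a) {
    assert(a->depth < ARENA_MAX_DEPTH);
    a->marks[a->depth++] = a->used;
}

/* Drop every allocation made since the matching arena_push. */
static void arena_pop(Arena *a) {
    assert(a->depth > 0);
    a->used = a->marks[--a->depth];
}

/* Bump allocation; sizes are rounded up to a multiple of 16 bytes. */
static void *arena_alloc(Arena *a, size_t n) {
    size_t rounded = (n + 15) & ~(size_t)15;
    if (rounded > ARENA_BYTES - a->used) return NULL;
    void *p = a->mem + a->used;
    a->used += rounded;
    return p;
}

/* Stand-in for a codegen loop: each statement gets its own scope, so scratch
   memory from statement i is gone before statement i+1 starts.
   Returns the peak number of arena bytes in use. */
static size_t emit_statements(Arena *a, size_t count, size_t scratch_per_stmt) {
    size_t peak = 0;
    for (size_t i = 0; i < count; i++) {
        arena_push(a);
        (void)arena_alloc(a, scratch_per_stmt);
        if (a->used > peak) peak = a->used;
        arena_pop(a);
    }
    return peak;
}
```

In this sketch the peak depends on the scratch size of one statement and not on the number of statements, which is the property the per-statement scoping is after.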

## Results

| Metric | Before | After | Change |
|--------|--------|-------|--------|
| RSS (web/src/main.el) | 33.4MB | 6.5MB | -80.5% |
| Binary size | 452KB | 318KB | -29.7% |
| Self-host | PASS | PASS | n/a |

## Test plan

- [ ] Self-host: gen2.c == gen3.c (verified during development: PASS)
- [ ] Benchmark: RSS < 7MB on web/src/main.el (measured: 6.5MB)
- [ ] Compile time: ≤ 250ms (measured: ~210ms)
- [ ] Bootstrap binary updated in lang/dist/platform/elc
- [ ] Build with -DHAVE_CURL for production deployments needing HTTP/LLM
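For reviewers unfamiliar with the guard pattern, here is a rough sketch of how a `HAVE_CURL` switch can keep libcurl out of the default build. The function names `http_available` and `http_runtime_init` are hypothetical and do not come from the El runtime; only the `HAVE_CURL` macro is taken from this PR. `curl_global_init` is the real libcurl entry point and is only referenced when the macro is defined.

```c
#include <assert.h>

#ifdef HAVE_CURL
#include <curl/curl.h>
#endif

/* Hypothetical helper: reports whether HTTP support was compiled in. */
static int http_available(void) {
#ifdef HAVE_CURL
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: initializes the HTTP stack when present.
   Returns 0 on success, -1 when unavailable or when init fails. */
static int http_runtime_init(void) {
#ifdef HAVE_CURL
    return curl_global_init(CURL_GLOBAL_DEFAULT) == CURLE_OK ? 0 : -1;
#else
    return -1; /* built without -DHAVE_CURL: no libcurl symbols referenced */
#endif
}
```

Built without `-DHAVE_CURL`, this translation unit references no libcurl symbols, so it links without `-lcurl`. Built with `-DHAVE_CURL -lcurl`, the real initialization path is compiled in.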


will.anderson added 6 commits 2026-05-06 01:40:02 +00:00

beta: replace native_string_chars with str_char_at/str_slice in lexer — 49% memory reduction on large files 2ac11a67b1

round-2-alpha: char code ops in lex() hot loop — eliminate str_char_at allocations 1e67544c88

Replace str_char_at (returns strdup String) with str_char_code (returns Int)
in the main lex() while loop and scan_digits/scan_ident helpers.

For a 400KB combined source, str_char_at was allocating ~400K x 16B = 6.4MB
of transient 2-byte strings for the ch variable alone. str_char_code returns
an integer directly — zero allocation.

Add Int-based helpers: is_digit_code, is_alpha_code, is_ws_code,
is_alnum_or_underscore_code. Rewrite lex() operator dispatch using char
code constants (e.g. '/'=47, '"'=34, '='=61).

Result on main.el: 17.1MB -> 15.4MB peak RSS (-10%).
Self-hosting: PASS.
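A C sketch of the integer-code approach described in this commit, for illustration. The classifier names mirror the helpers listed above, but their bodies are assumptions, and `char_code_at`, `skip_ws`, and `scan_ident_end` are hypothetical stand-ins for `str_char_code` and the scan helpers. ASCII input is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Classifiers over integer character codes (ASCII assumed). */
static bool is_digit_code(int c) { return c >= 48 && c <= 57; }
static bool is_alpha_code(int c) {
    return (c >= 65 && c <= 90) || (c >= 97 && c <= 122);
}
static bool is_ws_code(int c) {
    return c == 32 || c == 9 || c == 10 || c == 13;
}
static bool is_alnum_or_underscore_code(int c) {
    return is_alpha_code(c) || is_digit_code(c) || c == 95;
}

/* Hypothetical stand-in for str_char_code: returns the byte value at
   index i, or -1 past the end. Nothing is allocated. */
static int char_code_at(const char *s, size_t len, size_t i) {
    return i < len ? (unsigned char)s[i] : -1;
}

/* Returns the index of the first non-whitespace byte at or after i. */
static size_t skip_ws(const char *s, size_t len, size_t i) {
    while (is_ws_code(char_code_at(s, len, i))) i++;
    return i;
}

/* Returns the index one past the identifier that starts at `start`. */
static size_t scan_ident_end(const char *s, size_t len, size_t start) {
    size_t i = start;
    while (is_alnum_or_underscore_code(char_code_at(s, len, i))) i++;
    return i;
}
```

Each loop iteration compares integers only, so scanning a source of N bytes performs no per-character allocation, in contrast to a helper that returns a freshly duplicated one-character string.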

round-2-gamma: combine flat token list + char code dispatch — max round-2 savings 1eef9928f4

Combines two orthogonal optimizations:
1. Flat token list (from beta): lex() returns [Any] with alternating kind/value
   pairs instead of [Map], eliminating one ElMap per token (~3 mallocs each).
   Parser updated: tok_kind(t,i) = t[2*i], tok_value(t,i) = t[2*i+1].

2. Char code dispatch (from alpha): lex() hot loop uses str_char_code -> Int
   instead of str_char_at -> strdup String for all character classification.
   Eliminates ~400K x 16B = 6.4MB of temporary string allocations.

scan_digits and scan_ident also updated to use str_char_code.

Result on main.el: 17.1MB -> 14.4MB peak RSS (-16%).
Self-hosting: PASS.
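The stride-2 layout can be sketched in C as follows. This is an illustration under assumptions: `FlatTokens` and `tok_count` are hypothetical, and token kinds and values are modeled as C strings, whereas the El lexer returns `[Any]`. The accessor arithmetic matches the `t[2*i]` / `t[2*i+1]` rule stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flat token list: slot 2*i holds the kind of token i,
   slot 2*i+1 holds its value. No per-token record is allocated. */
typedef struct {
    const char **items;
    size_t len; /* number of slots, always even */
} FlatTokens;

static size_t tok_count(const FlatTokens *t) { return t->len / 2; }

static const char *tok_kind(const FlatTokens *t, size_t i) {
    assert(2 * i + 1 < t->len);
    return t->items[2 * i];
}

static const char *tok_value(const FlatTokens *t, size_t i) {
    assert(2 * i + 1 < t->len);
    return t->items[2 * i + 1];
}
```

The trade-off is that callers must go through the accessors: indexing the list directly with a token number would land on the wrong slot.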

round-3-gamma: combine c_escape + scan_interp_string batching — max round-3 savings e587bedf30

Combines two orthogonal optimizations:
1. c_escape batching (from alpha): ASCII runs emitted as str_slice segments instead
   of one str_char_at string per byte. O(N) allocs → O(K) where K = special chars.

2. scan_interp_string batching (from beta): char dispatch via str_char_code (Int)
   + clean_start tracking to flush plain runs as str_slice. Eliminates per-char
   string allocations in the string-literal scanning hot path.

Result on web/src/main.el: 14.5MB -> 13.4MB peak RSS (-7.6%).
Self-hosting: PASS.
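The `clean_start` batching idea can be sketched in C like this. It is a simplified illustration, not the compiler's `c_escape`: `Buf`, `buf_append`, `escape_for`, and `c_escape_batched` are hypothetical names, and only five escapes are handled, so other control or non-ASCII bytes are copied through unchanged.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable, NUL-terminated byte buffer. */
typedef struct { char *data; size_t len, cap; } Buf;

static void buf_append(Buf *b, const char *src, size_t n) {
    if (b->len + n + 1 > b->cap) {
        size_t cap = b->cap ? b->cap : 64;
        while (cap < b->len + n + 1) cap *= 2;
        char *p = realloc(b->data, cap);
        if (!p) abort();
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    b->data[b->len] = '\0';
}

/* Escape sequence for a byte that cannot appear verbatim in a C string
   literal, or NULL when the byte can be copied as-is. */
static const char *escape_for(unsigned char c) {
    switch (c) {
    case '"':  return "\\\"";
    case '\\': return "\\\\";
    case '\n': return "\\n";
    case '\t': return "\\t";
    case '\r': return "\\r";
    default:   return NULL;
    }
}

/* Escapes s[0..len). Clean runs are appended as one slice each, so the
   number of append calls grows with the number of special bytes rather
   than with len. The caller frees the result. */
static char *c_escape_batched(const char *s, size_t len) {
    Buf out = {0};
    buf_append(&out, "", 0); /* guarantees a valid empty string */
    size_t clean_start = 0;
    for (size_t i = 0; i < len; i++) {
        const char *esc = escape_for((unsigned char)s[i]);
        if (!esc) continue;
        buf_append(&out, s + clean_start, i - clean_start); /* flush run */
        buf_append(&out, esc, strlen(esc));
        clean_start = i + 1;
    }
    buf_append(&out, s + clean_start, len - clean_start); /* trailing run */
    return out.data;
}
```

A string with no special bytes takes exactly one slice append for its whole body, which corresponds to the O(K) behavior described in item 1.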

merge round-4-delta: flat stride-2 token list + str_char_code dispatch + batch c_escape ee86736eab

- Flat token list: lexer emits [kind0, val0, kind1, val1, ...] instead of [{kind,val}, ...]
  Eliminates per-token ElMap allocation (~112B × N tokens)
- str_char_code hot loop: char classification via Int codes, no strdup per char
- Batch c_escape: str_slice clean runs instead of char-at per byte
- Parser updated to use tok_at/tok_kind/tok_value stride-2 accessors

perf: 81% RSS reduction — el_release, arena scoping, streaming codegen, libcurl stub 3726f69435

Chain of optimizations from swarm rounds 4-7:
- Flat stride-2 token list: eliminate per-token Map allocation (~112B each × N tokens)
- Systematic el_release() in parser.el: eagerly free intermediate parse result maps
- Per-function and per-statement arena scoping in codegen_streaming()
- Streaming codegen pipeline: parse one fn at a time, emit C, discard AST
- HAVE_CURL guard: elc CLI binary drops libcurl, eliminating SSL/TLS init overhead
- HTML codegen parts-list: O(n) instead of O(n²) string growth for nested templates
- Batch c_escape: str_slice clean runs instead of char-at per byte

Result: 33.4MB → 6.5MB RSS on web/src/main.el (-81%). Self-host: PASS.
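To illustrate the parts-list item: repeated concatenation copies the accumulated string on every append, which is quadratic in total length, while collecting parts and joining once copies each byte a single time. A minimal C sketch, with `join_parts` as a hypothetical helper unrelated to the actual `cg_html_*` functions:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Joins `count` strings with a single allocation: measure the total
   length first, then copy each part exactly once. Returns NULL on
   allocation failure; the caller frees the result. */
static char *join_parts(const char *const *parts, size_t count) {
    size_t total = 0;
    for (size_t i = 0; i < count; i++) total += strlen(parts[i]);

    char *out = malloc(total + 1);
    if (!out) return NULL;

    size_t off = 0;
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(parts[i]);
        memcpy(out + off, parts[i], n);
        off += n;
    }
    out[off] = '\0';
    return out;
}
```

For deeply nested templates the difference matters because every enclosing element would otherwise re-copy the output of all its children.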

will.anderson closed this pull request

2026-05-06 19:36:58 +00:00

will.anderson referenced this issue from a commit

2026-05-07 00:23:53 +00:00

Merge PR #4: perf: 81% RSS reduction in elc compiler + --test mode

Pull request closed

Reference: neuron-technologies/el#4