wnf_at is not thread-safe on shared cells
Under -T>=2 HVM4 returns different normal forms on different runs
for programs that share reducible subgraphs across multiple consumers.
The interaction calculus is confluent, so this is an implementation
bug in the C runtime, not a calculus-level issue.
Minimal reproduction
// repro.hvm
@name = "POLY_INPUT"
@main = #P{%env(@name), %env(@name)}
Build the runtime and run:
$ cd clang && clang -O2 -o main main.c
$ env POLY_INPUT=val1 ./main -C1 -T1 repro.hvm
#P{#OK{"val1"},#OK{"val1"}} # deterministic
$ for i in 1 2 3 4 5 6 7 8; do
env POLY_INPUT=val1 ./main -C1 -T4 repro.hvm
done
#P{#ERR{"ERROR(env): invalid `name`; expected #NIL or #CON(#CHR{NUM}, tail)"}, #OK{"val1"}}
#P{#OK{"val1"},#OK{"val1"}}
#P{#ERR{"ERROR(env): invalid `name`; ..."},#ERR{"ERROR(env): invalid `name`; ..."}}
#P{#OK{"val1"},#OK{"val1"}}
#P{#OK{"val1"},#OK{"val1"}}
#P{#ERR{"ERROR(env): invalid `name`; ..."},#OK{"val1"}}
#P{#OK{"val1"},#OK{"val1"}}
#P{#OK{"val1"},#OK{"val1"}}
Roughly 50% failure rate; also non-deterministic failure pattern
(sometimes left, sometimes right, sometimes both).
The empirical signature is consistent: the primitive %env's walker
chain (env_go_name → env_go_chr → env_go_num → env_go_io) observes
the character list in an inconsistent state — truncated or with BJV
placeholders at interior positions — and returns invalid \name``.
Root cause
The hot path in clang/wnf/_.c:831-858:
fn Term wnf_at(u64 loc) {
Term cur = heap_read(loc); // (1) atomic load
switch (term_tag(cur)) {
/* WNF-already cases ... */
case C00 ... C16: return cur;
...
}
Term res = wnf(cur); // (2) reduce off-heap
if (res != cur) {
heap_set(loc, res); // (3) atomic store
}
return res;
}
Steps (1) and (3) are individually atomic, but the (read → reduce → write-back) triple is not. Two worker threads both enter
wnf_at(loc) with the same cur, both reduce it independently to
equivalent res values, both write back. Individually each write is
a valid result (confluence still holds), but consumers that read the
heap cells pointed to by res now see heap locations that were freshly
allocated by one thread and have already been rewritten by the other's
reduction, producing BJV/DP0/DP1 intermediates escaping into
primitive frames.
Compare the DP0/DP1 path on the same hot loop (clang/wnf/_.c:108)
which does use heap_take:
case DP0:
case DP1: {
u64 loc = term_val(next);
Term cell = heap_take(loc); // atomic exchange: takes the cell
...
}
heap_take (clang/heap/take.c:11 — __atomic_exchange_n ... ACQUIRE)
gives one thread exclusive ownership of the cell for the duration of
the reduction. The other thread sees cell == 0 and is expected to
wait / retry. Every reduction step that commits a result back to
the heap needs this discipline, not just DP0/DP1.
Why user-level fixes don't help
- Wrapping an argument in
!!x = val; body (strict let, clang/parse/term/dup.c:74-76)
forces only the top-level WNF — not the char cells underneath.
The sub-cells still race in wnf_at.
- Rebuilding the argument through a recursive strict fold (the same
pattern used by @force_poly_bytes in callers' runtime preambles)
does visit every sub-cell, but each visit still goes through the
racy wnf_at path — at -T>=4 the fold itself deadlocks because
two workers' write-backs to the same cell contend infinitely.
- Putting a
pthread_mutex around a primitive doesn't help because
HVM4 dispatches sub-primitives from worker threads independent of
the caller's frame; the race is between the primitive's internal
wnf calls and other workers reducing the same shared subgraph.
- Calling
eval_normalize inside a primitive (clang/eval/normalize.c)
spawns its own pthread pool, which compounds the same race
because the inner SNF workers hit the same non-atomic wnf_at.
Proposed direction
Make every cache write-back in the reducers atomic with respect to
concurrent reads of the same cell. Concretely:
wnf_at (and the analogous cache steps at the end of interaction
rules in clang/wnf/*.c) should heap_take the cell before
reducing and write the result back through heap_set_rel (release-
ordered store already exists as clang/heap/set_rel.c). When a
second thread sees the cell 0, it either spins on heap_read
until the writer publishes the result, or re-enters the original
term through the ALO mechanism.
- Every reduction rule that currently does
... = wnf(x); heap_set(loc, ...)
needs the same lock-free discipline. A grep for heap_set inside
clang/wnf/*.c is the full patch surface (≈20 sites).
- Regression test: run the repro above for
-T1..-T8 in the test
harness and require byte-identical output.
The edits are mechanical but dense; the tricky part is picking the
right memory-order between the heap_take / heap_set_rel pair and
avoiding starvation when many workers hit the same cell. A reader-
waits-on-publication scheme parallels the existing term_sub_get /
term_sub_set SUB-bit protocol used in the DP0/DP1 path.
Scope
The fix is internal to clang/wnf/ and clang/heap/. No changes
needed to the language surface, the parser, or any primitive's
semantics. Single-threaded performance should not regress
appreciably: heap_take has identical cost to heap_read + first
write on the fast path.
Happy to begin implementing this if a maintainer 👍
wnf_atis not thread-safe on shared cellsUnder
-T>=2HVM4 returns different normal forms on different runsfor programs that share reducible subgraphs across multiple consumers.
The interaction calculus is confluent, so this is an implementation
bug in the C runtime, not a calculus-level issue.
Minimal reproduction
Build the runtime and run:
Roughly 50% failure rate; also non-deterministic failure pattern
(sometimes left, sometimes right, sometimes both).
The empirical signature is consistent: the primitive
%env's walkerchain (
env_go_name → env_go_chr → env_go_num → env_go_io) observesthe character list in an inconsistent state — truncated or with
BJVplaceholders at interior positions — and returns
invalid \name``.Root cause
The hot path in
clang/wnf/_.c:831-858:Steps (1) and (3) are individually atomic, but the
(read → reduce → write-back)triple is not. Two worker threads both enterwnf_at(loc)with the samecur, both reduce it independently toequivalent
resvalues, both write back. Individually each write isa valid result (confluence still holds), but consumers that read the
heap cells pointed to by
resnow see heap locations that were freshlyallocated by one thread and have already been rewritten by the other's
reduction, producing
BJV/DP0/DP1intermediates escaping intoprimitive frames.
Compare the
DP0/DP1path on the same hot loop (clang/wnf/_.c:108)which does use
heap_take:heap_take(clang/heap/take.c:11—__atomic_exchange_n ... ACQUIRE)gives one thread exclusive ownership of the cell for the duration of
the reduction. The other thread sees
cell == 0and is expected towait / retry. Every reduction step that commits a result back to
the heap needs this discipline, not just
DP0/DP1.Why user-level fixes don't help
!!x = val; body(strict let,clang/parse/term/dup.c:74-76)forces only the top-level WNF — not the char cells underneath.
The sub-cells still race in
wnf_at.pattern used by
@force_poly_bytesin callers' runtime preambles)does visit every sub-cell, but each visit still goes through the
racy
wnf_atpath — at-T>=4the fold itself deadlocks becausetwo workers' write-backs to the same cell contend infinitely.
pthread_mutexaround a primitive doesn't help becauseHVM4 dispatches sub-primitives from worker threads independent of
the caller's frame; the race is between the primitive's internal
wnfcalls and other workers reducing the same shared subgraph.eval_normalizeinside a primitive (clang/eval/normalize.c)spawns its own pthread pool, which compounds the same race
because the inner SNF workers hit the same non-atomic
wnf_at.Proposed direction
Make every cache write-back in the reducers atomic with respect to
concurrent reads of the same cell. Concretely:
wnf_at(and the analogous cache steps at the end of interactionrules in
clang/wnf/*.c) shouldheap_takethe cell beforereducing and write the result back through
heap_set_rel(release-ordered store already exists as
clang/heap/set_rel.c). When asecond thread sees the cell
0, it either spins onheap_readuntil the writer publishes the result, or re-enters the original
term through the ALO mechanism.
... = wnf(x); heap_set(loc, ...)needs the same lock-free discipline. A grep for
heap_setinsideclang/wnf/*.cis the full patch surface (≈20 sites).-T1..-T8in the testharness and require byte-identical output.
The edits are mechanical but dense; the tricky part is picking the
right memory-order between the
heap_take/heap_set_relpair andavoiding starvation when many workers hit the same cell. A reader-
waits-on-publication scheme parallels the existing
term_sub_get/term_sub_setSUB-bit protocol used in theDP0/DP1path.Scope
The fix is internal to
clang/wnf/andclang/heap/. No changesneeded to the language surface, the parser, or any primitive's
semantics. Single-threaded performance should not regress
appreciably:
heap_takehas identical cost toheap_read+ firstwrite on the fast path.
Happy to begin implementing this if a maintainer 👍