Skip to content

wnf_at is not thread-safe on shared cells #57

Description

@o1lo01ol1o

wnf_at is not thread-safe on shared cells

Under -T>=2 HVM4 returns different normal forms on different runs
for programs that share reducible subgraphs across multiple consumers.
The interaction calculus is confluent, so this is an implementation
bug in the C runtime, not a calculus-level issue.

Minimal reproduction

// repro.hvm
@name = "POLY_INPUT"
@main = #P{%env(@name), %env(@name)}

Build the runtime and run:

$ cd clang && clang -O2 -o main main.c

$ env POLY_INPUT=val1 ./main -C1 -T1 repro.hvm
#P{#OK{"val1"},#OK{"val1"}}                 # deterministic

$ for i in 1 2 3 4 5 6 7 8; do
    env POLY_INPUT=val1 ./main -C1 -T4 repro.hvm
  done
#P{#ERR{"ERROR(env): invalid `name`; expected #NIL or #CON(#CHR{NUM}, tail)"}, #OK{"val1"}}
#P{#OK{"val1"},#OK{"val1"}}
#P{#ERR{"ERROR(env): invalid `name`; ..."},#ERR{"ERROR(env): invalid `name`; ..."}}
#P{#OK{"val1"},#OK{"val1"}}
#P{#OK{"val1"},#OK{"val1"}}
#P{#ERR{"ERROR(env): invalid `name`; ..."},#OK{"val1"}}
#P{#OK{"val1"},#OK{"val1"}}
#P{#OK{"val1"},#OK{"val1"}}

Roughly 50% failure rate; also non-deterministic failure pattern
(sometimes left, sometimes right, sometimes both).

The empirical signature is consistent: the primitive %env's walker
chain (env_go_name → env_go_chr → env_go_num → env_go_io) observes
the character list in an inconsistent state — truncated or with BJV
placeholders at interior positions — and returns invalid \name``.

Root cause

The hot path in clang/wnf/_.c:831-858:

fn Term wnf_at(u64 loc) {
  Term cur = heap_read(loc);        // (1) atomic load
  switch (term_tag(cur)) {
    /* WNF-already cases ... */
    case C00 ... C16: return cur;
    ...
  }
  Term res = wnf(cur);              // (2) reduce off-heap
  if (res != cur) {
    heap_set(loc, res);             // (3) atomic store
  }
  return res;
}

Steps (1) and (3) are individually atomic, but the (read → reduce → write-back) triple is not. Two worker threads both enter
wnf_at(loc) with the same cur, both reduce it independently to
equivalent res values, both write back. Individually each write is
a valid result (confluence still holds), but consumers that read the
heap cells pointed to by res now see heap locations that were freshly
allocated by one thread and have already been rewritten by the other's
reduction, producing BJV/DP0/DP1 intermediates escaping into
primitive frames.

Compare the DP0/DP1 path on the same hot loop (clang/wnf/_.c:108)
which does use heap_take:

case DP0:
case DP1: {
  u64 loc = term_val(next);
  Term cell = heap_take(loc);       // atomic exchange: takes the cell
  ...
}

heap_take (clang/heap/take.c:11__atomic_exchange_n ... ACQUIRE)
gives one thread exclusive ownership of the cell for the duration of
the reduction. The other thread sees cell == 0 and is expected to
wait / retry. Every reduction step that commits a result back to
the heap needs this discipline, not just DP0/DP1.

Why user-level fixes don't help

  • Wrapping an argument in !!x = val; body (strict let, clang/parse/term/dup.c:74-76)
    forces only the top-level WNF — not the char cells underneath.
    The sub-cells still race in wnf_at.
  • Rebuilding the argument through a recursive strict fold (the same
    pattern used by @force_poly_bytes in callers' runtime preambles)
    does visit every sub-cell, but each visit still goes through the
    racy wnf_at path — at -T>=4 the fold itself deadlocks because
    two workers' write-backs to the same cell contend infinitely.
  • Putting a pthread_mutex around a primitive doesn't help because
    HVM4 dispatches sub-primitives from worker threads independent of
    the caller's frame; the race is between the primitive's internal
    wnf calls and other workers reducing the same shared subgraph.
  • Calling eval_normalize inside a primitive (clang/eval/normalize.c)
    spawns its own pthread pool, which compounds the same race
    because the inner SNF workers hit the same non-atomic wnf_at.

Proposed direction

Make every cache write-back in the reducers atomic with respect to
concurrent reads of the same cell. Concretely:

  1. wnf_at (and the analogous cache steps at the end of interaction
    rules in clang/wnf/*.c) should heap_take the cell before
    reducing and write the result back through heap_set_rel (release-
    ordered store already exists as clang/heap/set_rel.c). When a
    second thread sees the cell 0, it either spins on heap_read
    until the writer publishes the result, or re-enters the original
    term through the ALO mechanism.
  2. Every reduction rule that currently does ... = wnf(x); heap_set(loc, ...)
    needs the same lock-free discipline. A grep for heap_set inside
    clang/wnf/*.c is the full patch surface (≈20 sites).
  3. Regression test: run the repro above for -T1..-T8 in the test
    harness and require byte-identical output.

The edits are mechanical but dense; the tricky part is picking the
right memory-order between the heap_take / heap_set_rel pair and
avoiding starvation when many workers hit the same cell. A reader-
waits-on-publication scheme parallels the existing term_sub_get /
term_sub_set SUB-bit protocol used in the DP0/DP1 path.

Scope

The fix is internal to clang/wnf/ and clang/heap/. No changes
needed to the language surface, the parser, or any primitive's
semantics. Single-threaded performance should not regress
appreciably: heap_take has identical cost to heap_read + first
write on the fast path.

Happy to begin implementing this if a maintainer 👍

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions