Bugfixes, benchmarks and improvements to FlatMap #1882

Merged: kennyweiss merged 28 commits into develop from feature/kweiss/flatmap-improvements on Jun 18, 2026
@kennyweiss kennyweiss commented Jun 11, 2026

Summary

  • This PR adds some bugfixes and performance improvements to axom::FlatMap
  • It also adds an initial benchmark suite for FlatMap against std::unordered_map, google sparsehash and std::map
  • Bugfixes:
    • There were bugs related to truncating hashes to 32 bits (when IndexType is 32 bits), casting from float to int, using operator[] on const maps, and the copy-assign operator.
  • Optimizations:
    • Since the group count is always a power of 2, we can use bitmasks rather than mod (%)
    • Specialized batch insertion for sequential exec policy, where we don't need to worry about synchronization

Results/comparisons

Serial benchmark results using a RelWithDebInfo config (lower is better)

We now get roughly comparable or better results in serial compared to std::unordered_map, std::map and our vendored google sparsehash. In the plots, FlatMap uses our default hash function, while FlatMapFastHash uses a different hash function that appears to be somewhat faster.

Hashing 32K pairs ($2^{15}$)

image

Hashing 1M pairs ($2^{20}$)

image

Serial vs. OMP vs GPU

This branch shows some modest speedups over develop. The plots compare SEQ and OMP on this branch against axom@develop.
OMP was run with {1,2,4,8,16,32,64} threads, using

OMP_NUM_THREADS=<n> OMP_PLACES=cores OMP_PROC_BIND=close

Hashing 32K pairs ($2^{15}$)

core_flatmap_speedup_wall_N32768

Hashing 1M pairs ($2^{20}$)

core_flatmap_speedup_wall_N1048576

Hashing 32M pairs ($2^{25}$)

core_flatmap_speedup_wall_N33554432

@kennyweiss kennyweiss self-assigned this Jun 11, 2026
@kennyweiss kennyweiss added the bug, Core and Performance labels Jun 11, 2026
@kennyweiss kennyweiss marked this pull request as ready for review June 12, 2026 22:11
@kennyweiss kennyweiss added this to the FY26 August release milestone Jun 13, 2026
@kennyweiss kennyweiss force-pushed the feature/kweiss/flatmap-improvements branch 2 times, most recently from 075a25c to 0e6a84b Compare June 16, 2026 16:21

@BradWhitlock BradWhitlock left a comment

Thanks @kennyweiss.

Adds typed tests covering assignment over a non-empty target,
source preservation, and self-assignment.

Removing the const operator[] cannot break callers, since any call to it would not have compiled.
Const callers should use find()/at()/count()/contains().
at() throws std::out_of_range on a missing key.
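The const-access pattern can be sketched as follows. This uses std::unordered_map as a stand-in, since its operator[] is also non-const (it may insert); the helper names `lookupOr` and `atThrowsOnMissing` are hypothetical and not part of Axom.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Stand-in for a const FlatMap: std::unordered_map::operator[] is non-const
// because it may insert a default-constructed value.
using Map = std::unordered_map<int, std::string>;

// Hypothetical helper: look up through find(), which never inserts.
inline std::string lookupOr(const Map& m, int key, const std::string& fallback)
{
  const auto it = m.find(key);
  return it != m.end() ? it->second : fallback;
}

// Hypothetical helper: reports whether at() throws std::out_of_range for key.
inline bool atThrowsOnMissing(const Map& m, int key)
{
  try
  {
    (void)m.at(key);
    return false;
  }
  catch(const std::out_of_range&)
  {
    return true;
  }
}
```

Both helpers take the map by const reference, so neither could have used operator[].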
DeviceHashHelper returned axom::IndexType, and integer keys were converted to that type
before the 64-bit mixer ran. With AXOM_USE_64BIT_INDEXTYPE=OFF, every key wider than 32 bits
was truncated first, so keys equal mod 2^32 produced identical final hashes.
This was happening with the Morton codes in spin's SparseOctreeLevel and in numerics/quadrature.
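A minimal sketch of why narrowing before mixing collides. The mixer below is a splitmix64-style finalizer chosen for illustration; it is an assumption and not necessarily the mixer Axom uses, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: a splitmix64-style finalizer stands in for the 64-bit mixer.
inline std::uint64_t mix64(std::uint64_t x)
{
  x ^= x >> 30;
  x *= 0xbf58476d1ce4e5b9ULL;
  x ^= x >> 27;
  x *= 0x94d049bb133111ebULL;
  x ^= x >> 31;
  return x;
}

// Buggy shape: the key is narrowed to 32 bits before the mixer runs.
// (IndexType would be signed; an unsigned type keeps the narrowing well defined.)
inline std::uint64_t hashTruncatedFirst(std::uint64_t key)
{
  const std::uint32_t narrowed = static_cast<std::uint32_t>(key);
  return mix64(narrowed);
}

// Fixed shape: the mixer sees all 64 bits of the key.
inline std::uint64_t hashFullWidth(std::uint64_t key) { return mix64(key); }
```

Two keys that differ only above bit 31, such as 42 and 42 + 2^32, hash identically in the first version and differently in the second.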
The floating-point specialization returned the key converted to an integer.
Every key sharing an integer part therefore collided --
e.g. all numbers between -1 and 1 converted to the integer 0,
so a FlatMap keyed on fractional floats degenerated into one probe chain with O(size) inserts and finds.
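The collision can be reproduced in a few lines. The second function shows one common remedy, hashing the bit pattern instead of the converted value; this is an illustrative assumption and may differ from the fix in this PR. Both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

static_assert(sizeof(double) == sizeof(std::uint64_t), "sketch assumes 64-bit double");

// Buggy shape: converting to an integer discards the fractional part.
inline std::uint64_t hashByConversion(double key)
{
  return static_cast<std::uint64_t>(static_cast<std::int64_t>(key));
}

// One possible remedy (an assumption): hash the bit pattern of the key.
inline std::uint64_t hashByBits(double key)
{
  // Fold -0.0 onto +0.0 so keys that compare equal also hash equally.
  if(key == 0.0)
  {
    key = 0.0;
  }
  std::uint64_t bits;
  std::memcpy(&bits, &key, sizeof(bits));
  return bits;  // a real hash would pass this through a 64-bit mixer
}
```

With the conversion, 0.25 and -0.75 both map to 0; their bit patterns differ, so the second version separates them.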
The quadratic probe advance in probeIndex and probeEmptyIndex wrapped
using a mod (%) operator. Since the group count is always a power of two,
we can use a bitmask instead.
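The identity behind this change, sketched with hypothetical helper names: for a power-of-two `groupCount`, `index % groupCount` equals `index & (groupCount - 1)`.

```cpp
#include <cassert>
#include <cstddef>

inline bool isPowerOfTwo(std::size_t n) { return n != 0 && (n & (n - 1)) == 0; }

// Wrap an index into [0, groupCount) with a mask instead of '%'.
// Only valid when groupCount is a power of two.
inline std::size_t wrapIndex(std::size_t index, std::size_t groupCount)
{
  assert(isPowerOfTwo(groupCount));
  return index & (groupCount - 1);
}
```

The mask avoids an integer division on every probe step; for other group counts the two expressions would disagree, which is why the precondition is asserted.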

Adds a cross-group probe stress test: a degenerate hash drives 600
keys through one initial group so inserts, lookups, misses, erases,
and reinserts all walk and wrap the group sequence.

BM_Find_Hit looks keys up in the order they were inserted.
With that access order, node-based maps walk the heap nearly sequentially,
so the hardware prefetcher hides their pointer-chasing latency.

This commit adds find_hit_shuffled (same keys, independently shuffled lookup order)
and find_hit_randkeys (distinct pseudorandom 64-bit keys, shuffled lookup order)
to better exhibit expected lookup behavior.
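A sketch of how a shuffled lookup order can be generated for such a benchmark. The function name and the choice of std::mt19937_64 with a fixed seed are assumptions for illustration, not the benchmark's actual code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Build the keys 0..n-1 and return them in a shuffled order.
// A fixed seed keeps the order repeatable from run to run.
inline std::vector<std::int64_t> shuffledLookupOrder(std::size_t n, std::uint64_t seed)
{
  std::vector<std::int64_t> keys(n);
  std::iota(keys.begin(), keys.end(), std::int64_t {0});
  std::mt19937_64 rng(seed);
  std::shuffle(keys.begin(), keys.end(), rng);
  return keys;
}
```

Looking keys up in this order decouples the lookup sequence from the allocation order, so node-based containers no longer benefit from near-sequential heap access.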
When find_with_hash() is not inlined, every lookup is more expensive
(extra registers, and a stack spill for the key) and requires loop-invariant setup
that cannot be hoisted out of the caller's lookup loop.

Forcing the probe path inline removed 20-40% of find_hit time and 15-35%
of find_miss time for FlatMap<int64,int64> at n = 2^16 and 2^20.

`getEmplacePos()` computed `Hash{}(key)`, then called `find(key)`,
which hashed the same key a second time.
It then performed a floating-point division against MAX_LOAD_FACTOR
on every insertion to decide whether to grow.

Note: This reduced instruction count, but the performance improvements
were within run-to-run noise in our measurements.

FlatMap rounds its group count up to a power of two, so for a fixed
element count the achievable load factors form a geometric ladder and a
nominal target is quantized to the next rung at or below it. At n = 2^16
the 0.70 target and the default reserve(n) geometry coincide (actual load
factor 0.533, which is why find_hit_lf0p70 reproduced find_hit to within
noise), and the 0.50 target lands at 0.267 -- a table twice as large.
That scenario was really measuring a larger working set, not a shorter
probe sequence.
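The quoted load factors can be reproduced with a short calculation. The sizing rule below (enough 15-slot groups for n elements at the target load factor, rounded up to a power of two) is an assumption that happens to match the numbers above; FlatMap's actual reserve logic may differ, and all names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumption: 15 slots per group, the slot count cited for GroupBucket in this PR.
constexpr std::size_t SLOTS_PER_GROUP = 15;

inline std::size_t nextPowerOfTwo(std::size_t n)
{
  std::size_t p = 1;
  while(p < n)
  {
    p <<= 1;
  }
  return p;
}

// Assumed sizing rule: groups needed at the target load factor,
// rounded up to the next power of two.
inline std::size_t groupCountFor(std::size_t n, double targetLoadFactor)
{
  assert(targetLoadFactor > 0.0 && targetLoadFactor <= 1.0);
  const double slotsNeeded = static_cast<double>(n) / targetLoadFactor;
  const auto groupsNeeded =
    static_cast<std::size_t>(std::ceil(slotsNeeded / static_cast<double>(SLOTS_PER_GROUP)));
  return nextPowerOfTwo(groupsNeeded);
}

// Load factor actually achieved once the group count is quantized.
inline double actualLoadFactor(std::size_t n, std::size_t groupCount)
{
  return static_cast<double>(n) / static_cast<double>(groupCount * SLOTS_PER_GROUP);
}
```

For n = 2^16, a 0.70 target needs about 6242 groups and rounds up to 8192 (65536 / 122880 ≈ 0.533), while a 0.50 target needs about 8739 groups and rounds up to 16384 (65536 / 245760 ≈ 0.267).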
The SSE2 path of GroupBucket::visitHashBucket() stops visiting as soon as
the visitor returns false, but the scalar fallback (including the GPU path)
ignored the return value and kept scanning all 15 slots.

In-tree visitors and the duplicate check in the batched insert path
return false to mean 'stop', and each extra visit loads and compares a key,
which could incur a cache miss per probe group.
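A simplified sketch of a scalar scan that honors the visitor's return value. The metadata layout (one reduced-hash byte per slot) and the function name are assumptions for illustration; this is not the GroupBucket implementation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t SLOTS_PER_GROUP = 15;

// Scalar scan over one group's metadata bytes.
// The visitor is called for each matching slot and returns false to mean 'stop'.
template <typename Visitor>
inline void visitMatchingSlots(const std::array<std::uint8_t, SLOTS_PER_GROUP>& metadata,
                               std::uint8_t reducedHash,
                               Visitor&& visitor)
{
  for(std::size_t slot = 0; slot < SLOTS_PER_GROUP; ++slot)
  {
    if(metadata[slot] == reducedHash)
    {
      if(!visitor(slot))
      {
        return;  // honor the early-exit request instead of scanning on
      }
    }
  }
}
```

Dropping the `return` gives the old behavior: every matching slot is visited even after the visitor has found what it was looking for.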
Emplacing a new key walked the probe sequence twice -- first to check
whether the key was already present and then again to find an empty slot.
We now do both within a single call.
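The idea of combining both walks can be illustrated on a generic open-addressing table. This sketch uses linear probing with tombstones, which is an assumption and not FlatMap's group-based layout; `Slot`, `SlotState` and `findOrInsertionSlot` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class SlotState { Empty, Deleted, Full };

struct Slot
{
  SlotState state = SlotState::Empty;
  int key = 0;
};

// Single walk of the probe sequence.
// Returns {index, true} if key is present, else {first reusable slot, false}.
// Assumes a power-of-two table size and at least one Empty slot.
inline std::pair<std::size_t, bool> findOrInsertionSlot(const std::vector<Slot>& table,
                                                        int key,
                                                        std::size_t hash)
{
  assert(!table.empty() && (table.size() & (table.size() - 1)) == 0);
  const std::size_t mask = table.size() - 1;
  std::size_t firstFree = table.size();  // sentinel: no reusable slot seen yet
  for(std::size_t i = hash & mask;; i = (i + 1) & mask)
  {
    const Slot& s = table[i];
    if(s.state == SlotState::Full && s.key == key)
    {
      return {i, true};
    }
    if(s.state != SlotState::Full && firstFree == table.size())
    {
      firstFree = i;  // remember the first Deleted or Empty slot
    }
    if(s.state == SlotState::Empty)
    {
      return {firstFree, false};  // key cannot lie beyond an Empty slot
    }
  }
}
```

The caller gets either the existing entry or the slot to construct into, without probing a second time.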
* Disables sequential find_hit search by default since it is not representative.
* Guards several tests by the feature they are testing
Also adds more device hashing tests
Also improves device hashing of floating point types (float and long double).
@kennyweiss kennyweiss force-pushed the feature/kweiss/flatmap-improvements branch from 0e6a84b to d8bb8e9 Compare June 18, 2026 00:27
@kennyweiss kennyweiss merged commit e8cf58d into develop Jun 18, 2026
15 checks passed
@kennyweiss kennyweiss deleted the feature/kweiss/flatmap-improvements branch June 18, 2026 04:10