diff --git a/rfcs/proposed/numa_support/README.md b/rfcs/proposed/numa_support/README.md index 38f678d3e7..5dc305e899 100755 --- a/rfcs/proposed/numa_support/README.md +++ b/rfcs/proposed/numa_support/README.md @@ -139,6 +139,8 @@ This [sub-proposal is supported](../../supported/numa_support/create-numa-arenas Define allocators or other features that simplify the process of allocating or placing data onto specific NUMA nodes. +[Interleaved allocation](interleaved-allocation.md) can be a useful kind of NUMA-aware allocations. + ### Simplified approaches to associate task distribution with data placement As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures. diff --git a/rfcs/proposed/numa_support/interleaved-allocation.md b/rfcs/proposed/numa_support/interleaved-allocation.md new file mode 100644 index 0000000000..92a672248a --- /dev/null +++ b/rfcs/proposed/numa_support/interleaved-allocation.md @@ -0,0 +1,70 @@ +# API to allocate memory interleaved between NUMA nodes + +*Note:* This document is a sub-RFC of the [umbrella RFC about improving NUMA +support](README.md). + +## Motivation + +There are two kinds of NUMA-related performance bottlenecks: latency increasing due to +access to a remote node and bandwidth-limited simultaneous access from different CPUs to +a single NUMA memory node. A well-known method to mitigate both is a distribution of +memory objects that are accessed from different CPUs to different NUMA nodes in such a way +that matches an access pattern. If the access pattern is complex enough, a simple +round-robin distribution can be good enough. The distribution can be achieved either by employing a first-touch policy of NUMA memory allocation or via special platform-dependent +API. Generally, the latter requires less overhead. + +## Requirements to public API + +Free, stateless functions, similar to malloc, are sufficient for the allocation of large +blocks of memory. To guide the spreading of blocks across NUMA nodes, two additional +parameters are proposed: `interleaving step` and `list of NUMA nodes to perform +allocations on`. This single function then serves as a provider of memory blocks with at +least page granularity and will not employ internal caching. If high-performance, smaller +and repetitive allocations are needed, then `std::pmr` or other solutions should be used. + +`interleaving step` is the size of the contiguous memory block from a particular NUMA +node, it has page granularity. Currently there are no clear use cases for granularity more +than page size. + +`list of nodes for allocation` is conceptually a set of `tbb::numa_node_id`. However, +because `tbb::numa_nodes()` returns `std::vector` and creating a `std::set` from it +requires allocation, `vector` can be used. Because semantics of `tbb::numa_node_id` is +not defined, we can't use it to construct e.g., a bit mask. Allocation that is unbalanced +between NUMA nodes doesn't seem to have useful applications, so repeated elements in `list +of nodes` is an error. + +One use case for `list of nodes` argument is the desire to run parallel activity on subset of +nodes and so get memory only from those nodes. + +Most common usage of the allocation function is expected only with `size` parameter. +In this case, `interleaving_step` defaults to the page size and memory is allocated on all +NUMA nodes. + +```c++ +void *tbb::numa::alloc_interleaved(size_t size, size_t interleaving_step = 0, + const std::vector *nodes = nullptr); +void tbb::numa::free_interleaved(void *ptr, size_t size); +``` + +## Implementation details + +Under Linux, only allocations with default interleaving can be supported via HWLOC. Other +interleaving steps require direct libnuma usage, that creates yet another run-time +dependency. It's possible to implement allocation with constant number of system call wrt +allocation size. + +Under Windows, starting Windows 10 and WS 2016, `VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)` +can be used to provide desired interleaving, but number of system calls is proportional to +allocation size. For older Windows, either fallback to `VirtualAlloc` or manual touching +from threads pre-pinned to NUMA nodes can be used. + +There is no NUMA memory support under macOS, so the implementation can only fall back to +`malloc`. + +## Open Questions + +When non-default `interleaving step` can be used? + +`size` argument for `free_interleaved()` appeared because what we have is wrappers over +`mmap`/`munmap` and there is no place to put the size after memory is allocated. We can +put it in, say, an internal cumap. Is it look useful?