2 changes: 2 additions & 0 deletions rfcs/proposed/numa_support/README.md

@@ -139,6 +139,8 @@ This [sub-proposal is supported](../../supported/numa_support/create-numa-arenas
Define allocators or other features that simplify the process of allocating or placing data onto
specific NUMA nodes.

[Interleaved allocation](interleaved-allocation.md) is one useful kind of NUMA-aware allocation.

### Simplified approaches to associate task distribution with data placement

As discussed earlier, NUMA-aware allocation is just the first step in optimizing for NUMA architectures.
70 changes: 70 additions & 0 deletions rfcs/proposed/numa_support/interleaved-allocation.md

@@ -0,0 +1,70 @@
# API to allocate memory interleaved between NUMA nodes

*Note:* This document is a sub-RFC of the [umbrella RFC about improving NUMA
support](README.md).

## Motivation

There are two kinds of NUMA-related performance bottlenecks: increased latency when
accessing memory on a remote node, and bandwidth saturation when many CPUs simultaneously
access a single NUMA memory node. A well-known way to mitigate both is to distribute
memory objects that are accessed from different CPUs across different NUMA nodes in a way
that matches the access pattern. If the access pattern is complex enough, a simple
round-robin distribution can be good enough. The distribution can be achieved either by
relying on the first-touch policy of NUMA memory allocation or via a special
platform-dependent API. The latter generally incurs less overhead.

## Requirements for the public API

Free, stateless functions similar to `malloc` are sufficient for allocating large blocks
of memory. To guide the spreading of blocks across NUMA nodes, two additional parameters
are proposed: an `interleaving step` and a `list of NUMA nodes to perform allocations on`.
This single function then serves as a provider of memory blocks with at least page
granularity and does not employ internal caching. If high-performance, small, and
repetitive allocations are needed, `std::pmr` or other solutions should be used instead.

The `interleaving step` is the size of the contiguous memory block placed on a particular
NUMA node; it has page granularity. Currently there are no clear use cases for granularity
larger than the page size.

The `list of nodes for allocation` is conceptually a set of `tbb::numa_node_id`. However,
because `tbb::numa_nodes()` returns a `std::vector` and creating a `std::set` from it
requires an allocation, a `vector` can be used instead. Because the semantics of
`tbb::numa_node_id` are not defined, we cannot use it to construct, e.g., a bit mask.
Allocation that is unbalanced between NUMA nodes does not seem to have useful
applications, so repeated elements in the `list of nodes` are an error.

One use case for the `list of nodes` argument is the desire to run a parallel activity on
a subset of nodes and thus get memory only from those nodes.

The most common usage of the allocation function is expected to pass only the `size`
parameter. In this case, `interleaving_step` defaults to the page size and memory is
allocated on all NUMA nodes.

```c++
void *tbb::numa::alloc_interleaved(size_t size, size_t interleaving_step = 0,
                                   const std::vector<tbb::numa_node_id> *nodes = nullptr);
void tbb::numa::free_interleaved(void *ptr, size_t size);
```

## Implementation details

Under Linux, only allocations with the default interleaving can be supported via HWLOC.
Other interleaving steps require direct libnuma usage, which creates yet another run-time
dependency. It is possible to implement the allocation with a constant number of system
calls with respect to the allocation size.

Under Windows, starting with Windows 10 and Windows Server 2016,
`VirtualAlloc2(MEM_REPLACE_PLACEHOLDER)` can be used to provide the desired interleaving,
but the number of system calls is proportional to the allocation size. For older Windows
versions, either a fallback to `VirtualAlloc` or manual touching from threads pre-pinned
to NUMA nodes can be used.

There is no NUMA memory support under macOS, so the implementation can only fall back to
`malloc`.

## Open Questions

When can a non-default `interleaving step` be used?

The `size` argument of `free_interleaved()` exists because the implementation is a
wrapper over `mmap`/`munmap`, and there is no place to store the size once the memory is
allocated. We could store it in, say, an internal map. Would that be useful?