Added StreamLoad op #2044
I'm concerned about performance and correctness on x86. _mm_stream_load_si128 is super slow (hundreds of cycles) and only really intended for WC (write-combining) memory, i.e. memory-mapped I/O. It does seem useful for drivers that actually do want to bulk-load from WC: https://community.intel.com/t5/Intel-ISA-Extensions/Do-Non-Temporal-Loads-Prefetch/m-p/1027104

If so, then we also have errata HSD162, BDM116 and SKL079 to deal with, concerning ordering with respect to LOCK and MFENCE. Possibly we can just document that.

If the hope is rather that, when we load from normal WB (write-back) memory, the cache line is marked as preferred for discarding, do we have evidence of a benefit? The past few times I've tried this and similar things, I was disappointed.

Possible options: rely on prefetches to set the hint we'd like before the actual load, and/or make the x86 StreamLoad equivalent to Load if you'd still like to target the SVE instruction. What do you think?
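To make the first option concrete, here is a minimal sketch of "prefetch with a non-temporal hint, then an ordinary aligned load". `CopyWithNTAHint`, its parameters and the prefetch distance are hypothetical names and values chosen for illustration; they are not part of this PR or of the library. On targets without SSE2 the sketch falls back to a plain copy, and whether the hint helps at all on WB memory would still need to be measured.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_load_si128, _mm_storeu_si128
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_NTA
#endif

// Hypothetical helper (illustration only): copies `num_blocks` 16-byte blocks
// from `aligned_src` (must be 16-byte aligned) to `dst` (any alignment).
// Instead of a streaming load, it issues a prefetch with the non-temporal
// hint a few blocks ahead and then uses a regular aligned load.
inline void CopyWithNTAHint(const uint8_t* aligned_src, uint8_t* dst,
                            size_t num_blocks) {
  constexpr size_t kBlockBytes = 16;
  // Assumed prefetch distance: 4 blocks = 64 bytes. This is a guess to be
  // tuned by measurement, not a recommendation.
  constexpr size_t kPrefetchAhead = 4;

  for (size_t i = 0; i < num_blocks; ++i) {
    const uint8_t* from = aligned_src + i * kBlockBytes;
#if defined(__SSE2__)
    // Only prefetch addresses that lie inside the source buffer.
    if (i + kPrefetchAhead < num_blocks) {
      _mm_prefetch(
          reinterpret_cast<const char*>(from + kPrefetchAhead * kBlockBytes),
          _MM_HINT_NTA);
    }
    // Ordinary aligned load; no MOVNTDQA involved.
    const __m128i v = _mm_load_si128(reinterpret_cast<const __m128i*>(from));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i * kBlockBytes), v);
#else
    // Portable fallback without any cache hint.
    std::memcpy(dst + i * kBlockBytes, from, kBlockBytes);
#endif
  }
}
```

The second option (x86 StreamLoad equivalent to Load) would amount to dropping the prefetch and keeping only the aligned load.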
Added the StreamLoad op because SSE4, AVX2, AVX3 and PPC have non-temporal aligned load instructions for vectors of 16 bytes or larger, and SVE has non-temporal load instructions for all vector sizes.