Added StreamLoad op by johnplatts · Pull Request #2044 · google/highway

johnplatts · 2024-03-29T16:26:32Z

Added the StreamLoad op, since SSE4/AVX2/AVX3/PPC have non-temporal aligned load instructions for vectors of 16 bytes or larger, and SVE has non-temporal load instructions for all vector sizes.

jan-wassenberg · 2024-04-02T05:02:22Z

I'm concerned about performance and correctness on x86. `_mm_stream_load_si128` is super slow (hundreds of cycles) and only really intended for WC (write-combining) memory, i.e. memory-mapped I/O. It does seem useful for drivers that actually do want to bulk-load from WC: https://community.intel.com/t5/Intel-ISA-Extensions/Do-Non-Temporal-Loads-Prefetch/m-p/1027104
Is that the intended use case?

If so, then we also have errata HSD162, BDM116 and SKL079 to deal with, concerning ordering with respect to LOCK and MFENCE. Possibly we can just document that.

If the hope is rather that, when we load from normal WB (write-back) memory, the cache line is marked as preferred for discarding, do we have evidence of a benefit? The past few times I've tried this and similar things, I was disappointed.

Possible options: rely on prefetches to set the hint we'd like before the actual load, and/or make the x86 StreamLoad equivalent to Load if you'd still like to target the SVE instruction. What do you think?
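To make the first option concrete, here is a minimal sketch of "prefetch with a non-temporal hint, then do an ordinary load". It is not Highway code: `PrefetchNonTemporal` and `SumWithStreamHint` are hypothetical names, and the 64-byte line size and the prefetch distance are assumptions. It relies on the GCC/Clang builtin `__builtin_prefetch`, whose locality argument of 0 requests low temporal locality (typically lowered to `PREFETCHNTA` on x86). Whether this helps on WB memory would still need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not a Highway op): hints that the cache line at `p`
// is unlikely to be reused soon. Compilers without the builtin get a no-op.
inline void PrefetchNonTemporal(const void* p) {
#if defined(__GNUC__) || defined(__clang__)
  __builtin_prefetch(p, /*rw=*/0, /*locality=*/0);
#else
  (void)p;
#endif
}

// Hypothetical stand-in for a "StreamLoad == Load plus hint" fallback:
// sums `count` values using ordinary loads, issuing one non-temporal
// prefetch per cache line a fixed distance ahead of the current position.
inline uint64_t SumWithStreamHint(const uint32_t* data, size_t count) {
  assert(data != nullptr || count == 0);
  constexpr size_t kLineElems = 64 / sizeof(uint32_t);  // assumes 64-byte lines
  constexpr size_t kAhead = 8 * kLineElems;  // assumed prefetch distance
  uint64_t sum = 0;
  for (size_t i = 0; i < count; ++i) {
    // Only prefetch addresses that lie inside the array.
    if (i % kLineElems == 0 && i + kAhead < count) {
      PrefetchNonTemporal(data + i + kAhead);
    }
    sum += data[i];  // ordinary (temporal) load
  }
  return sum;
}
```

The result is identical to a plain loop; only the cache hint differs, so any benefit has to show up in benchmarks rather than in output.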

Added StreamLoad op

0450d19

johnplatts wants to merge 1 commit into google:master from johnplatts:hwy_stream_load_032924
