Hint to the compiler to inline the fill_window function#270
Conversation
|
compare256 functions are not inlined to longest_match? |
They are in zlib-rs, although in zlib-ng they seem to be called through a function pointer (I guess to allow run-time selection of different SSE/AVX implementations from a single x86 build) |
|
What happens exactly depends on the target features that are enabled at compile time: on most x86_64 CPUs Combined with that, by default the avx2 intrinsics are not inlined, unless the surrounding function is marked as having @brian-pane I'm seeing an improvement for level 1, but a regression for compression level 2 Level 3 and onward appear fine. So this might need some further tweaking (maybe moving some cold code out of the function you now inline?) |
|
I moved what I could out of the inlined part in bb2c510. It might not be enough, though. |
|
Zlib-ng inlines compare256 by having associated versions of longest_match, then longest_match which is called more infrequently is called by a function pointer. Only direct function pointer to compare256 is called in deflate_quick. |
|
I've been playing around with this, and there is definitely something here, but I believe that the only case where this matters for I've done some experiments with adding But, there is not a very ergonomic way to specialize for a range of target features now (though there is a proposed project goal that hopefully brings a solution closer). I did experiment with using a function pointer dispatch method, and that does help a bit #273, but as mentioned using avx2 in more places is advantageous. We'll need to carefully weigh performance and maintainability here. |
bb2c510 to
cc6f005
Compare
folkertdev
left a comment
There was a problem hiding this comment.
after the recent changes (i guess to the state), just the inlining is an improvement now
Benchmark 2 (64 runs): target/release/examples/blogpost-compress 1 rs silesia-small.tar
measurement mean ± σ min … max outliers delta
wall_time 78.7ms ± 813us 77.6ms … 82.5ms 4 ( 6%) ⚡- 3.1% ± 0.4%
peak_rss 26.7MB ± 93.3KB 26.6MB … 26.9MB 0 ( 0%) + 0.1% ± 0.1%
cpu_cycles 286M ± 2.82M 283M … 299M 6 ( 9%) ⚡- 4.3% ± 0.4%
instructions 591M ± 268 591M … 591M 0 ( 0%) ⚡- 1.7% ± 0.0%
cache_references 19.9M ± 151K 19.7M … 20.5M 3 ( 5%) - 0.4% ± 0.3%
cache_misses 405K ± 73.2K 306K … 650K 3 ( 5%) - 0.7% ± 8.9%
branch_misses 2.97M ± 5.83K 2.96M … 3.00M 6 ( 9%) - 0.4% ± 0.1%
for the other levels there are some small improvements to instructions, but in any case no regressions.

Inspired by @nmoinvaz suggestion in issue #18, I went through the functions that currently aren't inlined and tried adding a compiler hint to inline them. Some didn't help or even caused regressions when inlined, but this one seems to improve performance at compression level 1:
while not hurting performance at higher compression levels: