
Add POWER10 optimization (WIP) #391

Closed
runlevel5 wants to merge 1 commit into ec-:main from runlevel5:optimisation


@runlevel5
Contributor

This is being tested by PPC community testers. Please refrain from reviewing it for now.

Gated behind USE_ISA_3_1 (-mcpu=power10). Each prefixed instruction is
emitted as a stable 12-byte block to keep per-pass code size deterministic
across the existing multi-pass compile.
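
One plausible way to get a fixed 12-byte block is sketched below. It is an assumption, not the code from this PR. Power ISA 3.1 does not allow a prefixed instruction to cross a 64-byte boundary, so the sketch pads with one nop, placed either before or after the 8-byte instruction. The helper name `emit_prefixed_block` and the `offset` parameter (byte offset from a 64-byte-aligned code base) are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PPC_NOP 0x60000000u /* ori 0,0,0 */

/* Hypothetical helper (not from the PR): writes one prefixed instruction
 * plus one nop, so the block is always 3 words (12 bytes).
 * 'offset' is the byte offset of the block from a code base that is
 * assumed to be 64-byte aligned. Returns the number of words written. */
static size_t emit_prefixed_block(uint32_t *out, size_t offset,
                                  uint32_t prefix, uint32_t suffix)
{
    if ((offset & 63u) == 60u) {
        /* The prefix word would be the last word before a 64-byte
         * boundary, so push the instruction past it with a leading nop. */
        out[0] = PPC_NOP;
        out[1] = prefix;
        out[2] = suffix;
    } else {
        /* No boundary crossing: instruction first, trailing nop as filler. */
        out[0] = prefix;
        out[1] = suffix;
        out[2] = PPC_NOP;
    }
    return 3;
}
```

Because the block size never depends on the final address, offsets computed in an earlier pass stay valid in later passes; the cost is one wasted nop per prefixed instruction.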

- pli in emit_MOVi64 for values that fit signed 34 bits but not signed 32
- paddi in mov_rx_local and emit_CheckJump long-offset paths
- plbz/plhz/plwz/plfs/pstb/psth/pstw/pstfs in OP_LOAD/OP_STORE long-offset
  paths (no temp register needed)
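
The range test behind the first bullet, and the two words of a `pli`, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the PR's code: the helper names `fits_signed`, `wants_pli` and `encode_pli` are hypothetical, and the bit layout follows my reading of the Power ISA 3.1 MLS:D-form (`pli RT,SI` being `paddi RT,0,SI` with R=0), so it should be checked against assembler output before use.

```c
#include <stdint.h>

/* Nonzero if v is representable as a signed 'bits'-bit integer.
 * Valid for 2 <= bits <= 63. */
static int fits_signed(int64_t v, unsigned bits)
{
    int64_t lim = (int64_t)1 << (bits - 1);
    return v >= -lim && v < lim;
}

/* Nonzero for values that fit signed 34 bits but not signed 32 bits. */
static int wants_pli(int64_t v)
{
    return fits_signed(v, 34) && !fits_signed(v, 32);
}

/* Hypothetical encoder for "pli rt, imm" (paddi rt,0,imm with R=0).
 * Assumed layout: the prefix word holds primary opcode 1, form type 2
 * (MLS) and the upper 18 immediate bits; the suffix word is an addi
 * (opcode 14) with RA=0 and the lower 16 immediate bits. */
static void encode_pli(unsigned rt, int64_t imm,
                       uint32_t *prefix, uint32_t *suffix)
{
    uint64_t u = (uint64_t)imm;
    *prefix = 0x06000000u | (uint32_t)((u >> 16) & 0x3FFFFu);
    *suffix = (14u << 26) | ((rt & 31u) << 21) | (uint32_t)(u & 0xFFFFu);
}
```

For example, `wants_pli(0x100000000)` is true because 2^32 needs 34 signed bits, while `wants_pli(0x7FFFFFFF)` is false.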

Encoding cross-checked byte-for-byte against GNU as -mpower10. Builds
clean at -mcpu=power8/9/10 on ppc64le.
@runlevel5
Contributor Author

Dropped because after thorough testing the gain is too marginal. In some cases it is even worse.

@runlevel5 runlevel5 closed this May 8, 2026
@ec-
Owner

ec- commented May 8, 2026

> Dropped because after thorough testing the gain is too marginal. In some cases it is even worse.

How did you measure the gains: by size or by performance?

@runlevel5
Contributor Author

Yes, I benchmarked it (built with GCC, Clang, and the IBM XL C/C++ compiler). Some iterations gain about 1.4% +/- 0.5, and in some cases I see a regression. On POWER10 with -O3 and the fast-math option, the compiler optimizes the code so aggressively that our hand-rolled assembly no longer pays off. I also experimented a bit with the IBM MASS library; again, the gain is too marginal to be worth the additional complexity.
