@@ -3,7 +3,6 @@ Title: Frame Pointers Everywhere: Enabling System-Level Observability for Python
 Author: Pablo Galindo Salgado <[email protected]>,
         Savannah Ostrowski <[email protected]>,
-
 Discussions-To:
 Status: Draft
 Type: Standards Track
@@ -405,18 +404,18 @@ The JIT Compiler Needs Frame Pointers to Be Debuggable
 ------------------------------------------------------

 CPython's copy-and-patch JIT (:pep:`744`) generates native machine code at
-runtime. Without frame pointers in the interpreter, stack unwinding through
+runtime. Without reserved frame pointers in the JIT code, stack unwinding through
 JIT frames is broken for virtually every tool in the ecosystem: GDB, LLDB,
 libunwind, libdw (elfutils), py-spy, Austin, pystack, memray, ``perf``, and
 all eBPF-based profilers. Ensuring full-stack observability for JIT-compiled
 code is a prerequisite for the JIT to be considered production-ready.

 Individual JIT stencils do not need frame-pointer prologues; the entire JIT
 region can be treated as a single frameless region for unwinding purposes.
-What matters is that the interpreter itself is built with frame pointers, so
+What matters is that the JIT itself must reserve frame pointers, so
 that the frame-pointer register (``%rbp`` on x86-64, ``x29`` on AArch64) is
 reserved and not clobbered by stencil code. With frame pointers in the
-interpreter, unwinders can walk through JIT regions without needing to inspect
+JIT, most unwinders can walk through JIT regions without needing to inspect
 individual stencils. This is a remarkably good outcome compared to other
 JIT compilers (V8, LuaJIT, .NET CoreCLR, Julia, LLVM's ORC JIT), which
 typically require hundreds to thousands of lines of code to implement custom
@@ -840,14 +839,14 @@ pyperformance JSON files can be found in
 ===================================== =======================
 Machine                               Geometric mean overhead
 ===================================== =======================
-Apple M2 Mac Mini (arm64)             1.01x slower
-Intel Xeon Platinum 8480 (x86-64)     1.01x slower
-AMD EPYC 9654 (x86-64)                1.01x slower
-AWS Graviton c7g.16xlarge (aarch64)   1.02x slower
-Ampere Altra Max (aarch64)            1.01x slower
-Raspberry Pi (aarch64)                1.00x slower
-macOS M3 Pro (arm64)                  1.00x slower
-Intel i7 12700H (x86-64)              1.02x slower
+Apple M2 Mac Mini (arm64)             1.006x slower
+macOS M3 Pro (arm64)                  1.001x slower
+Raspberry Pi (aarch64)                1.002x slower
+Ampere Altra Max (aarch64)            1.020x slower
+AWS Graviton c7g.16xlarge (aarch64)   1.027x slower
+Intel i7 12700H (x86-64)              1.019x slower
+AMD EPYC 9654 (x86-64)                1.008x slower
+Intel Xeon Platinum 8480 (x86-64)     1.006x slower
 ===================================== =======================

 This overhead applies to both the interpreter and to C extensions that inherit
@@ -1048,7 +1047,20 @@ Footnotes
 Appendix
 ========

-# TODO: KJ, once we have Diego's results.
+In all graphs below, the green dots are geometric means of the individual
+benchmark medians, while the orange lines are the medians of our data
+points. Hollow circles represent outliers.
+
+The first graph shows the overall effect on pyperformance for each system.
+Apart from the Ubuntu AWS Graviton system, all system configurations show
+a geometric mean and median slowdown below 2%:
+
+.. image:: pep-0830_perf_over_baseline.svg
+
+For individual benchmark results, see the following:
+
+.. image:: pep-0830_perf_over_baseline_indiv.svg
+

 Copyright
 =========