Speed up `fmpz_mat_charpoly` by fredrik-johansson · Pull Request #2691 · flintlib/flint

fredrik-johansson · 2026-05-20T12:26:21Z

Summary:

We add fmpz_mat_charpoly_bound and use this in fmpz_mat_charpoly_modular. Instead of the old crude bound based on the height of the matrix, we compute Hadamard bounds for the determinants in the trace sum formula for the charpoly. This still costs only $O(n^2)$, gives a slightly better bound for uniform matrices, and gives dramatically better bounds for sparse and non-uniform matrices.
We select appropriately between Berkowitz and the modular algorithm for small matrices in fmpz_mat_charpoly (previously the modular algorithm was always used for $n \ge 4$), speeding up small matrices and matrices with huge entries.
The modular algorithm is reimplemented using asymptotically fast modular reduction and CRT, so the complexity scales quasi-linearly with the bit size of the output and not quasi-quadratically.
We make the modular algorithm multithreaded.
Add helper function _fmpz_vec_multi_CRT_ui.

Part of the code was written with the help of Claude.

Not done: detecting special cases (other than the zero matrix). One could easily detect triangular and Hessenberg matrices and handle these specially. I think nilpotent and single-eigenvalue (charpoly $(x-c)^n$) matrices could also be detected and certified faster than the general algorithm (with negligible overhead for generic input). Other cases where the characteristic polynomial is very sparse and has much smaller coefficients than the generic bound might also be doable with early termination based on direct verification, though this gets a bit more complicated (especially to ensure that this is actually only done when it is faster and that the generic case doesn't slow down).

Timings for uniform random matrices with randbits(bits) entries, both single-threaded and multithreaded. Note that "old" includes the preexisting 30% speedup of the modular algorithm from #2684.

                              1 thread                        8 threads
     n      bits  |       old        new   speedup |       old        new   speedup |

     4        16  |  1.52e-06   4.58e-07    3.319  |   1.47e-06   4.54e-07    3.238 |
     4        64  |  3.82e-06   1.39e-06    2.748  |   3.89e-06   1.38e-06    2.819 |
     4       256  |  1.61e-05   1.94e-06    8.299  |   1.61e-05   1.94e-06    8.299 |
     4      1024  |  0.000125   1.22e-05   10.246  |   0.000122   1.22e-05   10.000 |
     4      4096  |   0.00126   0.000119   10.588  |    0.00126   0.000119   10.588 |
     4     16384  |     0.014   0.000998   14.028  |      0.014      0.001   14.000 |
     4     65536  |      0.19    0.00482   39.419  |      0.188    0.00372   50.538 |
     4    262144  |     2.905     0.0189  153.704  |      2.889     0.0065  444.462 |

     8        16  |  8.59e-06    6.7e-06    1.282  |   8.56e-06   6.64e-06    1.289 |
     8        64  |  2.74e-05   1.48e-05    1.851  |   2.74e-05   1.47e-05    1.864 |
     8       256  |  0.000118   3.93e-05    3.003  |   0.000118   3.95e-05    2.987 |
     8      1024  |   0.00097   0.000382    2.539  |   0.000957   0.000382    2.505 |
     8      4096  |   0.00815    0.00395    2.063  |     0.0081    0.00394    2.056 |
     8     16384  |    0.0947     0.0323    2.932  |     0.0947     0.0331    2.861 |
     8     65536  |     1.363      0.126   10.817  |      1.362      0.067   20.328 |
     8    262144  |     21.55      0.519   41.522  |     21.571      0.156  138.276 |

    16        16  |  5.42e-05   5.67e-05    0.956  |    5.4e-05   3.34e-05    1.617 |
    16        64  |  0.000196   0.000182    1.077  |   0.000195    5.8e-05    3.362 |
    16       256  |   0.00119    0.00106    1.123  |    0.00117    0.00025    4.680 |
    16      1024  |   0.00737    0.00654    1.127  |    0.00736    0.00207    3.556 |
    16      4096  |    0.0606     0.0484    1.252  |     0.0606     0.0153    3.961 |
    16     16384  |     0.679      0.335    2.027  |      0.676       0.11    6.145 |
    16     65536  |    10.755      3.622    2.969  |     10.759      1.465    7.344 |

    32        16  |  0.000614   0.000608    1.010  |   0.000609   0.000269    2.264 |
    32        64  |   0.00233    0.00219    1.064  |    0.00234    0.00045    5.200 |
    32       256  |    0.0122    0.00981    1.244  |     0.0106     0.0023    4.609 |
    32      1024  |    0.0639     0.0524    1.219  |     0.0638     0.0123    5.187 |
    32      4096  |     0.508      0.368    1.380  |      0.505      0.093    5.430 |
    32     16384  |     5.873      2.452    2.395  |      5.837      0.515   11.334 |
    32     65536  |    87.079     17.669    4.928  |     87.046      3.587   24.267 |

    64        16  |    0.0116    0.00726    1.598  |    0.00804     0.0023    3.496 |
    64        64  |    0.0294     0.0284    1.035  |     0.0296     0.0056    5.286 |
    64       256  |     0.148      0.109    1.358  |      0.127      0.023    5.522 |
    64      1024  |      0.69      0.547    1.261  |      0.688      0.108    6.370 |
    64      4096  |      4.75      3.324    1.429  |      4.692      0.575    8.160 |
    64     16384  |    54.059     20.722    2.609  |     54.868      3.627   15.128 |

   128        16  |     0.176      0.125    1.408  |      0.134      0.022    6.091 |
   128        64  |     0.472       0.46    1.026  |      0.469      0.088    5.330 |
   128       256  |     1.894      1.708    1.109  |      1.852      0.279    6.638 |
   128      1024  |     8.795      7.571    1.162  |      8.717      1.231    7.081 |
   128      4096  |    63.556      39.46    1.611  |     63.049      6.479    9.731 |

   256        16  |     2.437       2.23    1.093  |      2.407       0.41    5.871 |
   256        64  |     8.202      7.697    1.066  |      8.145      1.232    6.611 |
   256       256  |    31.272     29.737    1.052  |     31.442      4.745    6.626 |

Example timings for sparse matrices: the input is the companion matrix of a polynomial with randbits(bits) entries, plus an extra 1 in position (0, 0) of the matrix.

                              1 thread                        8 threads
     n      bits  |       old        new   speedup |       old        new   speedup |

     4        16  |  1.44e-06   3.87e-07    3.721  |   1.39e-06   3.81e-07    3.648 |
     4        64  |  3.62e-06    7.5e-07    4.827  |   3.64e-06   7.55e-07    4.821 |
     4       256  |  1.45e-05   9.01e-07   16.093  |   1.51e-05   9.01e-07   16.759 |
     4      1024  |  9.22e-05   3.18e-06   28.994  |    9.2e-05   3.18e-06   28.931 |
     4      4096  |  0.000965   2.69e-05   35.874  |   0.000962   2.68e-05   35.896 |
     4     16384  |   0.00908    0.00021   43.238  |     0.0089   0.000207   42.995 |
     4     65536  |      0.11   0.000999  110.110  |       0.11    0.00089  123.596 |
     4    262144  |      1.47    0.00452  325.221  |      1.463    0.00155  943.871 |

     8        16  |  8.25e-06      3e-06    2.750  |   8.19e-06   2.97e-06    2.758 |
     8        64  |   2.7e-05   4.16e-06    6.490  |   2.59e-05   4.09e-06    6.333 |
     8       256  |  0.000106   5.45e-06   19.450  |   0.000103   5.42e-06   19.004 |
     8      1024  |  0.000663   2.67e-05   24.831  |   0.000668   2.64e-05   25.303 |
     8      4096  |   0.00517   0.000226   22.876  |    0.00518   0.000227   22.819 |
     8     16384  |    0.0486    0.00178   27.303  |     0.0494    0.00183   26.995 |
     8     65536  |     0.621    0.00715   86.853  |      0.595     0.0044  135.227 |
     8    262144  |    10.553     0.0308  342.630  |     10.216     0.0101 1011.485 |

    16        16  |  4.15e-05   1.06e-05    3.915  |   4.15e-05   1.04e-05    3.990 |
    16        64  |  0.000151   2.06e-05    7.330  |   0.000151    2.2e-05    6.864 |
    16       256  |  0.000874   4.34e-05   20.138  |    0.00087   2.42e-05   35.950 |
    16      1024  |   0.00487   0.000148   32.905  |    0.00491    5.1e-05   96.275 |
    16      4096  |    0.0305    0.00089   34.270  |     0.0304   0.000491   61.914 |
    16     16384  |     0.334    0.00575   58.087  |      0.325     0.0034   95.588 |
    16     65536  |     4.558     0.0524   86.985  |       4.57     0.0262  174.427 |
    16    262144  |     66.04       0.23  287.130  |     65.717      0.074  888.068 |

    32        16  |  0.000373   4.27e-05    8.735  |   0.000368   4.23e-05    8.700 |
    32        64  |   0.00154   7.89e-05   19.518  |    0.00153   5.84e-05   26.199 |
    32       256  |   0.00667   0.000178   37.472  |     0.0066    7.2e-05   91.667 |
    32      1024  |    0.0324   0.000702   46.154  |     0.0322    0.00033   97.576 |
    32      4096  |     0.258    0.00305   84.590  |      0.224     0.0006  373.333 |
    32     16384  |     2.675     0.0157  170.382  |      2.664     0.0065  409.846 |
    32     65536  |     29.96     0.0733  408.731  |     30.033      0.036  834.250 |

    64        16  |   0.00413   0.000383   10.783  |     0.0041   0.000242   16.942 |
    64        64  |    0.0141    0.00057   24.737  |     0.0137   0.000262   52.290 |
    64       256  |    0.0575    0.00117   49.145  |      0.057    0.00044  129.545 |
    64      1024  |     0.286    0.00353   81.020  |      0.274    0.00166  165.060 |
    64      4096  |     1.798     0.0133  135.188  |      1.773     0.0027  656.667 |
    64     16384  |    18.694     0.0603  310.017  |     18.775     0.0259  724.903 |

   128        16  |     0.067    0.00478   14.017  |     0.0669      0.002   33.450 |
   128        64  |     0.227    0.00638   35.580  |      0.225    0.00207  108.696 |
   128       256  |     0.882     0.0111   79.459  |      0.868     0.0023  377.391 |
   128      1024  |     3.797     0.0302  125.728  |      3.781     0.0066  572.879 |
   128      4096  |    20.906      0.113  185.009  |     21.659      0.019 1139.947 |

   256        16  |     1.378     0.0809   17.033  |      1.372      0.024   57.167 |
   256        64  |     4.733      0.101   46.861  |      4.732      0.024  197.167 |
   256       256  |    17.785      0.146  121.815  |     17.316      0.081  213.778 |

…optimizations

edgarcosta · 2026-05-20T19:23:26Z

A few small typos I noticed in this PR:

Copyrigth → Copyright in the new copyright headers: charpoly.c L3 and charpoly_modular.c L3.
t-multi_CRT_ui.c L86-L87: the failure branch prints both vectors labelled A =; the second should be B =.
fmpz_mat.rst L882: "The default algorithm handles for various special cases" has a stray "for".

fredrik-johansson added 3 commits May 20, 2026 11:53

Add fmpz_mat_charpoly_bound

7713a7f

Rewrite fmpz_mat_charpoly_modular and fmpz_mat_charpoly with several …

e5b852c

…optimizations

Document _fmpz_vec_multi_CRT_ui

53ed8bb

Fix typos

1c901f9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Speed up `fmpz_mat_charpoly`#2691

Speed up `fmpz_mat_charpoly`#2691
fredrik-johansson wants to merge 4 commits into
flintlib:mainfrom
fredrik-johansson:charpoly

fredrik-johansson commented May 20, 2026 •

edited

Loading

Uh oh!

edgarcosta commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

fredrik-johansson commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

edgarcosta commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fredrik-johansson commented May 20, 2026 •

edited

Loading