syncer: scale history buffer with visible memory by okJiang · Pull Request #10696 · tikv/pd

okJiang · 2026-05-20T07:44:29Z

What problem does this PR solve?

Issue Number: Close #10692, ref #10666, ref #10668

What is changed and how does it work?

Scale the region syncer history buffer size from the visible memory capacity
instead of using a fixed 10000-entry default.

The new sizing keeps 10000 entries as the minimum, allocates another 10000
entries per 4 GiB of visible memory, and caps the buffer at 100000 entries.
The visible memory source is cgroup-aware through pkg/memory.
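
A minimal sketch of the sizing rule described above, not the code from this PR. The function name `historyBufferSizeFromMemory(totalMemory uint64) int` and the constants `maxHistoryBufferSize` and `historyBufferMemoryStep` are named in the review walkthrough; `defaultHistoryBufferSize` is an assumed name. The sketch also assumes that only whole 4 GiB steps are counted and that the result is clamped to the 10000-100000 range. An additive reading (10000 plus 10000 per step) or a fractional linear reading would give different intermediate values.

```go
package main

import "fmt"

const (
	// defaultHistoryBufferSize is an assumed name for the 10000-entry minimum.
	defaultHistoryBufferSize = 10000
	// maxHistoryBufferSize is the 100000-entry cap from the description.
	maxHistoryBufferSize = 100000
	// historyBufferMemoryStep is the 4 GiB step from the description.
	historyBufferMemoryStep uint64 = 4 << 30
)

// historyBufferSizeFromMemory returns 10000 entries per whole 4 GiB step of
// visible memory, clamped to [10000, 100000]. This is one reading of the PR
// description, written as a sketch rather than the merged implementation.
func historyBufferSizeFromMemory(totalMemory uint64) int {
	steps := totalMemory / historyBufferMemoryStep
	maxSteps := uint64(maxHistoryBufferSize / defaultHistoryBufferSize)
	if steps <= 1 {
		// Covers small hosts and a zero (unknown) memory reading.
		return defaultHistoryBufferSize
	}
	if steps >= maxSteps {
		return maxHistoryBufferSize
	}
	return int(steps) * defaultHistoryBufferSize
}

func main() {
	for _, gib := range []uint64{0, 2, 4, 8, 16, 40, 64} {
		fmt.Printf("%2d GiB -> %d entries\n", gib, historyBufferSizeFromMemory(gib<<30))
	}
}
```

Under these assumptions, 8 GiB of visible memory yields 20000 entries, and anything at or above 40 GiB is clamped to 100000.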

Add unit coverage for the sizing boundaries and linear scaling behavior.

Check List

Unit test

Release note

Scale PD region syncer history buffer capacity with visible memory, allocating 10,000 entries per 4 GiB and capping the buffer at 100,000 entries.

Summary by CodeRabbit

Improvements
- History buffer sizing now adjusts dynamically based on visible (cgroup-aware) system memory, with a lower bound of 10,000 entries and an upper bound of 100,000 entries.
Observability
- Added a runtime metric for history buffer size and more consistent history index metrics.
- Updated dashboard queries to better scope PD-sourced series and added a panel showing the history index gap and buffer size.
Tests
- Added unit tests validating buffer sizing across low, scaled, and upper-clamped memory scenarios.

ti-chi-bot · 2026-05-20T07:44:34Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bufferflies for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

coderabbitai · 2026-05-20T07:44:43Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

📝 Walkthrough

The syncer now sizes its history buffer from visible memory (clamped between 10000 and 100000 slots), uses that size when creating history buffers, centralizes history index metric updates and adds a historyBufferSize metric, updates Grafana panels, and adds unit tests for sizing.

Changes

Dynamic History Buffer Sizing and Metrics

Layer / File(s)	Summary
Dynamic buffer sizing and tests `pkg/syncer/server.go`, `pkg/syncer/server_test.go`	Imports `memory`, adds `maxHistoryBufferSize` and `historyBufferMemoryStep`, implements `historyBufferSizeFromMemory(totalMemory uint64) int`, uses it in `NewRegionSyncer`, and adds table-driven tests validating default, scaled, and clamped outputs.
History buffer metrics updates `pkg/syncer/history_buffer.go`	Adds `historyBufferSizeGauge`, sets it during `newHistoryBuffer`, refactors index gauge updates into `updateHistoryIndexMetrics()` and calls it on load/record/reset, and simplifies `persist()` to only save next history index.
Grafana panels for history observability `metrics/grafana/pd.json`	Scopes `pd_region_syncer_status` selectors to `job=~".*pd.*"` for existing Syncer index and History last index panels, and adds panel id 1503 (“History index gap”) plotting max(last_index)-min(sync_index) (filtered to `job=~".*pd.*"`) alongside the `history_buffer_size` series.
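
The `job=~".*pd.*"` selector in the dashboard diff depends on Prometheus regex matchers being fully anchored: a bare `pd` pattern would only match a job label that is exactly `pd`. The sketch below imitates that anchoring with Go's `regexp` package. The job label values are hypothetical and chosen only for illustration, and `matchesSelector` is a helper invented for this example, not part of PD or Prometheus.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchesSelector imitates a Prometheus =~ label matcher, which must match
// the whole label value, by anchoring the pattern at both ends.
func matchesSelector(pattern, value string) bool {
	return regexp.MustCompile("^(?:" + pattern + ")$").MatchString(value)
}

func main() {
	// Hypothetical job label values, for illustration only.
	jobs := []string{"pd", "tidb-cluster-pd", "tikv"}
	for _, job := range jobs {
		fmt.Printf("%-16s .*pd.*=%v pd=%v\n",
			job, matchesSelector(".*pd.*", job), matchesSelector("pd", job))
	}
}
```

With the wildcards, any job label containing `pd` is kept, while unrelated jobs exposing the same metric name are filtered out.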

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

rleungx
bufferflies
lhy1024

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately summarizes the main change: dynamically scaling the history buffer with visible memory instead of using a fixed default.
Linked Issues check	✅ Passed	All coding requirements from issue `#10692` are met: buffer sizing scales with visible memory (10K per 4GiB), clamped to 10K-100K range, respects cgroup limits via pkg/memory, and includes unit test coverage for boundaries and scaling behavior.
Out of Scope Changes check	✅ Passed	Changes are appropriately scoped: pkg/syncer files implement the buffer sizing logic, tests validate the new function, and Grafana dashboard updates visualize related metrics without introducing unrelated functionality.
Description check	✅ Passed	The PR description includes all required sections: problem statement with issue linking, detailed explanation of changes and implementation, unit test coverage checkbox, and a properly formatted release note.

Signed-off-by: okjiang <819421878@qq.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/syncer/server_test.go`:
- Around lines 54-56: The subtests for historyBufferSizeFromMemory use the
outer-scope require (re), which calls FailNow on the outer *testing.T and aborts
the entire table when one case fails. Update each subtest to accept its own
*testing.T (change func(_ *testing.T) to func(t *testing.T)) and use a
subtest-local require (e.g., re := require.New(t)), or call
require.Equal(t, ...) directly, so that each case fails independently and does
not stop the others from running.

📥 Commits

Reviewing files that changed from the base of the PR and between 68b43f6 and a0e06a2.

📒 Files selected for processing (1)

pkg/syncer/server_test.go

okJiang · 2026-05-21T08:00:57Z

Checked the pull-unit-test-next-gen-3 failure on ba719b8. The failure is in resource manager integration tests (TestKeyspaceServiceLimit and TestWatchResourceGroup), not in the syncer or Grafana changes in this PR.

I pushed 1a70cf8 for the new dashboard review comments, so CI should rerun on the latest commit.

okJiang · 2026-05-21T08:12:19Z

Checked the latest pull-unit-test-next-gen-2 failure on 1a70cf8. It failed in TestAffinityHandlerTestSuite/TestAffinityListWithIDs at tests/server/apiv2/handlers/affinity_test.go:861 (map[] had 0 items instead of 2). This is in the affinity handler tests and is unrelated to this PR's changes, which touch syncer sizing/metrics and Grafana panels.

Requesting a rerun for the required failed checks.

/retest-required

okJiang · 2026-05-21T08:28:32Z

Checked the latest required test failures on 1a70cf8.

pull-unit-test-next-gen-3 failed in TestUpgradingPDAndTSOClusters with an etcd cluster ID mismatch while starting test servers.
pull-unit-test-next-gen-2 failed in TestForwardTestSuite with a 5m timeout while starting the test cluster.

Both failures are outside the files touched by this PR (pkg/syncer/* and metrics/grafana/pd.json) and do not point to this change. Requesting another rerun.

/retest-required

okJiang · 2026-05-21T08:44:48Z

Checked the latest pull-unit-test-next-gen-3 failure on 1a70cf8. The Prow job failed in tests/integrations/mcs/router (TestServerTestSuite/TestBasicSync and TestRegionAPI).

This does not require a code change in this PR: the PR only changes syncer history buffer sizing/metrics and the Grafana panel, and the same two narrow router tests passed locally with failpoints enabled:

go test ./mcs/router -run TestServerTestSuite/TestBasicSync -count=1
go test ./mcs/router -run TestServerTestSuite/TestRegionAPI -count=1

Requesting a rerun for the required failed check.

/retest-required

okJiang · 2026-05-21T08:58:45Z

Checked the latest pull-unit-test-next-gen-3 failure on 1a70cf8. The job failed in tests/integrations/client:

TestClientStatelessTestSuite/TestGetStore saw a store NodeState mismatch.
TestHTTPClientTestSuite/TestRedirectWithMetrics failed while the local test cluster was losing etcd leadership.

These failures are outside this PR's changes (pkg/syncer/* and metrics/grafana/pd.json) and do not point to the history buffer sizing or dashboard update. The two narrow nextgen checks passed locally with failpoints enabled:

go test -tags nextgen ./client -run TestClientStatelessTestSuite/TestGetStore -count=1
go test -tags nextgen ./client -run TestHTTPClientTestSuite/TestRedirectWithMetrics -count=1

Requesting another rerun for the required failed check.

/retest-required

okJiang · 2026-05-21T10:37:56Z

          "targets": [
            {
-              "expr": "pd_region_syncer_status{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", type=\"sync_index\"}",
+              "expr": "pd_region_syncer_status{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*pd.*\", type=\"sync_index\"}",


Should we just delete the Sync index and History index directly? They don't seem to have much of a reason to exist.

What do you think? @rleungx

ti-chi-bot · 2026-05-21T10:45:41Z

@okJiang: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-unit-test-next-gen-3	`79a57b4`	link	true	`/test pull-unit-test-next-gen-3`

codecov · 2026-05-21T10:46:15Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.07%. Comparing base (94df6da) to head (79a57b4).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master   #10696      +/-   ##
==========================================
+ Coverage   79.03%   79.07%   +0.04%     
==========================================
  Files         536      536              
  Lines       73103    73225     +122     
==========================================
+ Hits        57777    57904     +127     
+ Misses      11234    11224      -10     
- Partials     4092     4097       +5

Flag	Coverage Δ
unittests	`79.07% <100.00%> (+0.04%)`	⬆️

syncer: scale history buffer with visible memory

68b43f6

ti-chi-bot Bot added the release-note and dco-signoff: yes labels May 20, 2026

ti-chi-bot Bot added the size/M label May 20, 2026

syncer: fix history buffer static lint

a0e06a2

Signed-off-by: okjiang <819421878@qq.com>

coderabbitai Bot reviewed May 21, 2026

Comment thread pkg/syncer/server_test.go Outdated