syncer: scale history buffer with visible memory #10696

Open
okJiang wants to merge 5 commits into tikv:master from okJiang:codex/history-buffer-size-by-memory

@okJiang
Member

okJiang commented May 20, 2026

What problem does this PR solve?

Issue Number: Close #10692, ref #10666, ref #10668

What is changed and how does it work?

Scale the region syncer history buffer size from the visible memory capacity
instead of using a fixed 10000-entry default.

The new sizing keeps 10000 entries as the minimum, allocates another 10000
entries per 4 GiB of visible memory, and caps the buffer at 100000 entries.
The visible memory source is cgroup-aware through pkg/memory.
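
A minimal sketch of the sizing rule described above, not the code in this PR. It assumes the result is the number of whole 4 GiB steps times 10000, clamped to [10000, 100000]; whether the real implementation rounds down or adds the steps on top of the base 10000 is not stated here. The names maxHistoryBufferSize, historyBufferMemoryStep, and historyBufferSizeFromMemory appear in the PR; defaultHistoryBufferSize is an assumed name for the existing 10000 default, and the memory value is passed in directly rather than read through pkg/memory.

```go
package main

import "fmt"

const (
	// Assumed name for the former fixed size, used here as the lower bound.
	defaultHistoryBufferSize = 10000
	// Upper bound on the number of history entries.
	maxHistoryBufferSize = 100000
	// One scaling step of visible memory: 4 GiB.
	historyBufferMemoryStep uint64 = 4 << 30
)

// historyBufferSizeFromMemory maps visible memory (in bytes) to a buffer size.
// Assumption: whole 4 GiB steps only, clamped to [default, max].
func historyBufferSizeFromMemory(totalMemory uint64) int {
	steps := totalMemory / historyBufferMemoryStep
	maxSteps := uint64(maxHistoryBufferSize / defaultHistoryBufferSize)
	if steps >= maxSteps {
		return maxHistoryBufferSize
	}
	if steps <= 1 {
		return defaultHistoryBufferSize
	}
	return int(steps) * defaultHistoryBufferSize
}

func main() {
	const gib = uint64(1) << 30
	for _, mem := range []uint64{2 * gib, 8 * gib, 16 * gib, 64 * gib} {
		fmt.Printf("%d GiB -> %d entries\n", mem/gib, historyBufferSizeFromMemory(mem))
	}
}
```

Under these assumptions, 2 GiB stays at the 10000 minimum, 8 GiB gives 20000, 16 GiB gives 40000, and anything from 40 GiB upward is capped at 100000.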

Add unit coverage for the sizing boundaries and linear scaling behavior.

Check List

  • Unit test

Release note

Scale PD region syncer history buffer capacity with visible memory, allocating 10,000 entries per 4 GiB and capping the buffer at 100,000 entries.

Summary by CodeRabbit

  • Improvements

    • History buffer sizing now adjusts dynamically based on available system memory with sensible lower and upper bounds for better resource use and stability.
  • Observability

    • Added a runtime metric for history buffer size and more consistent history index metrics.
    • Updated dashboard queries to better scope PD-sourced series and added a panel showing the history index gap and buffer size.
  • Tests

    • Added unit tests validating buffer sizing across low, scaled, and upper-clamped memory scenarios.

ti-chi-bot Bot added the release-note and dco-signoff: yes labels May 20, 2026
@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 20, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign bufferflies for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented May 20, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The syncer now sizes its history buffer from visible memory (clamped between 10000 and 100000 slots), uses that size when creating history buffers, centralizes history index metric updates and adds a historyBufferSize metric, updates Grafana panels, and adds unit tests for sizing.

Changes

Dynamic History Buffer Sizing and Metrics

| Layer / File(s) | Summary |
| --- | --- |
| Dynamic buffer sizing and tests: pkg/syncer/server.go, pkg/syncer/server_test.go | Imports memory, adds maxHistoryBufferSize and historyBufferMemoryStep, implements historyBufferSizeFromMemory(totalMemory uint64) int, uses it in NewRegionSyncer, and adds table-driven tests validating default, scaled, and clamped outputs. |
| History buffer metrics updates: pkg/syncer/history_buffer.go | Adds historyBufferSizeGauge, sets it during newHistoryBuffer, refactors index gauge updates into updateHistoryIndexMetrics() and calls it on load/record/reset, and simplifies persist() to only save next history index. |
| Grafana panels for history observability: metrics/grafana/pd.json | Scopes pd_region_syncer_status selectors to job=~".*pd.*" for existing Syncer index and History last index panels, and adds panel id 1503 (“History index gap”) plotting max(last_index)-min(sync_index) (filtered to job=~".*pd.*") alongside the history_buffer_size series. |
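
The gap expression used by the new panel, max(last_index)-min(sync_index), can be illustrated offline with a small sketch. The function name historyIndexGap and the sample index values are hypothetical. Reading a gap larger than the buffer size as "the slowest follower has fallen outside the retained history" is an assumption made here for illustration, not something the PR states.

```go
package main

import "fmt"

// historyIndexGap mirrors the panel expression max(last_index) - min(sync_index).
// It returns 0 when either input is empty or when no follower is behind.
func historyIndexGap(lastIndexes, syncIndexes []uint64) uint64 {
	if len(lastIndexes) == 0 || len(syncIndexes) == 0 {
		return 0
	}
	maxLast := lastIndexes[0]
	for _, v := range lastIndexes[1:] {
		if v > maxLast {
			maxLast = v
		}
	}
	minSync := syncIndexes[0]
	for _, v := range syncIndexes[1:] {
		if v < minSync {
			minSync = v
		}
	}
	if minSync >= maxLast {
		return 0
	}
	return maxLast - minSync
}

func main() {
	// Made-up sample values: one last_index series and two sync_index series.
	gap := historyIndexGap([]uint64{120500}, []uint64{118000, 112300})
	bufferSize := uint64(10000)
	fmt.Printf("gap=%d bufferSize=%d withinBuffer=%t\n", gap, bufferSize, gap <= bufferSize)
}
```

With these sample values the gap is 8200, which is below a 10000-entry buffer.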

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • rleungx
  • bufferflies
  • lhy1024

Poem

🐰 I count the hops where histories keep,

From tiny roots to memory deep.
When heaps expand I stretch my net,
So lost indices we never get.
A whiskered cheer — the sync is neat.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 20.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main change: dynamically scaling the history buffer with visible memory instead of using a fixed default. |
| Linked Issues check | ✅ Passed | All coding requirements from issue #10692 are met: buffer sizing scales with visible memory (10K per 4GiB), clamped to 10K-100K range, respects cgroup limits via pkg/memory, and includes unit test coverage for boundaries and scaling behavior. |
| Out of Scope Changes check | ✅ Passed | Changes are appropriately scoped: pkg/syncer files implement the buffer sizing logic, tests validate the new function, and Grafana dashboard updates visualize related metrics without introducing unrelated functionality. |
| Description check | ✅ Passed | The PR description includes all required sections: problem statement with issue linking, detailed explanation of changes and implementation, unit test coverage checkbox, and a properly formatted release note. |

ti-chi-bot Bot added the size/M label May 20, 2026
Signed-off-by: okjiang <819421878@qq.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/syncer/server_test.go`:
- Around line 54-56: The subtests use the outer-scope require (re) which calls
FailNow on the outer *testing.T and aborts the entire table when one case fails;
update the subtest to accept its own *testing.T (change func(_ *testing.T) to
func(t *testing.T)) and use a subtest-local require (e.g., re := require.New(t))
or call require.Equal(t, ...) directly so each subtest (testing
historyBufferSizeFromMemory) fails independently and doesn't stop other cases
from running.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ba1e49ff-e661-4598-9363-04bba2810d97

📥 Commits

Reviewing files that changed from the base of the PR and between 68b43f6 and a0e06a2.

📒 Files selected for processing (1)
  • pkg/syncer/server_test.go

Comment thread pkg/syncer/server_test.go Outdated
Comment thread pkg/syncer/server.go
Signed-off-by: okjiang <819421878@qq.com>
ti-chi-bot Bot added the size/L label and removed the size/M label May 21, 2026
Comment thread metrics/grafana/pd.json
Comment thread metrics/grafana/pd.json Outdated
Signed-off-by: okjiang <819421878@qq.com>
@okJiang
Member Author

okJiang commented May 21, 2026

Checked the pull-unit-test-next-gen-3 failure on ba719b8. The failure is in resource manager integration tests (TestKeyspaceServiceLimit and TestWatchResourceGroup), not in the syncer or Grafana changes in this PR.

I pushed 1a70cf8 for the new dashboard review comments, so CI should rerun on the latest commit.

@okJiang
Member Author

okJiang commented May 21, 2026

Checked the latest pull-unit-test-next-gen-2 failure on 1a70cf8. It failed in TestAffinityHandlerTestSuite/TestAffinityListWithIDs at tests/server/apiv2/handlers/affinity_test.go:861 (map[] had 0 items instead of 2). This is in the affinity handler tests and is unrelated to this PR's changes, which touch syncer sizing/metrics and Grafana panels.

Requesting a rerun for the required failed checks.

/retest-required

@okJiang
Member Author

okJiang commented May 21, 2026

Checked the latest required test failures on 1a70cf8.

  • pull-unit-test-next-gen-3 failed in TestUpgradingPDAndTSOClusters with an etcd cluster ID mismatch while starting test servers.
  • pull-unit-test-next-gen-2 failed in TestForwardTestSuite with a 5m timeout while starting the test cluster.

Both failures are outside the files touched by this PR (pkg/syncer/* and metrics/grafana/pd.json) and do not point to this change. Requesting another rerun.

/retest-required

@okJiang
Member Author

okJiang commented May 21, 2026

Checked the latest pull-unit-test-next-gen-3 failure on 1a70cf8. The Prow job failed in tests/integrations/mcs/router (TestServerTestSuite/TestBasicSync and TestRegionAPI).

This does not require a code change in this PR: the PR only changes syncer history buffer sizing/metrics and the Grafana panel, and the same two narrow router tests passed locally with failpoints enabled:

  • go test ./mcs/router -run TestServerTestSuite/TestBasicSync -count=1
  • go test ./mcs/router -run TestServerTestSuite/TestRegionAPI -count=1

Requesting a rerun for the required failed check.

/retest-required

@okJiang
Member Author

okJiang commented May 21, 2026

Checked the latest pull-unit-test-next-gen-3 failure on 1a70cf8. The job failed in tests/integrations/client:

  • TestClientStatelessTestSuite/TestGetStore saw a store NodeState mismatch.
  • TestHTTPClientTestSuite/TestRedirectWithMetrics failed while the local test cluster was losing etcd leadership.

These failures are outside this PR's changes (pkg/syncer/* and metrics/grafana/pd.json) and do not point to the history buffer sizing or dashboard update. The two narrow nextgen checks passed locally with failpoints enabled:

  • go test -tags nextgen ./client -run TestClientStatelessTestSuite/TestGetStore -count=1
  • go test -tags nextgen ./client -run TestHTTPClientTestSuite/TestRedirectWithMetrics -count=1

Requesting another rerun for the required failed check.

/retest-required

Comment thread metrics/grafana/pd.json Outdated
Comment thread metrics/grafana/pd.json
Signed-off-by: okjiang <819421878@qq.com>
Comment thread metrics/grafana/pd.json
  "targets": [
  {
- "expr": "pd_region_syncer_status{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", type=\"sync_index\"}",
+ "expr": "pd_region_syncer_status{k8s_cluster=\"$k8s_cluster\", tidb_cluster=\"$tidb_cluster\", job=~\".*pd.*\", type=\"sync_index\"}",
Member Author

Should we just delete the Syncer index and History last index panels outright? They don't seem to have much reason to exist.

What do you think? @rleungx

@ti-chi-bot
Contributor

ti-chi-bot Bot commented May 21, 2026

@okJiang: The following test failed. Say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-3 | 79a57b4 | link | true | /test pull-unit-test-next-gen-3 |

@codecov

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.07%. Comparing base (94df6da) to head (79a57b4).
⚠️ Report is 4 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master   #10696      +/-   ##
==========================================
+ Coverage   79.03%   79.07%   +0.04%     
==========================================
  Files         536      536              
  Lines       73103    73225     +122     
==========================================
+ Hits        57777    57904     +127     
+ Misses      11234    11224      -10     
- Partials     4092     4097       +5     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 79.07% <100.00%> (+0.04%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

Labels

dco-signoff: yes, release-note, size/L

Projects

None yet

Development

Successfully merging this pull request may close these issues.

syncer: scale region history buffer size with visible memory

1 participant