Add peer tracking to merkle/sync#5415
Conversation
8fdc6dd to
c0cb54d
Compare
merkle/sync
c0cb54d to
f477c5e
Compare
f477c5e to
b2381be
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds peer performance tracking to database/merkle/sync by integrating p2p.PeerTracker, allowing the syncer to learn from slow/failed peers and prioritize better ones. It also updates the EVM/Firewood sync wiring to pass through the peer tracker and construct the appropriate p2p client.
Changes:
- Extend merkle sync configuration to accept an optional
*p2p.PeerTrackerand use it for peer selection and response/failure registration. - Update graft coreth/subnet-evm network facades to expose
PeerTracker()rather than bespoke response/failure registration methods. - Add a unit test verifying that
PeerTrackerdrives node selection when provided.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| graft/subnet-evm/network/network.go | Exposes the network’s PeerTracker() to consumers. |
| graft/coreth/network/network.go | Exposes the network’s PeerTracker() to consumers. |
| graft/evm/sync/engine/client.go | Passes PeerTracker into Firewood syncer config and constructs a dedicated p2p client via P2PNetwork().NewClient. |
| graft/evm/sync/client/test_network.go | Updates test network to satisfy the new PeerTracker() requirement. |
| graft/evm/sync/client/test_client.go | Updates Client interface conformance (now includes Network()), but currently returns nil. |
| graft/evm/sync/client/client.go | Switches request outcome tracking to use a held *p2p.PeerTracker instead of network callbacks; exposes Network(). |
| graft/evm/sync/client/BUILD.bazel | Adds new deps needed by updated test network code (prometheus/logging/set). |
| graft/evm/go.mod | Promotes prometheus/client_golang to a direct dependency (now imported). |
| database/merkle/sync/syncer.go | Adds PeerTracker to config and routes requests through it when provided (including bandwidth/failure recording). |
| database/merkle/sync/sync_test.go | Adds unit test ensuring PeerTracker drives peer selection. |
| database/merkle/sync/BUILD.bazel | Adds //utils/set dep for the new unit test. |
| database/merkle/firewood/syncer/syncer.go | Threads PeerTracker through to the underlying merkle sync config. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| nodeID, ok := pt.SelectPeer() | ||
| if !ok { | ||
| return errors.New("no peers available") | ||
| } |
| func (*TestClient) Network() Network { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
I think this is correct -- compare to AddClient right?
| network := c.config.Client.Network() | ||
| stateSyncer, err = evmstate.NewFirewoodSyncer( |
JonathanOppenheimer
left a comment
There was a problem hiding this comment.
This is an important PR! you should also probably determine if the copilot comments are BS or not.
| // Test_Sync_WithPeerTracker ensures that the peer tracker is used for sampling if provided. | ||
| func Test_Sync_WithPeerTracker(t *testing.T) { |
There was a problem hiding this comment.
I've been reading through the go style guide and recall that this is allowed
Test, Benchmark and Example function names within *_test.go files may include underscores.
but we never seemingly do this for other tests. Why did you do it here?
| if len(s.config.StateSyncNodes) == 0 { | ||
| return client.AppRequestAny(ctx, requestBytes, onResponse) | ||
| switch { | ||
| case len(s.config.StateSyncNodes) > 0: |
There was a problem hiding this comment.
It doesn't appear that you test the precedence or ordering for this new behavior in the switch here. We have TestStateSyncNodes and the new Test_Sync_WithPeerTracker, but would a small precedence test for this code here make sense?
| @@ -144,14 +144,18 @@ type Syncer[R any, C any] struct { | |||
|
|
|||
| // TODO remove non-config values out of this struct | |||
| type Config[R any, C any] struct { | |||
There was a problem hiding this comment.
I wish we would merge the struct ordering linter lol
| // RegisterResponse records a successful response from nodeID with the | ||
| // observed bandwidth (response bytes divided by request time). | ||
| RegisterResponse(nodeID ids.NodeID, bandwidth float64) | ||
|
|
||
| // RegisterFailure records a failed response from nodeID. | ||
| RegisterFailure(nodeID ids.NodeID) |
| s.metrics.RequestMade() | ||
| } | ||
|
|
||
| type appResponseCallbackWithCheck func(ctx context.Context, nodeID ids.NodeID, responseBytes []byte, err error) bool |
There was a problem hiding this comment.
maybe withResult instead? It's not really a check.
| } | ||
| } | ||
|
|
||
| func sendRequestWithPeerTracker(ctx context.Context, c *p2p.Client, pt *p2p.PeerTracker, requestBytes []byte, onResponse appResponseCallbackWithCheck) error { |
There was a problem hiding this comment.
also why isn't this on the syncer?
| func (*TestClient) Network() Network { | ||
| return nil | ||
| } |
There was a problem hiding this comment.
I think this is correct -- compare to AddClient right?
| } | ||
|
|
||
| // PeerTracker returns a new empty peer tracker. | ||
| func (*testNetwork) PeerTracker() *p2p.PeerTracker { |
There was a problem hiding this comment.
do you want to cache the tracker like in production? If a test calls PeerTracker() twice, they'd see different things.
| } | ||
|
|
||
| // Test_Sync_WithPeerTracker ensures that the peer tracker is used for sampling if provided. | ||
| func Test_Sync_WithPeerTracker(t *testing.T) { |
There was a problem hiding this comment.
Two more test ideas -- right now this only tests which node was selected. You write in the description about testing difficulties but you could
- Add two peers, return a malformed proof for one, force a retry, assert the second request goes to the other peer (SelectPeer SHOULD prefer the responsive one).
- No-peers error path: don't call
pt.Connected, and assertSync()fails with the expected error (maybe make "no peers available" a sentinel error?)
| func sendRequestWithPeerTracker(ctx context.Context, c *p2p.Client, pt *p2p.PeerTracker, requestBytes []byte, onResponse appResponseCallbackWithCheck) error { | ||
| nodeID, ok := pt.SelectPeer() | ||
| if !ok { | ||
| return errors.New("no peers available") |
There was a problem hiding this comment.
So if a node starts state sync before any peer connects to the firewood handler, wouldn't the sync abort permanently? Shouldn't we retry in this case? or do something else?
There was a problem hiding this comment.
It looks like the EVM client will sleep and then retry:
avalanchego/graft/evm/sync/client/client.go
Lines 319 to 394 in b2381be

Why this should be merged
We need some method of peer discovery for merkle sync. Many nodes will not implement the
p2p.Clientfor Firewood, or drop the request, or not store enough state, or even generate malicious proofs. In all of these cases, the syncer should learn from this mistake and prioritize other peers.How this works
coreth and subnet-evm already use a
p2p.PeerTrackerto track bandwidth, so we can use it with this state syncer as well.How this was tested
I added a new unit test to check if the peer tracker is being used, but I don't know what else can be done to ensure responses are logged correctly, since the peer tracker exports very little data.
Need to be documented in RELEASES.md?
No