Skip to content

bgp_ls_ted: fix issue of withdrawing edges#21490

Open
a114j0y wants to merge 1 commit intoFRRouting:masterfrom
a114j0y:master
Open

bgp_ls_ted: fix issue of withdrawing edges#21490
a114j0y wants to merge 1 commit intoFRRouting:masterfrom
a114j0y:master

Conversation

@a114j0y
Copy link
Copy Markdown

@a114j0y a114j0y commented Apr 9, 2026

  1. withdraw stale Edges before updating attr
  2. withdraw related Links before deleting the Vertex

1. withdraw stale Edges before updating attr
2. withdraw related Links before deleting the Vertex

Signed-off-by: Yijiao Qin <y.qin@alibaba-inc.com>
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Apr 9, 2026

Greptile Summary

This PR fixes two BGP-LS TED withdrawal gaps: it pre-withdraws outgoing/incoming edges and attached prefix NLRIs before a vertex is torn down (so source/destination pointers are still valid), and it pre-withdraws the old edge NLRI before an attribute UPDATE is applied in-place.

  • Spurious flog_err on vertex-deletion sequences: after vertex A is deleted, ls_vertex_del sets B→A->destination = NULL but leaves the edge in B's outgoing_edges. When B is later deleted the new loop calls bgp_ls_process_edge on that half-disconnected edge and logs a false EC_BGP_LS_PACKET error; adding if (e->source && e->destination) guards resolves this.
  • UPDATE with key change may leave stale NLRI: ls_find_edge_by_source is keyed on the new attributes; if the local-address component of the key changed, the old edge is not found and its BGP route is not withdrawn before the new one is advertised.

Confidence Score: 3/5

Safe to merge after the spurious-error and potential stale-NLRI concerns are addressed; the core withdrawal-before-delete logic is sound.

Two issues remain: a P1 spurious flog_err that fires on every half-disconnected edge during sequential vertex deletions (correct BGP behavior but noisy and misleading in production), and a P2 edge-key-change gap in the UPDATE withdraw path that can leave stale NLRIs on peers. The fundamental approach is correct and well-reasoned.

bgpd/bgp_ls_ted.c — specifically the outgoing/incoming edge loops inside the LS_MSG_EVENT_DELETE block and the ls_find_edge_by_source lookup in the LS_MSG_EVENT_UPDATE block.

Important Files Changed

Filename Overview
bgpd/bgp_ls_ted.c Adds pre-deletion BGP withdrawal of related edges/prefixes before vertex teardown, and a pre-update withdrawal of the old edge before attribute replacement; the vertex-deletion loops lack NULL guards that can trigger a spurious flog_err per half-disconnected edge, and the UPDATE-path key-change scenario may leave stale NLRIs in peers.

Sequence Diagram

sequenceDiagram
    participant Z as Zebra
    participant P as bgp_ls_process_message
    participant E as bgp_ls_process_edge
    participant V as bgp_ls_process_vertex
    participant T as ls_vertex_del_all

    Z->>P: NODE DELETE vertex A
    P->>P: ls_msg2vertex vertex A
    loop outgoing edges of A
        P->>E: bgp_ls_process_edge DELETE
        E-->>P: BGP withdraw sent
    end
    loop incoming edges of A
        P->>E: bgp_ls_process_edge DELETE
        E-->>P: BGP withdraw sent
    end
    P->>V: bgp_ls_process_vertex DELETE
    V-->>P: Node NLRI withdrawn
    P->>T: ls_vertex_del_all A
    T-->>P: A freed, peer edge dest set to NULL

    Z->>P: NODE DELETE vertex B
    loop outgoing edges of B
        P->>E: bgp_ls_process_edge DELETE
        E-->>P: flog_err dest is NULL spurious error
    end
    P->>T: ls_vertex_del_all B

    Z->>P: ATTR UPDATE
    P->>P: ls_find_edge_by_source with new attrs
    alt old edge found same key
        P->>E: bgp_ls_process_edge DELETE old edge
        E-->>P: stale NLRI withdrawn
    else key changed old edge not found
        P-->>P: stale NLRI remains on peer
    end
    P->>P: ls_msg2edge update attrs in place
    P->>E: bgp_ls_process_edge UPDATE
    E-->>P: new NLRI advertised
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: bgpd/bgp_ls_ted.c
Line: 1167-1170

Comment:
**Spurious `flog_err` on partial-deletion of connected edges**

After vertex A is deleted, `ls_vertex_del` calls `ls_disconnect(A, B→A, false)`, setting `B→A->destination = NULL` while leaving the edge in B's `outgoing_edges` list. When vertex B is subsequently deleted, this new loop encounters `B→A` with `destination == NULL` and `bgp_ls_process_edge` fires `flog_err(EC_BGP_LS_PACKET, "BGP-LS: Edge missing source or destination vertex")` — even though the withdrawal was already correctly sent during A's `incoming_edges` loop. In a large topology this generates one false error per affected edge per vertex-deletion sequence.

Add a guard before each `bgp_ls_process_edge` call to skip edges that have already been partially disconnected:

```suggestion
			for (ALL_LIST_ELEMENTS_RO(vertex->outgoing_edges, lnode, e)) {
				if (e->source && e->destination)
					bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
			}
			for (ALL_LIST_ELEMENTS_RO(vertex->incoming_edges, lnode, e)) {
				if (e->source && e->destination)
					bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
			}
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: bgpd/bgp_ls_ted.c
Line: 1185-1191

Comment:
**Stale withdrawal misses when edge key changes on UPDATE**

`ls_find_edge_by_source` derives the lookup key from `msg->data.attr` (the *new* attributes). If the UPDATE changes the attribute that forms the edge key (e.g. the local interface address used by `get_edge_key`), the old edge in TED will have a different key and won't be found here — leaving the old BGP route unwithdrawn. `ls_msg2edge` will then insert a new edge under the new key, and the peer ends up holding both the stale old NLRI and the fresh new one.

A more robust approach is to look up the old edge using the existing key stored in TED *before* the attribute is updated, or to check whether `ls_msg2edge` returns a newly-inserted edge vs. an in-place update, and only send the pre-update withdraw when a key change is detected.

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "bgp_ls_ted: fix issue of withdrawing edg..." | Re-trigger Greptile

Comment on lines +1167 to +1170
for (ALL_LIST_ELEMENTS_RO(vertex->outgoing_edges, lnode, e))
bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
for (ALL_LIST_ELEMENTS_RO(vertex->incoming_edges, lnode, e))
bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Spurious flog_err on partial-deletion of connected edges

After vertex A is deleted, ls_vertex_del calls ls_disconnect(A, B→A, false), setting B→A->destination = NULL while leaving the edge in B's outgoing_edges list. When vertex B is subsequently deleted, this new loop encounters B→A with destination == NULL and bgp_ls_process_edge fires flog_err(EC_BGP_LS_PACKET, "BGP-LS: Edge missing source or destination vertex") — even though the withdrawal was already correctly sent during A's incoming_edges loop. In a large topology this generates one false error per affected edge per vertex-deletion sequence.

Add a guard before each bgp_ls_process_edge call to skip edges that have already been partially disconnected:

Suggested change
for (ALL_LIST_ELEMENTS_RO(vertex->outgoing_edges, lnode, e))
bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
for (ALL_LIST_ELEMENTS_RO(vertex->incoming_edges, lnode, e))
bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
for (ALL_LIST_ELEMENTS_RO(vertex->outgoing_edges, lnode, e)) {
if (e->source && e->destination)
bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
}
for (ALL_LIST_ELEMENTS_RO(vertex->incoming_edges, lnode, e)) {
if (e->source && e->destination)
bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
}
Prompt To Fix With AI
This is a comment left during a code review.
Path: bgpd/bgp_ls_ted.c
Line: 1167-1170

Comment:
**Spurious `flog_err` on partial-deletion of connected edges**

After vertex A is deleted, `ls_vertex_del` calls `ls_disconnect(A, B→A, false)`, setting `B→A->destination = NULL` while leaving the edge in B's `outgoing_edges` list. When vertex B is subsequently deleted, this new loop encounters `B→A` with `destination == NULL` and `bgp_ls_process_edge` fires `flog_err(EC_BGP_LS_PACKET, "BGP-LS: Edge missing source or destination vertex")` — even though the withdrawal was already correctly sent during A's `incoming_edges` loop. In a large topology this generates one false error per affected edge per vertex-deletion sequence.

Add a guard before each `bgp_ls_process_edge` call to skip edges that have already been partially disconnected:

```suggestion
			for (ALL_LIST_ELEMENTS_RO(vertex->outgoing_edges, lnode, e)) {
				if (e->source && e->destination)
					bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
			}
			for (ALL_LIST_ELEMENTS_RO(vertex->incoming_edges, lnode, e)) {
				if (e->source && e->destination)
					bgp_ls_process_edge(bgp, e, LS_MSG_EVENT_DELETE);
			}
```

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks valid

Comment on lines +1185 to +1191
if (msg->event == LS_MSG_EVENT_UPDATE) {
struct ls_edge *old_edge;
old_edge = ls_find_edge_by_source(bgp->ls_info->ted,
msg->data.attr);
if (old_edge)
bgp_ls_process_edge(bgp, old_edge, LS_MSG_EVENT_DELETE);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Stale withdrawal misses when edge key changes on UPDATE

ls_find_edge_by_source derives the lookup key from msg->data.attr (the new attributes). If the UPDATE changes the attribute that forms the edge key (e.g. the local interface address used by get_edge_key), the old edge in TED will have a different key and won't be found here — leaving the old BGP route unwithdrawn. ls_msg2edge will then insert a new edge under the new key, and the peer ends up holding both the stale old NLRI and the fresh new one.

A more robust approach is to look up the old edge using the existing key stored in TED before the attribute is updated, or to check whether ls_msg2edge returns a newly-inserted edge vs. an in-place update, and only send the pre-update withdraw when a key change is detected.

Prompt To Fix With AI
This is a comment left during a code review.
Path: bgpd/bgp_ls_ted.c
Line: 1185-1191

Comment:
**Stale withdrawal misses when edge key changes on UPDATE**

`ls_find_edge_by_source` derives the lookup key from `msg->data.attr` (the *new* attributes). If the UPDATE changes the attribute that forms the edge key (e.g. the local interface address used by `get_edge_key`), the old edge in TED will have a different key and won't be found here — leaving the old BGP route unwithdrawn. `ls_msg2edge` will then insert a new edge under the new key, and the peer ends up holding both the stale old NLRI and the fresh new one.

A more robust approach is to look up the old edge using the existing key stored in TED *before* the attribute is updated, or to check whether `ls_msg2edge` returns a newly-inserted edge vs. an in-place update, and only send the pre-update withdraw when a key change is detected.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks valid

@donaldsharp
Copy link
Copy Markdown
Member

@cscarpitta can you look at this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants