Skip to content

zebra: fix SRv6 encap lost during recursive nexthop resolution#21519

Open
dawkopagh wants to merge 1 commit intoFRRouting:masterfrom
dawkopagh:dkopec/zebra/fix-recursive-srv6-nexthop
Open

zebra: fix SRv6 encap lost during recursive nexthop resolution#21519
dawkopagh wants to merge 1 commit intoFRRouting:masterfrom
dawkopagh:dkopec/zebra/fix-recursive-srv6-nexthop

Conversation

@dawkopagh
Copy link
Copy Markdown

Description

When resolving a recursive nexthop, nexthop_set_resolved() copied MPLS labels from both the resolver's FIB nexthop (newhop) and the parent nexthop, but copied SRv6 info only from the parent. As a result, an IPv4 route whose nexthop resolved through an SRv6 VPN route was installed with encap mpls instead of encap seg6, silently breaking traffic forwarding.

Fix

Fixed by adding a newhop->nh_srv6 copy block before the existing nexthop->nh_srv6 block, mirroring the MPLS label stacking logic. Both seg6local action and seg6 SID stack are propagated, a sid_zero() guard prevents copying an uninitialised SID.

Testing

Before fix:

(Pdb) print(r1.run("ip route show vrf test1"))
198.0.0.150 nhid 12  encap mpls  16 via inet6 fe80::98de:f8ff:fe1b:5766 dev eth0 proto 196 metric 20

(Pdb) print(r1.run("ip -6 route show vrf test1"))
2600:1000::101 nhid 9  encap seg6 mode encap segs 1 [ 2001:db8:3:1:: ] via fe80::98de:f8ff:fe1b:5766 dev eth0 proto bgp metric 20 pref medium

(Pdb) print(r1.vtysh_cmd("show ip route vrf test1"))
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

IPv4 unicast VRF test1:
S>  198.0.0.150/32 [1/0] via 2600:1000::101 (recursive), weight 1, 00:27:57
  *                        via fe80::98de:f8ff:fe1b:5766, eth0, label 16, weight 1, 00:27:57

(Pdb) print(r1.vtysh_cmd("show ipv6 route vrf test1"))
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIPng, O - OSPFv3, I - IS-IS, B - BGP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

IPv6 unicast VRF test1:
B>* 2600:1000::101/128 [20/0] via fe80::98de:f8ff:fe1b:5766, eth0 (vrf default), label 16, seg6 2001:db8:3:1::, weight 1, 00:39:51

After fix:

(Pdb) print(r1.run("ip route show vrf test1"))
198.0.0.150 nhid 12  encap seg6 mode encap segs 1 [ 2001:db8:3:1:: ] via inet6 fe80::d05d:bdff:fefd:2669 dev eth0 proto 196 metric 20

(Pdb) print(r1.run("ip -6 route show vrf test1"))
2600:1000::101 nhid 9  encap seg6 mode encap segs 1 [ 2001:db8:3:1:: ] via fe80::d05d:bdff:fefd:2669 dev eth0 proto bgp metric 20 pref medium

(Pdb) print(r1.vtysh_cmd("show ip route vrf test1"))
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

IPv4 unicast VRF test1:
S>  198.0.0.150/32 [1/0] via 2600:1000::101 (recursive), weight 1, 00:01:12
  *                        via fe80::d05d:bdff:fefd:2669, eth0, label 16, seg6 2001:db8:3:1::, weight 1, 00:01:12

(Pdb) print(r1.vtysh_cmd("show ipv6 route vrf test1"))
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIPng, O - OSPFv3, I - IS-IS, B - BGP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
       F - PBR, f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

IPv6 unicast VRF test1:
B>* 2600:1000::101/128 [20/0] via fe80::d05d:bdff:fefd:2669, eth0 (vrf default), label 16, seg6 2001:db8:3:1::, weight 1, 00:01:17

When resolving a recursive nexthop, nexthop_set_resolved() copied MPLS
labels from both the resolver's FIB nexthop (newhop) and the parent
nexthop, but copied SRv6 info only from the parent.  As a result, an
IPv4 route whose nexthop resolved through an SRv6 VPN route was
installed with encap mpls instead of encap
seg6, silently breaking traffic forwarding.

Fix by adding a newhop->nh_srv6 copy block before the existing
nexthop->nh_srv6 block, mirroring the MPLS label stacking logic.
Both seg6local action and seg6 SID stack are propagated, a sid_zero()
guard prevents copying an uninitialised SID.

Signed-off-by: Dawid Kopec <dkopec@akamai.com>
@greptile-apps
Copy link
Copy Markdown

greptile-apps bot commented Apr 14, 2026

Greptile Summary

This PR fixes a bug in nexthop_set_resolved() where SRv6 encapsulation info from the FIB resolver nexthop (newhop->nh_srv6) was silently dropped during recursive nexthop resolution, causing an IPv4 route resolved through an SRv6 VPN route to be installed with encap mpls instead of encap seg6. The fix adds a newhop->nh_srv6 copy block mirroring the existing MPLS label-stacking logic, and a new topotest exercises the full scenario end-to-end.

  • P1 – heap buffer overflow: nexthop_add_srv6_seg6 in lib/nexthop.c only allocates seg6_segs when it is NULL; when both the new newhop block and the existing nexthop block call it on the same resolved_hop and the nexthop SID count exceeds the newhop SID count, the second call writes beyond the first call's allocation.
  • P2 – guard inconsistency: the existing nexthop->nh_srv6 block at line 1992 lacks the num_segs && and !sid_zero() guards that the new newhop block correctly adds, creating an asymmetry that can propagate empty or zero SIDs.

Confidence Score: 4/5

  • Safe for the primary single-SID scenario, but a P1 heap buffer overflow exists in the multi-SID overlap case introduced by the fix.
  • The core bug fix is correct and well-tested. However, the new newhop SRv6 block, when combined with the existing nexthop block, can call nexthop_add_srv6_seg6 twice on the same resolved_hop without reallocating seg6_segs, producing a heap overflow when the parent nexthop carries more SIDs than the resolver. This warrants attention before merge.
  • zebra/zebra_nhg.c — specifically the interaction between the new newhop SRv6 copy block (lines 1967-1982) and the existing nexthop SRv6 copy block (lines 1984-1996).

Important Files Changed

Filename Overview
zebra/zebra_nhg.c Adds newhop->nh_srv6 copy block to nexthop_set_resolved(), correctly fixing the SRv6 encap loss during recursive NH resolution; introduces a potential heap buffer overflow when both newhop and nexthop carry multi-SID seg lists of different lengths, and is inconsistent with the existing nexthop block's guards.
tests/topotests/bgp_srv6_recursive_nhop_encap/test_bgp_srv6_recursive_nhop_encap.py New regression test for the SRv6 recursive nexthop encap bug; correctly exercises the fixed code path, polls both RIB and kernel FIB, and verifies SID identity across recursive and resolved routes.
tests/topotests/bgp_srv6_recursive_nhop_encap/r1/frr.conf r1 FRR config with SRv6 locator, iBGP peering, VRF test1 import, and the static IPv4 route that triggers recursive resolution through the SRv6 VPN route; topology and address plan are consistent with the test expectations.
tests/topotests/bgp_srv6_recursive_nhop_encap/r2/frr.conf r2 FRR config that exports 2600:1000::101/128 as an SRv6 VPN route with SID 2001:db8:3:1::; locator, SID assignment, and RT/RD values are consistent with the test's EXPECTED_SID and r1's import policy.

Sequence Diagram

sequenceDiagram
    participant Z as zebra RIB
    participant R as nexthop_set_resolved()
    participant NH as nexthop_add_srv6_seg6()

    Z->>R: "resolve 198.0.0.150/32<br/>nexthop=2600:1000::101 (nh_srv6=NULL)<br/>newhop=FIB NH of SRv6 VPN route (nh_srv6=SID)"

    note over R: NEW: copy newhop->nh_srv6
    R->>NH: resolved_hop, newhop SID (N segs)
    NH-->>R: allocates seg6_segs[N], writes N SIDs

    note over R: existing: copy nexthop->nh_srv6
    R->>NH: resolved_hop, nexthop SID (M segs)
    alt "nexthop->nh_srv6 == NULL (bug-fix scenario)"
        NH-->>R: skipped — nh_srv6 is NULL
    else "both have SRv6 AND M > N (overflow risk)"
        NH-->>R: seg6_segs NOT reallocated, writes M SIDs into N-sized buffer ⚠️
    end

    R-->>Z: resolved_hop with correct seg6 encap (normal case)
Loading

Comments Outside Diff (2)

  1. zebra/zebra_nhg.c, line 1984-1995 (link)

    P1 Heap buffer overflow when nexthop has more SIDs than newhop

    nexthop_add_srv6_seg6 only allocates seg6_segs when it is NULL (see lib/nexthop.c:714). The new newhop block (lines 1975–1981) runs first and allocates seg6_segs sized for newhop->nh_srv6->seg6_segs->num_segs (call it N). When both newhop->nh_srv6 and nexthop->nh_srv6 have seg lists and nexthop->nh_srv6->seg6_segs->num_segs (M) > N, the second call at line 1993 skips reallocation, sets num_segs = M, and writes M struct in6_addr values into a buffer that only holds N — overflowing the heap by (M − N) × 16 bytes.

    The fix is to free and reallocate (or XREALLOC) seg6_segs before the second copy, analogously to how nexthop_add_labels uses XREALLOC in lib/nexthop.c:650–657.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: zebra/zebra_nhg.c
    Line: 1984-1995
    
    Comment:
    **Heap buffer overflow when nexthop has more SIDs than newhop**
    
    `nexthop_add_srv6_seg6` only allocates `seg6_segs` when it is `NULL` (see `lib/nexthop.c:714`). The new `newhop` block (lines 1975–1981) runs first and allocates `seg6_segs` sized for `newhop->nh_srv6->seg6_segs->num_segs` (call it N). When both `newhop->nh_srv6` and `nexthop->nh_srv6` have seg lists and `nexthop->nh_srv6->seg6_segs->num_segs` (M) > N, the second call at line 1993 skips reallocation, sets `num_segs = M`, and writes M `struct in6_addr` values into a buffer that only holds N — overflowing the heap by `(M − N) × 16` bytes.
    
    The fix is to free and reallocate (or XREALLOC) `seg6_segs` before the second copy, analogously to how `nexthop_add_labels` uses `XREALLOC` in `lib/nexthop.c:650–657`.
    
    How can I resolve this? If you propose a fix, please make it concise.
  2. zebra/zebra_nhg.c, line 1992-1995 (link)

    P2 Inconsistent guards on existing nexthop block

    The new newhop block (line 1975) guards nexthop_add_srv6_seg6 with num_segs && and !sid_zero(), matching the pattern in nexthop_copy_no_recurse (lib/nexthop.c:924-926). The existing nexthop->nh_srv6 block here only checks seg6_segs != NULL, so it can pass a zero-length or all-zero SID stack to the helper. Now that both blocks coexist the inconsistency is more visible; aligning the guard would also prevent a spurious second call that overwrites the first call's result with an uninitialised SID.

    Prompt To Fix With AI
    This is a comment left during a code review.
    Path: zebra/zebra_nhg.c
    Line: 1992-1995
    
    Comment:
    **Inconsistent guards on existing nexthop block**
    
    The new `newhop` block (line 1975) guards `nexthop_add_srv6_seg6` with `num_segs &&` and `!sid_zero()`, matching the pattern in `nexthop_copy_no_recurse` (`lib/nexthop.c:924-926`). The existing `nexthop->nh_srv6` block here only checks `seg6_segs != NULL`, so it can pass a zero-length or all-zero SID stack to the helper. Now that both blocks coexist the inconsistency is more visible; aligning the guard would also prevent a spurious second call that overwrites the first call's result with an uninitialised SID.
    
    
    
    How can I resolve this? If you propose a fix, please make it concise.
Prompt To Fix All With AI
This is a comment left during a code review.
Path: zebra/zebra_nhg.c
Line: 1984-1995

Comment:
**Heap buffer overflow when nexthop has more SIDs than newhop**

`nexthop_add_srv6_seg6` only allocates `seg6_segs` when it is `NULL` (see `lib/nexthop.c:714`). The new `newhop` block (lines 1975–1981) runs first and allocates `seg6_segs` sized for `newhop->nh_srv6->seg6_segs->num_segs` (call it N). When both `newhop->nh_srv6` and `nexthop->nh_srv6` have seg lists and `nexthop->nh_srv6->seg6_segs->num_segs` (M) > N, the second call at line 1993 skips reallocation, sets `num_segs = M`, and writes M `struct in6_addr` values into a buffer that only holds N — overflowing the heap by `(M − N) × 16` bytes.

The fix is to free and reallocate (or XREALLOC) `seg6_segs` before the second copy, analogously to how `nexthop_add_labels` uses `XREALLOC` in `lib/nexthop.c:650–657`.

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: zebra/zebra_nhg.c
Line: 1992-1995

Comment:
**Inconsistent guards on existing nexthop block**

The new `newhop` block (line 1975) guards `nexthop_add_srv6_seg6` with `num_segs &&` and `!sid_zero()`, matching the pattern in `nexthop_copy_no_recurse` (`lib/nexthop.c:924-926`). The existing `nexthop->nh_srv6` block here only checks `seg6_segs != NULL`, so it can pass a zero-length or all-zero SID stack to the helper. Now that both blocks coexist the inconsistency is more visible; aligning the guard would also prevent a spurious second call that overwrites the first call's result with an uninitialised SID.

```suggestion
		if (nexthop->nh_srv6->seg6_segs &&
		    nexthop->nh_srv6->seg6_segs->num_segs &&
		    !sid_zero(nexthop->nh_srv6->seg6_segs))
			nexthop_add_srv6_seg6(resolved_hop, &nexthop->nh_srv6->seg6_segs->seg[0],
					      nexthop->nh_srv6->seg6_segs->num_segs,
					      nexthop->nh_srv6->seg6_segs->encap_behavior);
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "zebra: fix SRv6 encap lost during recurs..." | Re-trigger Greptile

@cscarpitta cscarpitta self-requested a review April 14, 2026 08:45
labels);

/* Copy SRv6 info from the resolved route's nexthop first, then
* overlay any SRv6 info from the parent nexthop (consistent with
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the comment here says "overlay", but it looks like the lib/nexthop sr apis being used just copy info; there's no merge code like the mpls stack code?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah you are right, the comment was misleading, nexthop_add_srv6_seg6local and nexthop_add_srv6_seg6 just assign/copy fields directly. I'll update the comment accordingly

if (newhop->nh_srv6->seg6_segs &&
newhop->nh_srv6->seg6_segs->num_segs &&
!sid_zero(newhop->nh_srv6->seg6_segs))
nexthop_add_srv6_seg6(resolved_hop,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the greptile issue with these apis looks valid to me

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants