macOS peers.d cache can get stuck with no-path entries for LEAF peers; daemon never refreshes via PLANET WHOIS #2585

Description

@daniilino

TL;DR

On macOS Tahoe 26.3.1 with ZeroTier 1.16.1, on a NATed corporate LAN where this exact setup worked fine until ~7 days ago, LEAF peer discovery never completes. zerotier-cli listpeers shows LEAFs with path: -, latency: -1; /peer/<addr> returns paths: [], version: -1.-1.-1. The daemon also writes negative "no path" entries to peers.d/<ztaddr>.peer that survive every soft remedy (service restart, leave/join, uplink change). The forceTcpRelay workaround does establish a TCP connection to 204.80.128.1:443 but only sometimes actually forwards traffic.

Possibly related to #2584 (same OS major, same ZT version, also involves coexisting with another VPN), but the symptom and code path are different.

Environment

  • OS: macOS Tahoe 26.3.1, build 25D771280a (Apple Silicon, MacBook Pro)
  • ZeroTier: 1.16.1, installed from official .pkg. Binary mtime Dec 12 20:11:57 2025 — unchanged for ~5 months.
  • Network: Private, default flow rules (accept;), 5 members, all authorized and active < 1 min on Central throughout the issue.
  • Coexisting VPN (during initial diagnosis): Homebrew OpenVPN /opt/homebrew/sbin/openvpn using dev tun (classic tun driver, not NetworkExtension / NEPacketTunnelProvider). Split-tunnel (no redirect-gateway). Pushes specific 10.x routes only.

Timing — almost certainly a recent macOS regression

This exact environment (same Mac, same network, same ZT install, same OpenVPN config) worked fine until roughly May 5–6, 2026. Today is May 12.

log show --predicate 'subsystem == "com.apple.SoftwareUpdate"' --info --last 14d

…shows substantial OS-update activity on 2026-05-05 and 2026-05-06. ZeroTier itself was not updated in that window (stat of zerotier-one confirms Dec 12 20:11:57 2025). Strong correlation to a macOS Tahoe dot-release rather than to ZT.

Symptoms

1. LEAF peer paths never form on the corporate LAN

  • zerotier-cli info → ONLINE
  • zerotier-cli listnetworks → OK PRIVATE feth3618 10.147.18.235/24
  • Control plane to the network controller and several PLANETs works (with degraded freshness — lastReceive values of 100+ seconds for some PLANETs on the same uplink that's perfectly fine for everything else).
  • LEAF peers in listpeers:
    98da488706 - -1 - LEAF
    d624836ca0 - -1 - LEAF
    e751b8120a - -1 - LEAF
    
  • /peer/<ztaddr> returns:
    { "address": "98da488706", "paths": [], "version": "-1.-1.-1", "latency": -1, "role": "LEAF" }
  • After timeout, the daemon prunes the LEAF entries entirely from listpeers.
  • Kernel route for the peer's ZT IP picks up the REJECT flag (<UP,HOST,REJECT,...>) on the feth interface, and ping returns No route to host / Host is down.
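
To spot the stuck state quickly, the pathless LEAF rows can be filtered out of the peer list. This is a minimal sketch that assumes the column layout shown in this report (address, path, latency, version, role); the helper name pathless_leafs is mine, and the column positions may need adjusting if your zerotier-cli listpeers output is laid out differently:

```shell
#!/bin/sh
# Print the addresses of LEAF peers that have no path ("-") and unknown
# latency ("-1"). Reads peer rows on stdin.
# Assumed columns (as shown in this report): address path latency version role
pathless_leafs() {
  awk '$NF == "LEAF" && $2 == "-" && $3 == "-1" { print $1 }'
}
```

Usage would be something like zerotier-cli listpeers | pathless_leafs; an empty result means every listed LEAF has at least one path.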

This persists across:

  • launchctl kickstart -k system/com.zerotier.one
  • Killing OpenVPN entirely + waiting for utunX interfaces to drop
  • zerotier-cli leave 35c192ce9b9531d1 + zerotier-cli join (membership reverifies OK, identity preserved, still no LEAF paths)
  • Changing primary uplink while ZT is running

2. peers.d/ cache poisoning

Pre-wipe peers.d/ listing (during the failure window):

35c192ce9b.peer   May 12 16:33   82B   ← controller (was being refreshed)
778cde7190.peer   May 12 16:33   96B   ← PLANET (refreshed)
cafe04eba9.peer   May 12 16:33  117B   ← PLANET (refreshed, with full path)
cafe80ed74.peer   May 12 15:46   89B   ← PLANET (refreshed)
cafefd6717.peer   May 12 16:33   96B   ← PLANET (refreshed)
98da488706.peer   May 12 15:28   82B   ← LEAF — frozen at daemon's first-start mtime
d624836ca0.peer   May 12 15:28   82B   ← LEAF — frozen
e751b8120a.peer   May 12 15:28   82B   ← LEAF — frozen
ef6f1e242a.peer   May 12 16:00   82B
62f865ae71.peer   May 12 15:27   82B   ← orphan, not a member of any current network
cafe9efeb9.peer   May 12 15:27   82B   ← orphan PLANET

LEAF .peer files stay at the minimal 82-byte "identity-only, no path" size and at the daemon's first-start mtime for the entire failure window, while PLANET files are rewritten every few minutes. This suggests the daemon writes a negative entry on the initial WHOIS timeout and then does not re-attempt WHOIS for those peers on subsequent ticks. Two orphan entries are also present for peers that are no longer on this network.
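
The frozen-file pattern (82-byte files whose mtime stops advancing) can be checked mechanically. This is a rough sketch, assuming the 82-byte size observed here really does mark an identity-only entry; the function name stale_peer_files and the 30-minute default threshold are my own choices, not anything ZeroTier defines:

```shell
#!/bin/sh
# List peer addresses whose cache file is exactly 82 bytes and has not been
# modified for more than MIN_AGE minutes.
# $1 = peers.d directory, $2 = minimum age in minutes (default 30, arbitrary)
stale_peer_files() {
  dir=$1
  min_age=${2:-30}
  find "$dir" -type f -name '*.peer' -size 82c -mmin +"$min_age" \
    -exec basename {} .peer \;
}
```

On this machine it would be run as sudo sh against "/Library/Application Support/ZeroTier/One/peers.d". Note that it would also report the orphan entries, since they are 82 bytes too.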

Resolution:

sudo launchctl bootout system/com.zerotier.one
sudo pkill -9 zerotier-one
sudo rm -f "/Library/Application Support/ZeroTier/One/peers.d/"*
sudo launchctl bootstrap system /Library/LaunchDaemons/com.zerotier.one.plist

But: this only fully fixes things on a network where peer discovery can actually complete. On the broken corporate LAN, the cache wipe is followed by a fresh round of failed WHOIS, the LEAFs are pruned again, and we're back where we started. The cache wipe is therefore necessary but not sufficient.

3. forceTcpRelay works but converges very slowly

Config:

// /Library/Application Support/ZeroTier/One/local.conf
{ "settings": { "forceTcpRelay": true, "tcpFallbackRelay": "204.80.128.1/443" } }

After full reset (stop daemon, wipe peers.d/, write local.conf, bootstrap), the TCP relay socket is ESTABLISHED immediately:

zerotier-one  TCP 10.255.245.116:53210->204.80.128.1:443 (ESTABLISHED)

…and an independent nc -vz 204.80.128.1 443 succeeds. But the daemon shows info: TUNNELED, listnetworks: OK (netconf came through), and only the 4 PLANETs in listpeers with path: -, latency: -1 (no controller, no LEAFs) for 2+ minutes. The first attempt of the day after a fresh reboot can converge in ~75 s; subsequent attempts on the same uplink frequently need ~120–180 s of continuous outbound peer traffic (ICMP, application packets) before LEAFs and controller suddenly populate in listpeers with real paths.

Concretely measured today: 12 rounds of ping separated by 15 s sleeps; peer paths first appeared at round 9 (~2 minutes in), then stable from then on. Sample populated state:

35c192ce9b  35.208.198.255/21006;3317;3317  465 1.15.3 LEAF
98da488706  83.5.72.254/41558;10676;13984    334 1.16.1 LEAF
d624836ca0  83.5.72.254/33960;10676;10309    373 1.14.1 LEAF

Bidirectional ping then works at ~290 ms RTT, no loss. So the TCP relay path is functionally fine; it's the convergence latency that's surprising — and during those 2 minutes there's no signal at all in info / listpeers / API responses that progress is being made, so it looks identical to a stuck state.
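
Since the daemon gives no progress signal, the only way I found to tell slow convergence from a stuck state is to poll. The sketch below is roughly what the measurement above amounts to; wait_for_paths is a hypothetical helper, and it assumes the same column layout as the listpeers samples in this report (path in column 2, role last):

```shell
#!/bin/sh
# Poll a command that prints peer rows until at least one LEAF has a real
# path, then print the round number at which that happened.
# $1 = seconds between rounds, $2 = max rounds, remaining args = command
wait_for_paths() {
  interval=$1
  max_rounds=$2
  shift 2
  round=1
  while [ "$round" -le "$max_rounds" ]; do
    # awk exits 0 only if some LEAF row has a path other than "-"
    if "$@" | awk '$NF == "LEAF" && $2 != "-" { found = 1 } END { exit !found }'; then
      echo "$round"
      return 0
    fi
    sleep "$interval"
    round=$((round + 1))
  done
  return 1
}
```

For example, wait_for_paths 15 12 zerotier-cli listpeers mirrors the 12 rounds with 15 s sleeps described above. It does not generate traffic itself, so a ping to a peer still has to run alongside it.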

Suggested daemon improvement: either expose a "TCP relay handshake/WHOIS retry" counter in /peer or /status so operators can tell convergence-in-progress from stuck, or shorten the WHOIS retry interval over the TCP fallback path.

What is not the cause

  • OpenVPN. Killing it (and removing all utunX interfaces it created) does not change the symptom.
  • NetworkExtension VPN. None is installed. The only NE-style extension on the system is Karabiner-DriverKit (keyboard).
  • Endpoint security / EDR / packet inspectors. None present. systemextensionsctl list shows only Karabiner. kmutil showloaded has only stock Apple kexts. No CrowdStrike, SentinelOne, Cisco, Netskope, Zscaler, GlobalProtect, etc.
  • pf / macOS Application Firewall. Both Disabled.
  • Corporate firewall. Same symptom reproduces on a totally different uplink (iPhone cellular hotspot — 172.20.10.x NAT, no corporate gear in path). And TCP/443 from corporate to 204.80.128.1 succeeds; only the data plane through the relay is unreliable.
  • Membership / authorization. Central shows all peers as active < 1 min continuously. Network status is OK. leave/join cycle works, returns to OK, doesn't fix peers.
  • Flow rules. Default (just accept;).
  • Moons. None configured.

Working workaround (use both steps)

  1. Switch uplink to a less restrictive network (iPhone hotspot is fine).
  2. Wipe peers.d/:
sudo launchctl bootout system/com.zerotier.one
sudo pkill -9 zerotier-one
sudo rm -f "/Library/Application Support/ZeroTier/One/peers.d/"*
sudo launchctl bootstrap system /Library/LaunchDaemons/com.zerotier.one.plist
sleep 25
zerotier-cli info && zerotier-cli listpeers && ping 10.147.18.115

On a network where UDP peer discovery works (e.g. cellular), this immediately restores connectivity (via PLANET relay first, then direct P2P paths form shortly after) and listpeers populates fully with real paths, latency, and versions. Switching back to the corporate LAN breaks it again, and the cycle repeats.

Hypothesis

Three failure modes that probably share an upstream cause:

  1. macOS Tahoe 26.3.x changed something (kernel networking, NE framework, packet-filter ordering, route-installation behavior, or similar) that makes ZT's UDP-based peer discovery / NAT-traversal fail far more often than it used to on certain networks. The same network was unproblematic 7 days ago.
  2. ZeroTier 1.16.1 writes negative peer-cache entries on a single WHOIS timeout and never re-attempts WHOIS for those peers until the file is deleted. This turns transient discovery failures into permanent ones. The user-facing symptoms persist across kickstart, leave/join, and uplink changes.
  3. ZeroTier's TCP fallback relay (forceTcpRelay) has at least one state where the outbound TCP socket to 204.80.128.1:443 is ESTABLISHED but the daemon does not actually pump control-plane / peer traffic through it. Identical config produces a working session one run and a broken session the next, with the same network and Mac state.

(1) is the macOS-side regression. (2) and (3) are robustness bugs in ZT that amplify it. I'm happy to split (2) and (3) into separate issues if the maintainers prefer.

Relationship to #2584

  • Same: macOS Tahoe 26, ZT 1.16.1.
  • Different: #2584 ("[Bug] macOS Tahoe 26: ZT peers unreachable when coexisting with NetworkExtension full-tunnel VPN (UniFi Teleport)") requires an active NE-based VPN (UniFi Teleport / WireGuard via NEPacketTunnelProvider in full-tunnel mode); peer paths do form there (DIRECT/RELAY visible); outbound packets leave; inbound packets are dropped by NE. Disconnecting Teleport resolves it instantly.
  • Here: peer paths never form (paths: [], version: -1.-1.-1); no NE-based VPN involved; coexisting OpenVPN is irrelevant (kills don't help); persists across uplink changes; only peers.d/ wipe on a non-broken network restores connectivity, until you return to the broken network.

Plausibly both are downstream of the same macOS-26 networking change, manifesting in different ZT code paths.

Data I can provide

Happy to attach sysdiagnose, tcpdump captures during failure, zerotier-cli peer API dumps before/after peers.d/ wipe, full lsof output of zerotier-one in both working and broken states, or anything else useful. Reproducible reliably on my setup just by switching between the corporate LAN and cellular hotspot.
