macOS peers.d cache can get stuck with no-path entries for LEAF peers; daemon never refreshes via PLANET WHOIS #2585



## TL;DR

On macOS Tahoe 26.3.1 with ZeroTier 1.16.1, LEAF peer discovery never completes on a NATed corporate LAN where this exact setup worked fine until ~7 days ago. `zerotier-cli listpeers` shows LEAFs with `path: -`, `latency: -1`; `/peer/<addr>` returns `paths: []`, `version: -1.-1.-1`. The daemon also writes negative "no path" entries to `peers.d/<ztaddr>.peer` that survive every soft remedy (service restart, `leave`/`join`, uplink change). The `forceTcpRelay` workaround does establish a TCP connection to `204.80.128.1:443`, but it forwards traffic only some of the time, and when it does, peers take minutes to appear.

Possibly related to #2584 (same OS major, same ZT version, also involves coexisting with another VPN), but the symptom and code path are different.

## Environment

- **OS:** macOS **Tahoe 26.3.1**, build `25D771280a` (Apple Silicon, MacBook Pro)
- **ZeroTier:** **1.16.1**, installed from official `.pkg`. Binary mtime `Dec 12 20:11:57 2025` — unchanged for ~5 months.
- **Network:** Private, default flow rules (`accept;`), 5 members, all authorized and `active < 1 min` on Central throughout the issue.
- **Coexisting VPN (during initial diagnosis):** Homebrew OpenVPN `/opt/homebrew/sbin/openvpn` using `dev tun` (classic tun driver, **not** NetworkExtension / NEPacketTunnelProvider). Split-tunnel (no `redirect-gateway`). Pushes specific `10.x` routes only.

## Timing — almost certainly a recent macOS regression

This exact environment (same Mac, same network, same ZT install, same OpenVPN config) worked fine until **roughly May 5–6, 2026**. Today is May 12.

```
log show --predicate 'subsystem == "com.apple.SoftwareUpdate"' --info --last 14d
```

…shows substantial OS-update activity on `2026-05-05` and `2026-05-06`. ZeroTier itself was not updated in that window (`stat` of `zerotier-one` confirms `Dec 12 20:11:57 2025`). The timing therefore correlates strongly with a macOS Tahoe dot-release rather than with any change to ZT.

## Symptoms

### 1. LEAF peer paths never form on the corporate LAN

- `zerotier-cli info` → `ONLINE`
- `zerotier-cli listnetworks` → `OK PRIVATE feth3618 10.147.18.235/24`
- Control plane to the network controller and several PLANETs works (with degraded freshness — `lastReceive` values of 100+ seconds for some PLANETs on the same uplink that's perfectly fine for everything else).
- LEAF peers in `listpeers`:
  ```
  98da488706 - -1 - LEAF
  d624836ca0 - -1 - LEAF
  e751b8120a - -1 - LEAF
  ```
- `/peer/<ztaddr>` returns:
  ```json
  { "address": "98da488706", "paths": [], "version": "-1.-1.-1", "latency": -1, "role": "LEAF" }
  ```
- After timeout, the daemon prunes the LEAF entries entirely from `listpeers`.
- Kernel route for the peer's ZT IP picks up the `REJECT` flag (`<UP,HOST,REJECT,...>`) on the `feth` interface, and `ping` returns `No route to host` / `Host is down`.

This persists across:
- `launchctl kickstart -k system/com.zerotier.one`
- Killing OpenVPN entirely + waiting for `utunX` interfaces to drop
- `zerotier-cli leave 35c192ce9b9531d1` + `zerotier-cli join` (membership reverifies `OK`, identity preserved, still no LEAF paths)
- Changing primary uplink while ZT is running
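
The stuck state described in this section can be spotted mechanically. As a minimal sketch, the helper below filters a peer listing for LEAF rows that have neither a path nor a latency. It assumes the abbreviated five-column layout quoted in this report (`<ztaddr> <path> <latency> <version> <role>`); raw `zerotier-cli listpeers` output may order its columns differently, in which case the field numbers need adjusting. The function name `pathless_leafs` is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: print addresses of LEAF peers with no path and no latency.
# Assumed input layout (as quoted in this report, not necessarily the raw CLI layout):
#   <ztaddr> <path> <latency> <version> <role>
pathless_leafs() {
  awk '$5 == "LEAF" && $2 == "-" && $3 == "-1" { print $1 }'
}

# Example with rows copied from this report: prints only 98da488706.
printf '%s\n' \
  '98da488706 - -1 - LEAF' \
  '35c192ce9b 35.208.198.255/21006;3317;3317 465 1.15.3 LEAF' |
  pathless_leafs
```

An empty result would mean every listed LEAF has at least a path or a latency value; it says nothing about peers the daemon has already pruned from the listing.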

### 2. `peers.d/` cache poisoning

Pre-wipe `peers.d/` listing (during the failure window):

```
35c192ce9b.peer   May 12 16:33   82B   ← controller (was being refreshed)
778cde7190.peer   May 12 16:33   96B   ← PLANET (refreshed)
cafe04eba9.peer   May 12 16:33  117B   ← PLANET (refreshed, with full path)
cafe80ed74.peer   May 12 15:46   89B   ← PLANET (refreshed)
cafefd6717.peer   May 12 16:33   96B   ← PLANET (refreshed)
98da488706.peer   May 12 15:28   82B   ← LEAF — frozen at daemon's first-start mtime
d624836ca0.peer   May 12 15:28   82B   ← LEAF — frozen
e751b8120a.peer   May 12 15:28   82B   ← LEAF — frozen
ef6f1e242a.peer   May 12 16:00   82B
62f865ae71.peer   May 12 15:27   82B   ← orphan, not a member of any current network
cafe9efeb9.peer   May 12 15:27   82B   ← orphan PLANET
```

LEAF `.peer` files stay at the minimal 82-byte "identity-only, no path" size and at the daemon's first-start mtime for the entire failure window, while PLANET files are rewritten every few minutes. This suggests the daemon writes a negative entry on the initial WHOIS timeout and then does not re-attempt WHOIS for those peers on subsequent ticks. Two orphan entries are also present for peers that are no longer on this network.
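
To make the "frozen" entries easier to find without reading a directory listing by eye, here is a rough sketch that lists `.peer` files of exactly 82 bytes that have not been modified for a given number of minutes. The 82-byte size is only the identity-only size observed in this report, not a documented constant, and the 30-minute default is an arbitrary choice. The function name `stale_peer_files` is hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: list .peer files that look like the frozen entries above.
#   $1 = peers.d directory
#   $2 = minimum age in minutes (default 30, an arbitrary threshold)
# The 82-byte size is what this report observed for identity-only entries;
# it is an assumption, not a value taken from ZeroTier documentation.
stale_peer_files() {
  find "$1" -type f -name '*.peer' -size 82c -mmin +"${2:-30}" -print
}
```

On the setup in this report it would be called as `stale_peer_files "/Library/Application Support/ZeroTier/One/peers.d" 30`, from a shell that has read access to that directory. Orphan entries match as well, since they have the same size and are not being refreshed.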

Resolution:

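```sh
# Stop the daemon and make sure no stray process is left running
sudo launchctl bootout system/com.zerotier.one
sudo pkill -9 zerotier-one
# Delete every cached peer entry, including the stuck LEAF files
sudo rm -f "/Library/Application Support/ZeroTier/One/peers.d/"*
# Start the daemon again with an empty peer cache
sudo launchctl bootstrap system /Library/LaunchDaemons/com.zerotier.one.plist
```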

**But:** this only fully fixes things on a network where peer discovery can actually complete. On the broken corporate LAN, the cache wipe is followed by a fresh round of failed WHOIS, the LEAFs are pruned again, and we're back where we started. The cache wipe is therefore necessary but not sufficient.

### 3. `forceTcpRelay` works but converges very slowly

Config:

```jsonc
// /Library/Application Support/ZeroTier/One/local.conf
{ "settings": { "forceTcpRelay": true, "tcpFallbackRelay": "204.80.128.1/443" } }
```

After full reset (stop daemon, wipe `peers.d/`, write `local.conf`, bootstrap), the TCP relay socket is `ESTABLISHED` immediately:

```
zerotier-one  TCP 10.255.245.116:53210->204.80.128.1:443 (ESTABLISHED)
```

…and an independent `nc -vz 204.80.128.1 443` succeeds. But the daemon shows `info: TUNNELED`, `listnetworks: OK` (netconf came through), and *only* the 4 PLANETs in `listpeers` with `path: -, latency: -1` — **no controller, no LEAFs** — for **2+ minutes**. The first attempt of the day after a fresh reboot can converge in ~75 s; subsequent attempts on the same uplink frequently need ~120–180 s of continuous outbound peer traffic (ICMP, application packets) before LEAFs and controller suddenly populate in `listpeers` with real paths.

Concretely measured today: 12 rounds of `ping` separated by 15 s sleeps; peer paths first appeared at round 9 (~2 minutes in), then stable from then on. Sample populated state:

```
35c192ce9b  35.208.198.255/21006;3317;3317  465 1.15.3 LEAF
98da488706  83.5.72.254/41558;10676;13984    334 1.16.1 LEAF
d624836ca0  83.5.72.254/33960;10676;10309    373 1.14.1 LEAF
```

Bidirectional ping then works at ~290 ms RTT, no loss. So the TCP relay path is functionally fine; it's the convergence latency that's surprising — and during those 2 minutes there's no signal at all in `info` / `listpeers` / API responses that progress is being made, so it looks identical to a stuck state.
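
The measurement above (rounds of `ping` separated by sleeps) can be repeated with a loop along the following lines. This is a sketch, not the exact script used for the numbers in this report: the function name `wait_for_paths` and its argument order are invented here, and it only reports the first round in which the probe command succeeds.

```sh
#!/bin/sh
# Hypothetical helper: run a probe command up to ROUNDS times, pausing PAUSE
# seconds after each failed round, and report the first round that succeeds.
# Usage: wait_for_paths ROUNDS PAUSE COMMAND [ARGS...]
wait_for_paths() {
  rounds=$1
  pause=$2
  shift 2
  i=1
  while [ "$i" -le "$rounds" ]; do
    # The probe's own output is discarded; only its exit status matters.
    if "$@" >/dev/null 2>&1; then
      echo "converged at round $i"
      return 0
    fi
    sleep "$pause"
    i=$((i + 1))
  done
  echo "no success after $rounds rounds"
  return 1
}
```

For example, `wait_for_paths 12 15 ping -c 3 10.147.18.115` mirrors the 12 rounds with 15 s sleeps described above, using the peer address from the workaround section. `ping -c 3` exits non-zero when no reply arrives, which is what makes it usable as the probe.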

Suggested daemon improvement: either expose a "TCP relay handshake/WHOIS retry" counter in `/peer` or `/status` so operators can tell convergence-in-progress from stuck, or shorten the WHOIS retry interval over the TCP fallback path.

## What is *not* the cause

- **OpenVPN.** Killing it (and removing all `utunX` interfaces it created) does not change the symptom.
- **NetworkExtension VPN.** None is installed. The only NE-style extension on the system is Karabiner-DriverKit (keyboard).
- **Endpoint security / EDR / packet inspectors.** None present. `systemextensionsctl list` shows only Karabiner. `kmutil showloaded` has only stock Apple kexts. No CrowdStrike, SentinelOne, Cisco, Netskope, Zscaler, GlobalProtect, etc.
- **pf / macOS Application Firewall.** Both `Disabled`.
- **Corporate firewall.** The same symptom persists after switching to a completely different uplink (iPhone cellular hotspot, `172.20.10.x` NAT, no corporate gear in path) until `peers.d/` is wiped. TCP/443 from the corporate LAN to `204.80.128.1` also succeeds; only the data plane through the relay is unreliable.
- **Membership / authorization.** Central shows all peers as `active < 1 min` continuously. Network status is `OK`. `leave`/`join` cycle works, returns to `OK`, doesn't fix peers.
- **Flow rules.** Default (just `accept;`).
- **Moons.** None configured.

## Working workaround (use *both* steps)

1. Switch uplink to a less restrictive network (iPhone hotspot is fine).
2. Wipe `peers.d/`:

```sh
sudo launchctl bootout system/com.zerotier.one
sudo pkill -9 zerotier-one
sudo rm -f "/Library/Application Support/ZeroTier/One/peers.d/"*
sudo launchctl bootstrap system /Library/LaunchDaemons/com.zerotier.one.plist
sleep 25   # give the daemon time to start before querying it
zerotier-cli info && zerotier-cli listpeers && ping -c 5 10.147.18.115
```

On a network where UDP peer discovery works (e.g. cellular), this immediately restores connectivity (via PLANET relay first, then direct P2P paths form shortly after) and `listpeers` populates fully with real paths, latency, and versions. Switching back to the corporate LAN breaks it again, and the cycle repeats.

## Hypothesis

Three failure modes that probably share an upstream cause:

1. **macOS Tahoe 26.3.x changed something** (kernel networking, NE framework, packet-filter ordering, route-installation behavior, or similar) that makes ZT's UDP-based peer discovery / NAT-traversal fail far more often than it used to on certain networks. The same network was unproblematic 7 days ago.
2. **ZeroTier 1.16.1 writes negative peer-cache entries on a single WHOIS timeout and never re-attempts WHOIS for those peers** until the file is deleted. This turns transient discovery failures into permanent ones. The user-facing symptoms persist across `kickstart`, `leave`/`join`, and uplink changes.
3. **ZeroTier's TCP fallback relay (`forceTcpRelay`)** has at least one state where the outbound TCP socket to `204.80.128.1:443` is `ESTABLISHED` but the daemon does not actually pump control-plane / peer traffic through it. Identical config produces a working session one run and a broken session the next, with the same network and Mac state.

(1) is the macOS-side regression. (2) and (3) are robustness bugs in ZT that amplify it. I'm happy to split (2) and (3) into separate issues if the maintainers prefer.

## Relationship to #2584

- **Same:** macOS Tahoe 26, ZT 1.16.1.
- **Different:** #2584 requires an active NE-based VPN (UniFi Teleport / WireGuard via `NEPacketTunnelProvider` in full-tunnel mode); peer paths *do* form there (`DIRECT`/`RELAY` visible); outbound packets leave; **inbound packets are dropped by NE**. Disconnecting Teleport resolves it instantly.
- Here: peer paths *never form* (`paths: []`, `version: -1.-1.-1`); no NE-based VPN involved; coexisting OpenVPN is irrelevant (kills don't help); persists across uplink changes; only `peers.d/` wipe **on a non-broken network** restores connectivity, until you return to the broken network.

Plausibly both are downstream of the same macOS-26 networking change, manifesting in different ZT code paths.

## Data I can provide

Happy to attach `sysdiagnose`, `tcpdump` captures during failure, `zerotier-cli peer` API dumps before/after `peers.d/` wipe, full `lsof` output of `zerotier-one` in both working and broken states, or anything else useful. Reproducible reliably on my setup just by switching between the corporate LAN and cellular hotspot.
