Intermittent connection loss with HANDSHAKE(REKEY_TIMEOUT) errors #363
Comments
Is anyone able to help here? |
Same issue here:
Any ideas what it could be? Do you need any other info? |
I tried on Fedora 36 and it's even worse. The connection freezes after 5 seconds every time. Guess I'll use Tailscale. |
Similar issue, except on Windows 10. Traffic drops for 1-2 minutes at a time, every 5-10 minutes or so. I can consistently ping Google (8.8.8.8) and the Cloudflare VPN endpoint this client is using (172.68.168.79), so it doesn't seem to be an issue with the internet connection. There are 4 other remote clients in this organization that have the same settings applied, but they don't experience the issue. However, those 4 clients are connecting from different locations and to different Cloudflare VPN endpoints.
cfwarp_service_boring.txt shows:
About 30-50% of the time I also get an Incorrect sender index for responder gone packet error. But this message doesn't show up consistently when I experience the problem.
I'm currently trying to find a good way to re-home clients to different Cloudflare VPN endpoints. Maybe this client's VPN endpoint (172.68.168.79) is just having a bad day? |
EDIT: This didn't actually assign a different "VPN endpoint". It just caused a change in the internal VPN routing, which was observable when running tracert to a tunneled subnet. See post on 12/11/2023 for a real way to change the VPN endpoint. I was able to get the client shifted over to another VPN endpoint by revoking access, restarting the warp service, and re-authorizing the device (not sure if both were necessary). After that the system was solid for about an hour. Then it started dropping out again with the same error.
Back to the drawing board... |
I set up ping monitors on the workstation. The problem seems to come and go: sometimes it occurs every 5-10 minutes, other times there are no issues for a few hours.
I took a closer look at the logs using Excel. The protocol seems to alternate between two overlapping sessions, which I'm calling Session A and Session B. Each session lasts 3 minutes, then waits 1 minute before re-keying a new session. Everything looks peachy in the logs below, until we get to line 55. We should expect to see the first keepalive for Session A, which would look something like this:
Instead we get this around the same time we'd expect to get the keepalive:
I'm not exactly sure if this REKEY_TIMEOUT is for Session A or B. Session B is already 12 seconds into its rekey cool-down period. But I put the rekey failures in the Session B column anyway, since we still get a SESSION_EXPIRED log later on for Session A.
So the problem seems to be that this KEEPALIVE + REKEY_TIMEOUT timer is forcing boringtun to rekey Session A while Session B is in its 1 minute rekey cool-down period. For whatever reason, boringtun can't seem to rekey Session A during the 1 minute cool-down. Maybe the server won't allow it to rekey early? It doesn't really matter, since this seems to be a symptom of the root problem: what causes the KEEPALIVE + REKEY_TIMEOUT timer to trigger at an inopportune time? |
In the Zero Trust admin center, under Settings > Authentication, I tried changing the Global session timeout to 1 Month. This had no effect, so I changed it back to Same as application session timeout. Been doing some more reading on the subject, and WireGuard states:
Looking at the log's timestamps, I believe REKEY_TIMEOUT = 5 seconds and KEEPALIVE = 25 seconds. WireGuard goes on to say:
This means we should expect to see the KEEPALIVE + REKEY_TIMEOUT timer go off if it's been 30+ seconds since we sent a packet and have had no response. So now the question becomes, why aren't we getting a response in that 30 second window?
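To make the timer reasoning above concrete, here is a minimal sketch of that check in Python. It is not boringtun's implementation: the `PeerTimers` class and its method names are hypothetical, and the two constants are only the values estimated from the log timestamps in this thread.

```python
from dataclasses import dataclass
from typing import Optional

# Estimates taken from the log timestamps discussed above (assumptions,
# not confirmed constants of the WARP client).
REKEY_TIMEOUT_S = 5.0
KEEPALIVE_S = 25.0


@dataclass
class PeerTimers:
    """Hypothetical tracker for the KEEPALIVE + REKEY_TIMEOUT condition."""

    # Time of the oldest outgoing packet that has not been answered yet.
    oldest_unanswered_send: Optional[float] = None

    def on_send(self, now: float) -> None:
        # Only remember the first packet of an unanswered run.
        if self.oldest_unanswered_send is None:
            self.oldest_unanswered_send = now

    def on_receive(self, now: float) -> None:
        # Any packet from the peer clears the pending deadline.
        self.oldest_unanswered_send = None

    def needs_handshake(self, now: float) -> bool:
        # True once a sent packet has gone unanswered for
        # KEEPALIVE + REKEY_TIMEOUT seconds.
        if self.oldest_unanswered_send is None:
            return False
        waited = now - self.oldest_unanswered_send
        return waited >= KEEPALIVE_S + REKEY_TIMEOUT_S
```

With these assumed values, a packet sent at t=0 with no reply would trip the check at t=30, which matches the 30 second window described above.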
Does anyone know how to tweak the REKEY_TIMEOUT and KEEPALIVE parameters? I don't see this anywhere in the Windows registry or config files. |
I've implemented some monitoring that allows me to see the number of KEEPALIVE + REKEY_TIMEOUT (KART) and standalone REKEY_TIMEOUT (RKTO) events. The workstation at issue has improved over the past week, but it still experiences a significantly higher number of KART events than our other workstations, with the exception of one. This other PC (bottom of image) has 18x the number of KART events, but almost zero RKTO events and no network dropouts. This is interesting, because it shows the KART event can be handled gracefully and doesn't need to cause packet loss. So now the question becomes, why can one workstation recover from a KART event in under 100ms, while another workstation is forced to wait until the aforementioned 1 minute cool-down expires? I tried shifting the workstation at issue over to the working PC's VPN endpoint. Maybe the original VPN endpoint doesn't like early rekeys?
If that doesn't work, I'll try one of the alternative ports. Maybe the ISP will handle traffic on UDP 500, 1701, or 4500 better. |
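For anyone who wants to reproduce the KART/RKTO counts, a rough sketch of the counting step in Python follows. It assumes each event appears on its own log line containing the text "KEEPALIVE + REKEY_TIMEOUT" or "REKEY_TIMEOUT"; the function names are hypothetical, and this is not the monitoring tool used above.

```python
from collections import Counter
from typing import Iterable

# Substrings assumed to identify each event type in a log line.
KART_MARKER = "KEEPALIVE + REKEY_TIMEOUT"
RKTO_MARKER = "REKEY_TIMEOUT"


def count_events(lines: Iterable[str]) -> Counter:
    """Count KART and standalone RKTO events in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        # Check KART first, because its marker also contains the RKTO marker.
        if KART_MARKER in line:
            counts["KART"] += 1
        elif RKTO_MARKER in line:
            counts["RKTO"] += 1
    return counts


def per_day(count: int, days: float) -> float:
    """Convert a raw event count into an events-per-day rate."""
    if days <= 0:
        raise ValueError("days must be positive")
    return count / days
```

Passing an open log file to `count_events` and dividing by the number of days covered gives per-day rates that can be compared across workstations.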
It's been a week, and we've had no more traffic dropouts. Changing the VPN endpoint seems to have done the trick. KEEPALIVE + REKEY_TIMEOUT (KART) events are down from 11.34/day to 1.16/day. More importantly, REKEY_TIMEOUT (RKTO) events are down from 147.1/day to 0/day. So maybe the issue wasn't with boringtun after all... Could be something on the Cloudflare VPN endpoint side. |
This issue seems to be related to this issue in Nord's fork of this repo. It has been resolved there. |
Nice find @cowlicks! I'm still seeing large numbers of "KEEPALIVE + REKEY_TIMEOUT" KART events and "REKEY_TIMEOUT" RKTO events, as shown below. The errors do not correlate with a particular VPN Endpoint, as I had speculated. The vast majority (83%) of our WARP clients are on version 24.3.409.0. Most people don't report connectivity issues, but a few do. Would be great if we could get Nord's PR implemented in Cloudflare's repo, as was mentioned here. Worst case scenario, this would go a long way to de-clutter the logs and help us pinpoint why some users have connectivity issues, while others don't. |
Preface:
Setup:
Issue:
I connect the WARP client successfully and everything works (routes to my private network as well as internet routes). Eventually, sometimes after minutes and sometimes after hours, all network connectivity dies. In this state I cannot access anything, including DNS and the internet. In the below snippet of boringtun.log, the connectivity dies around 14:05:32 (timestamp collected by pinging google.com every second until the issue is observed).