Step-CA fails on renewal #1799
Unanswered
MauriceMossIT
asked this question in
Q&A
Replies: 1 comment
-
Can anyone help with this?
-
Hello,
Issue
I recently set up a new Step-CA server to work with a load balancer for ACME certificates using the HTTP-01 challenge. First-time enrollments work well. However, any attempt to enroll or renew a certificate after that fails. I had this issue last year but was never able to resolve it. I have better information now, so I wanted to revisit it.
The error in the step-ca logs on the server is: response="{"type":"urn:ietf:params:acme:error:badNonce","detail":"Unacceptable anti-replay nonce"}"
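For context on the error above: RFC 8555 (section 6.5) says a badNonce error response must carry a fresh Replay-Nonce header, and that the client should retry the request with that nonce. The following is a minimal sketch of that retry loop, not how acme.sh or step-ca actually implement it. The `send` callable and the `post_with_nonce_retry` name are hypothetical, and `send` is assumed to re-sign the request with whatever nonce it is given.

```python
import json

BAD_NONCE = "urn:ietf:params:acme:error:badNonce"


def post_with_nonce_retry(send, nonce, max_retries=3):
    """Call send(nonce); on badNonce, retry with the nonce the server supplied.

    `send` is a hypothetical callable returning (status, headers, body), where
    headers is a dict keyed by "Replay-Nonce" and body is the raw JSON text.
    """
    for _ in range(max_retries + 1):
        status, headers, body = send(nonce)
        if status < 400:
            return status, headers, body
        try:
            problem = json.loads(body)
        except ValueError:
            problem = None
        is_bad_nonce = isinstance(problem, dict) and problem.get("type") == BAD_NONCE
        fresh = headers.get("Replay-Nonce")
        if not is_bad_nonce or fresh is None:
            # Some other error, or no fresh nonce offered: give up.
            return status, headers, body
        nonce = fresh  # retry the same request with the fresh nonce
    return status, headers, body
```

If a client retries this way and still gets badNonce on a reused connection, that would point away from simple nonce staleness and toward the connection handling described in the captures.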
The load balancer originally worked for renewals when it was on an older code version. The newer version fails. The only change on the load balancer was an update to a newer version of the acme.sh script which I think might play a role.
Testing
Here is a snip of an enrollment, then another forced enrollment afterwards.
In this pcap, you can see the first request effectively complete at packet 6; however, the server doesn't close the connection and sends a keep-alive every 15 seconds. Packet 11 shows the start of a new enrollment. There's no TCP handshake here because the existing kept-alive connection is reused first.
This fails, and 30 seconds later the server sends a FIN/ACK. This is where I think the acme.sh version difference could be impactful. Note that between the FIN/ACK sent by the server and the next SYN sent, there is ~2.8 seconds of time during which the load balancer sends a RST/ACK.
After the RST and SYN, the connection continues and the certificate is successfully enrolled again (using --force).
Here is a snip from the load balancer on newer code versions with an updated acme.sh version.
The initial enrollment of a certificate with a newly started step-ca instance works just the same as the previous one. The longer duration between enrollments was simply a difference of when I triggered a new enrollment in this capture versus the previous one.
Looking at packet 21, we see the 2nd attempt at enrollment. Just like in the previous capture, 30 seconds later the server sends a FIN/ACK. However, this time the server sends a SYN ~1.4 seconds after the FIN/ACK. This is faster than the older version tests which I assume could be related to the updated acme.sh version. The load balancer sends a RST/ACK 2 seconds after the FIN (same amount of time in both captures). Then a new FIN/ACK is sent by the server and the enrollment fails.
Some additional testing was done from the load balancer using a configuration that made the load balancer send a FIN/ACK immediately after the server sent its FIN/ACK for the 2nd enrollment attempt. This occurs before the 2nd SYN and prevents the RST/ACK from ever being sent. You can see this in packet 18. This attempt also fails after the first enrollment.
The last bit of testing I did was to use tcpkill on the server side to send a RST every time the connection completes. Doing this, I was able to consistently enroll certificates without issue.
Based on the captures and testing I've done, it seems like the server requires a RST to occur before a new certificate can be enrolled. Otherwise the server needs to be restarted each time. It's possible that immediately closing the connection with a FIN/ACK after each certificate could also resolve this; however, I was only able to test with RSTs and not with FIN/ACKs.
Any and all tests can be replicated, and I am more than happy to discuss/demonstrate this behavior over a call as well.
Questions
Is there any reason the step-ca server keeps the connection alive rather than closing it?
Is there a possible solution to this issue on the step-ca side?
This certainly does not appear to be an issue with the load balancer, and the load balancer works flawlessly with the public Let's Encrypt servers.