Packets sent to primary when both primary and secondary destinations are unhealthy #77
@brian-pane 👋 Thanks for writing this issue! Yes, your understanding here is exactly correct. We discussed a bit of the design around this in our blog post announcing the director - we've historically found that multiple-machine failure was rare enough that having 2 servers per hash entry was enough (though obviously more would be better). Indeed, we always send packets to the "most healthy" of the 2 servers in the hope that one succeeds, but in the case where both are down the primary is selected and that may not work.

We initially did all this with a pattern that was less extensible and didn't support more than 2 servers, but by the open source release everything except the forwarding table supported N servers rather than just 2. The only real limit on N is the size allowable for the encapsulated GUE packet - on our networks we run with jumbo frames, but internet traffic is <1500 bytes, so there's plenty of room for more.

Ideally in this case, the list of servers would be ordered, then all unhealthy servers would be moved to the end of the list: the highest-priority healthy server gets first dibs, followed by any other healthy servers, finally followed by the unhealthy servers in order (in the hope they might actually still be working and complete connections).

We've also talked about adding failure-domain tags to servers so we can avoid 2 machines in the same row sharing a failure domain (for example, the same rack). This makes whole-row failure less likely, but doesn't change N. It also adds some complexity around how fill/drain works and makes the per-server weights a little less random, though, so we haven't implemented a fully complete version yet.

I'd be 💯 on board with patches supporting changing the size of the rows in the forwarding table, as long as it's configurable and there's generally a migration path (maybe versioning the forwarding table file or similar).
@theojulienne wondering if it would be simpler if a row had a status, and in the case of an entire row being dead, we bumped the connection to the next available row (defining 'next row' as a modulo-max-rows operation so that it's circular)? I can see that reviving the dead nodes/dead row would be trickier with the above, though.
Yea, that's the problem - if you send traffic to a wider range of nodes without always falling back to them, or without a quiesce period (which adds more complexity), you end up causing all connections to disconnect when machines become healthy again. That's why increasing the table rows to N entries and always sending packets with those N alternatives (ordered by chance of success) is probably the simplest approach.
@theojulienne It would still be an availability question. It's up for debate whether to solve this problem at the cost of extra complexity or to return an error (5XX) to the user; that depends on how often we face it. The extra cost also needs to be evaluated. The following is a possibility if we are already able to determine when we are starting a new connection AND when we are terminating a connection at the director. Let's say we had this:
Then,
The movement from UNAVAILABLE to READY would likely be handled by a thread within the director interfacing with the healthcheck.
The glb-director is stateless: it doesn't know when connections are created or destroyed and doesn't maintain any TCP connection state, which is part of the design that allows BGP route changes to direct connections to different glb-director nodes with zero impact. This is discussed a bit in "why not LVS" and "how no state is stored on director". I think adding extra servers to the forwarding table rows is a significantly simpler change, in that it only adds extra potential hops and doesn't change any of the design constraints of the system, and it buys all the benefits you'd potentially get from un-isolating rows (essentially, adding more potential servers to work around failures).
Hi Theo
I've already got this done and tested manually. I'm in the process of updating the unit tests; following that I'll create a PR for this. Regards
The rest is moot :) My bad. I was excited, then blinded by what seemed to me like a simple solution of using the next row in the forwarding table. I wrongly assumed that the director knows when new connections are being made (with no further state preservation), so that they could be directed at the secondary (which becomes the new primary) when the primary is drained, and I missed that the redirection is managed via iptables forwarding at the proxies themselves.
While doing some testing, @ravisinghsfbay and @robloxrob and I ran into a case where glb-director forwards ingress packets to a known unhealthy destination. We're interested in submitting a patch to fix this case, but first I'm hoping to get a quick confirmation that the root-cause analysis below is correct.
The test setup was:
Expected result:
Observed result:
I verified that glb-healthcheck detected the failed healthchecks and generated a `forwarding_table.checked.bin` that properly identified the affected servers as state=1 and health=0.

From a quick read through the code, I think the problem is in the construction of the forwarding table in `glb-director/cli/main.c`. There's some code that checks for the case where the primary is unhealthy and the secondary is healthy. But if both the primary and the secondary are unhealthy, that entry in the forwarding table ends up with two unhealthy destinations in it. In the data plane, if an incoming packet hashes to that entry, glb-director then sends it to the (unhealthy) primary, with a GUE wrapper that lists the (unhealthy) secondary.

Assuming that analysis is correct, the first idea that springs to mind is to increase the number of destination IPs stored in each routing table slot from 2 to a configurable N. That wouldn't prevent the problem altogether, but it would reduce its probability as N increased.