Unix domain socket not cleaned up if the router process is not terminated cleanly #1448
Comments
Hi, yes, this is something that has come up before. IIRC we have generally discounted simply trying to remove the socket at startup, for $reasons. However, here are some other things you may consider:
If you are restarting the docker container when this happens then if you create the UNIX domain socket under a tmpfs(5) file-system (like
For example, to make the router process exempt from OOM-kills:
# echo -1000 >/proc/$(pidof "unit: router")/oom_score_adj
Maybe also include the See here for more details.
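The same adjustment can be sketched programmatically. This is a minimal Python sketch, not anything Unit ships: the helper name `set_oom_score_adj` and its `proc_root` parameter are hypothetical, and `proc_root` exists only so the function can be exercised against a scratch directory. The kernel accepts values from -1000 to 1000, and lowering the value normally requires root (or `CAP_SYS_RESOURCE`).

```python
import os


def set_oom_score_adj(pid: int, value: int, proc_root: str = "/proc") -> None:
    """Write an OOM score adjustment for `pid`.

    Hypothetical helper for illustration. -1000 exempts the process from
    OOM-kills; positive values make it a more likely victim.
    """
    if not -1000 <= value <= 1000:
        raise ValueError("oom_score_adj must be between -1000 and 1000")
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))
```

Calling `set_oom_score_adj(router_pid, -1000)` as root would mirror the shell command above, where `router_pid` is whatever PID the router process currently has.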
These are like virtual socket files and will be automatically cleaned up. Specify it like This may be an option if your client supports connecting to such a thing, e.g. with the See the unix(7) man-page for more details.
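To illustrate the abstract-socket behaviour described above, here is a minimal Python sketch (Linux only). The helper `abstract_socket_roundtrip` and the socket name passed to it are made up for illustration. Per unix(7), an abstract address is a name whose first byte is NUL. The sketch binds, exchanges a few bytes, closes everything, and then rebinds the same name with no cleanup step, since nothing was ever created on the file system.

```python
import socket


def abstract_socket_roundtrip(name: str, payload: bytes) -> bytes:
    """Send `payload` over an abstract Unix socket and return what arrived."""
    # A leading NUL byte selects the abstract namespace (Linux-specific).
    addr = b"\0" + name.encode()

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(addr)
        server.listen(1)
        client.connect(addr)          # queued in the listen backlog
        conn, _ = server.accept()
        try:
            client.sendall(payload)
            data = conn.recv(len(payload))
        finally:
            conn.close()
    finally:
        client.close()
        server.close()

    # The name is released once the listening socket is closed, so an
    # immediate rebind succeeds without unlinking anything.
    again = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        again.bind(addr)
    finally:
        again.close()
    return data
```

Because the kernel drops the name when the last reference to the socket goes away, a SIGKILL of the listening process would not leave a stale entry behind, which is the property being suggested here.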
Thanks for the suggestions! We're considering using abstract sockets but they're unfortunately not supported in Nginx, or at least not well, so it looks like we would have to patch support for those into it ourselves as well, which is not ideal. I noticed there are a few patches floating around in mailing lists and Openresty issues... Regarding restarting the container/removing the socket on startup to fix the issue: the app's environment is actually rebuilt entirely if the container is restarted, so the socket does get cleaned up in that case. However, when the described issue occurs, Unit ends up in a bad state but continues running. We would actually prefer if it completely exited, as that would then let the app recover automatically. I'm also wondering what contributes to the high memory usage of the router: does it increase with the number of open/pending connections or are there other factors (our config is relatively simple)? And do you know if the
It would probably help to know what is actually causing the OOM situation. If it's something unrelated to Unit, then you could simply exempt Unit (the whole thing, all the unit processes) from being OOM-killed. If it's an application running under Unit, then increase its likelihood of being OOM-killed. If it is, Unit will re-spawn it. OK, ignore the below, it was using a regular TCP socket.
Unit may buffer request data. You'll see this when running it as a proxy. If you try to load a large file into an application, it may get buffered. The exact behaviour, and where the buffering occurs, will be somewhat dependent on the language module being used (and also on the application itself).
I'll paste one example of OOM killer's output when it killed the router process below. OOM killer was invoked because we have a memory limit (3GB) set on the Docker container that unit is running in. There's nothing else running in the container and OOM killer wasn't invoked due to high memory pressure on the host. It killed the process with PID 889451 (unitd router) which was using approximately 2GB of memory when it was killed. The worker processes (PHP application) were using very little memory in comparison, so it's unlikely that those caused the issue. I'm assuming that the router process simply started using more memory because of a traffic spike that occurred around the same time.
We're not as concerned with the OOM events, we can prevent, or at least limit the frequency of those by either raising memory limits (not ideal) or, as you suggested, by adjusting OOM scores for router processes. But it would still be good to improve unit's recovery in events like this - we've not encountered any other issues so far but it's likely not impossible for the router process to crash or get killed in some other way. Alternatively it'd also be useful if we could limit how much memory the router process is allowed to use, which would make it easier to predict what kinds of limits we can place on containers. Here's a minimal reproducible example for the
Hello,
We're running into an issue with nginx-unit, which is mostly caused by OOM-killer. Unit is running in a Docker container, and we have fairly strict memory and CPU constraints configured for it, which we don't want to remove. If a process in the container tries to allocate more memory than cgroup limits allow, OOM killer steps in and sends a SIGKILL signal to a (possibly random, haven't confirmed) process in the container/cgroup. If it kills the "router" process, then unit is unable to recover from that, returning the
bind(\"unix:/tmp/app-listener.unit.sock\") failed (98: Address already in use)
error when it starts up again (previously discussed in #669 and a few other issues). It'd be great if unit was able to recover gracefully from failures like this. We're currently testing the following patch which removes the socket if it already exists, before binding to it. This does work but not sure if it's a good idea:
Reproduction steps/example (it's also reproducible on 1.33.0):
I'm wondering if there's a better workaround for this issue and/or if this is a bug that you're open to addressing in the future?
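For readers unfamiliar with the failure mode, the stale-socket error and the remove-before-bind idea can be reproduced outside of Unit. The following is a minimal Python sketch under stated assumptions: it is not Unit's code and not the patch mentioned above, and the helper names (`bind_unix`, `socket_is_stale`, `bind_with_cleanup`, `demo`) are hypothetical. Closing a listening socket without unlinking its path stands in for a SIGKILL of the router. The cleanup variant only removes the file if a probe connection is refused, so a live listener is left alone; there is still a race between the probe and the unlink, which may be one reason unconditional removal at startup is treated with caution.

```python
import errno
import os
import socket
import tempfile


def bind_unix(path: str) -> socket.socket:
    """Bind and listen on a path-based Unix socket."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        s.bind(path)
    except OSError:
        s.close()
        raise
    s.listen(1)
    return s


def socket_is_stale(path: str) -> bool:
    """True if the socket file exists but nothing is accepting on it."""
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        return True       # file is there, no listener behind it
    except FileNotFoundError:
        return False      # nothing to clean up
    else:
        return False      # a live process owns this socket
    finally:
        probe.close()


def bind_with_cleanup(path: str) -> socket.socket:
    """Bind, removing a leftover socket file only if it is stale."""
    try:
        return bind_unix(path)
    except OSError as e:
        if e.errno != errno.EADDRINUSE or not socket_is_stale(path):
            raise
    os.unlink(path)
    return bind_unix(path)


def demo():
    """Return the errno seen when rebinding over a leftover socket file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "app.sock")
        bind_unix(path).close()   # simulates an unclean exit: file remains
        try:
            bind_unix(path).close()
            plain_errno = None
        except OSError as e:
            plain_errno = e.errno
        bind_with_cleanup(path).close()   # recovers by removing the stale file
        return plain_errno
```

Here `demo()` is expected to report `EADDRINUSE` for the plain rebind, matching the "Address already in use" error quoted in the report, while the second bind goes through after the stale file is removed.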