Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stub_status counter leak when killing old workers #99

Open
2 tasks done
mookie- opened this issue Sep 6, 2024 · 3 comments
Open
2 tasks done

stub_status counter leak when killing old workers #99

mookie- opened this issue Sep 6, 2024 · 3 comments
Assignees
Labels
documentation Improvements or additions to documentation

Comments

@mookie-
Copy link

mookie- commented Sep 6, 2024

Environment

Include the result of the following commands:

  • nginx -V
nginx version: nginx/1.27.1
built by gcc 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
built with OpenSSL 3.0.2 15 Mar 2022
TLS SNI support enabled
configure arguments: --prefix=/etc/nginx --sbin-path=/usr/sbin/nginx --modules-path=/usr/lib/nginx/modules --conf-path=/etc/nginx/nginx.conf --error-log-path=/var/log/nginx/error.log --http-log-path=/var/log/nginx/access.log --pid-path=/var/run/nginx.pid --lock-path=/var/run/nginx.lock --http-client-body-temp-path=/var/cache/nginx/client_temp --http-proxy-temp-path=/var/cache/nginx/proxy_temp --http-fastcgi-temp-path=/var/cache/nginx/fastcgi_temp --http-uwsgi-temp-path=/var/cache/nginx/uwsgi_temp --http-scgi-temp-path=/var/cache/nginx/scgi_temp --user=nginx --group=nginx --with-compat --with-file-aio --with-threads --with-http_addition_module --with-http_auth_request_module --with-http_dav_module --with-http_flv_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_mp4_module --with-http_random_index_module --with-http_realip_module --with-http_secure_link_module --with-http_slice_module --with-http_ssl_module --with-http_stub_status_module --with-http_sub_module --with-http_v2_module --with-http_v3_module --with-mail --with-mail_ssl_module --with-stream --with-stream_realip_module --with-stream_ssl_module --with-stream_ssl_preread_module --with-cc-opt='-g -O2 -ffile-prefix-map=/data/builder/debuild/nginx-1.27.1/debian/debuild-base/nginx-1.27.1=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fPIC' --with-ld-opt='-Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro -Wl,-z,now -Wl,--as-needed -pie'
  • uname -a
Linux lb 5.15.0-100-generic #110-Ubuntu SMP Wed Feb 7 13:27:48 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux 

Description

Describe the bug in full detail including expected and actual behavior.
Specify conditions that caused it. Provide the relevant part of nginx
configuration and debug log.

  • The bug is reproducible with the latest version of nginx
  • The nginx configuration is minimized to the smallest possible
    to reproduce the issue and doesn't contain third-party modules

Hello,

we've started killing (sending SIGTERM) "old" nginx workers (nginx: worker process is shutting down) as we have regular configuration changes and a lot of websocket connections.

Since we do this, the counters from stub_status are incorrect.

nginx status

> curl localhost/nginx_status
Active connections: 65369
server accepts handled requests
 1042173178 1042173178 5035465167
Reading: 0 Writing: 31968 Waiting: 5356

Adding up Writing and Waiting it's just 37324 instead of the 65369 "active connections". But even 37324 is too high. The correct number should be around this:

> ss | grep https | wc -l
7327

It's reproducable e.g. using echo.websocket.org:

nginx configuration

map $http_upgrade $connection_upgrade {
	default upgrade;
	''      close;
}

server {
	listen 1234;
	location / {
		proxy_set_header  Host echo.websocket.org;
		proxy_ssl_server_name on;
		proxy_ssl_name echo.websocket.org;
		proxy_http_version 1.1;
		proxy_set_header Upgrade $http_upgrade;
		proxy_set_header Connection $connection_upgrade;

		proxy_pass https://echo.websocket.org:443;
	}
	location = /basic_status {
		    stub_status;
	}
}

Test

Open 2 tabs in the browser with the URL: http://localhost:1234/.ws
Now there are 3 active connection (one connection is the request to /basic_status):

# curl localhost:1234/basic_status
Active connections: 3
server accepts handled requests
 4 4 4
Reading: 0 Writing: 3 Waiting: 0

After that, reload the nginx process and you can see a nginx: worker process is shutting down process:

# systemctl reload nginx

# ps aux | grep [n]ginx
root     2233850  0.0  0.0  55372  5652 ?        Ss   22:29   0:00 nginx: master process /usr/sbin/nginx -c /etc/nginx/nginx.conf
nginx 2233851  0.0  0.0  55896  6044 ?        S    22:29   0:00 nginx: worker process is shutting down
nginx 2233940  0.0  0.0  55868  5400 ?        S    22:30   0:00 nginx: worker process
nginx 2233941  0.0  0.0  55868  5240 ?        S    22:30   0:00 nginx: worker process

# kill 2233851

After we've killed the old process the websocket client in the browser will reconnect. After that we get two additional connection from stub_status even as the old connections are gone:

# curl localhost:1234/basic_status
Active connections: 5
server accepts handled requests
 7 7 7
Reading: 0 Writing: 5 Waiting: 0
@mookie- mookie- added the bug Something isn't working label Sep 6, 2024
@bavshin-f5
Copy link
Member

Only the worker processes are aware of the connections and can decrement the counters on connection closure. As soon as you terminate the process, all the connection state it held is gone, and we're no longer able to guarantee a valid state of the counters.

Consider using worker_shutdown_timeout instead.

@mookie-
Copy link
Author

mookie- commented Sep 9, 2024

Thank you for the feedback. According to the documentation I thought it would be ok:

Individual worker processes can be controlled with signals as well, though it is not required. The supported signals are:
TERM, INT | fast shutdown
[...]

I assumed sending a TERM will behave similar as when the worker shutdown timeout happens.

@bavshin-f5
Copy link
Member

I assumed sending a TERM will behave similar as when the worker shutdown timeout happens.

The documentation says that SIGTERM and SIGINT invoke fast shutdown but doesn't explain the difference between the "fast" and the "graceful" shutdown procedures.

Graceful shutdown closes the idle connections and waits until the remaining clients are served.
The worker_shutdown_timeout timer only runs during the graceful shutdown, assumes that it's fine to delay the exit and calls the appropriate close methods for the remaining connections. That includes updating the stats and sending the necessary protocol packets (GOAWAY frame for HTTP/2 and HTTP/3).

Fast shutdown does a very minimal set of things before exiting: it runs exit_process handlers for the configured modules, logs a message and that's it. Notably, that does not include closing the connections, sending any packets or doing anything else that may block or delay the exit. All the remaining cleanup is left to the OS.
I don't think we want to change this behavior, but clarifying the consequences in the documentation could be useful.

@Maryna-f5 Maryna-f5 added documentation Improvements or additions to documentation and removed bug Something isn't working labels Oct 3, 2024
@Maryna-f5 Maryna-f5 transferred this issue from nginx/nginx Oct 31, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

4 participants