bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use) #4370
Comments
Hi @MarkTopping, thanks for reporting! Be sure to check out the docs and the Contributing Guidelines while you wait for a human to take a look at this 🙂 Cheers! |
Hi Jason, thanks for responding. The deployment manifest:

```yaml
# Source: nginx-ingress/templates/controller-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingresscontroller-nginx-ingress
  namespace: ingress
  labels:
    helm.sh/chart: nginx-ingress-0.17.1
    app.kubernetes.io/name: ingresscontroller-nginx-ingress
    app.kubernetes.io/instance: ingresscontroller
    app.kubernetes.io/version: "3.1.1"
    app.kubernetes.io/managed-by: Helm
spec:
  replicas: 4
  selector:
    matchLabels:
      app.kubernetes.io/name: ingresscontroller-nginx-ingress
      app.kubernetes.io/instance: ingresscontroller
      app: ingresscontroller-nginx-ingress
  template:
    metadata:
      labels:
        app: ingresscontroller-nginx-ingress
        app.kubernetes.io/name: ingresscontroller-nginx-ingress
        app.kubernetes.io/instance: ingresscontroller
    spec:
      tolerations:
        - effect: NoSchedule
          key: purpose
          operator: Equal
          value: ingress
      volumes:
        - name: nginx-etc
          emptyDir: {}
        - name: nginx-cache
          emptyDir: {}
        - name: nginx-lib
          emptyDir: {}
        - name: nginx-log
          emptyDir: {}
        - csi:
            driver: secrets-store.csi.k8s.io
            readOnly: true
            volumeAttributes:
              secretProviderClass: ingress
          name: default-ingress
      serviceAccountName: ingresscontroller-nginx-ingress
      automountServiceAccountToken: true
      securityContext:
        seccompProfile:
          type: RuntimeDefault
        fsGroup: 101
      terminationGracePeriodSeconds: 300
      hostNetwork: false
      dnsPolicy: ClusterFirst
      containers:
        - image: "[REDACTED]/nginx/nginx-ingress:3.1.1"
          name: ingresscontroller-nginx-ingress
          imagePullPolicy: "Always"
          ports:
            - name: http
              containerPort: 80
            - name: https
              containerPort: 443
            - name: readiness-port
              containerPort: 8081
          readinessProbe:
            httpGet:
              path: /nginx-ready
              port: readiness-port
            periodSeconds: 5
            timeoutSeconds: 10
            initialDelaySeconds: 10
          resources:
            limits:
              cpu: 1000m
              memory: 500Mi
            requests:
              cpu: 500m
              memory: 500Mi
          securityContext:
            allowPrivilegeEscalation: true
            runAsGroup: 2001
            runAsNonRoot: true
            runAsUser: 101
            seccompProfile:
              type: RuntimeDefault
            readOnlyRootFilesystem: true
            capabilities:
              drop:
                - ALL
              add:
                - NET_BIND_SERVICE
          volumeMounts:
            - mountPath: /etc/nginx
              name: nginx-etc
            - mountPath: /var/cache/nginx
              name: nginx-cache
            - mountPath: /var/lib/nginx
              name: nginx-lib
            - mountPath: /var/log/nginx
              name: nginx-log
            - mountPath: /mnt/keyvault
              name: default-ingress
              readOnly: true
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
          args:
            - -nginx-plus=false
            - -nginx-reload-timeout=60000
            - -enable-app-protect=false
            - -enable-app-protect-dos=false
            - -nginx-configmaps=$(POD_NAMESPACE)/ingresscontroller-nginx-ingress
            - -default-server-tls-secret=ingress/default-tls
            - -ingress-class=nginx
            - -health-status=true
            - -health-status-uri=/nginx-health
            - -nginx-debug=false
            - -v=1
            - -nginx-status=true
            - -nginx-status-port=8080
            - -nginx-status-allow-cidrs=127.0.0.1
            - -report-ingress-status
            - -external-service=ingresscontroller-nginx-ingress
            - -enable-leader-election=true
            - -leader-election-lock-name=ingresscontroller-nginx-ingress-leader-election
            - -enable-prometheus-metrics=false
            - -prometheus-metrics-listen-port=9113
            - -prometheus-tls-secret=
            - -enable-service-insight=false
            - -service-insight-listen-port=9114
            - -service-insight-tls-secret=
            - -enable-custom-resources=false
            - -enable-snippets=true
            - -include-year=false
            - -disable-ipv6=false
            - -ready-status=true
            - -ready-status-port=8081
            - -enable-latency-metrics=false
      initContainers:
        - name: init-ingresscontroller-nginx-ingress
          image: "[REDACTED]/nginx/nginx-ingress:3.1.1"
          imagePullPolicy: "Always"
          command: ['cp', '-vdR', '/etc/nginx/.', '/mnt/etc']
          resources:
            requests:
              cpu: 25m
              memory: 50Mi
            limits:
              cpu: 25m
              memory: 50Mi
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            runAsUser: 101
            runAsNonRoot: true
            capabilities:
              drop:
                - ALL
          volumeMounts:
            - mountPath: /mnt/etc
              name: nginx-etc
  minReadySeconds: 15
```

However, I'm afraid I don't have additional log data. We chose not to scrape them, and this isn't something I can happily replicate without causing reasonable disruption. Should this happen again and I have an opportunity to look for more logs, what would you be after and where would I find them? I provided the container logs, but I'm assuming you mean other logs from an additional source?

During my investigation I inspected traffic logs for the applications which Nginx fronts, and the nginx controller was under load at the time. Additionally, the services it fronts were undergoing scaling (via HPA). You mentioned NIC logs... I need some clarification there - do you mean traffic logs for the NIC on the underlying VM?

I should note that I've since made two changes to the deployment which differ from what I've pasted above... we've set CPU Request == CPU Limit == 1000m and also tripled the memory. From my end this should greatly reduce the likelihood of a recurrence, but obviously that doesn't negate the fact that we've stumbled upon a bug/issue - hence the ticket |
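For reference, a minimal sketch of what the adjusted resources block would look like; the 1500Mi figure is only an assumption based on "tripled the memory" (3 × 500Mi) and is not a value stated in the thread:

```yaml
# Assumed adjusted resources for the ingress controller container:
# CPU request == CPU limit == 1000m, memory tripled (assumed 3 x 500Mi = 1500Mi).
resources:
  limits:
    cpu: 1000m
    memory: 1500Mi
  requests:
    cpu: 1000m
    memory: 1500Mi
```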
Hi @MarkTopping, for the issue of unrelated deployments triggering reloads, we have included a fix in the latest release to stop watching some unrelated resources (https://github.com/nginxinc/kubernetes-ingress/releases/tag/v3.3.0). Can you let me know if it reduces the resources the ingress controller consumes in your cluster? As for the port binding error, I have only managed to get the pod killed under an extreme resource limit, and then it restarted and worked properly once the resources freed up. Can you give me more information on how to reproduce the port binding error? |
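One way to attempt a reproduction along the lines described above (a sketch, assuming a non-production cluster and the deployment/namespace names from the manifest earlier in the thread; the 128Mi value is illustrative) is to squeeze the memory limit until the controller is OOM killed and then check whether the restarted container can bind again:

```sh
# Lower the controller's memory request/limit so a config reload is likely to
# trigger an OOM kill (the request must not exceed the limit, so set both).
kubectl -n ingress set resources deployment/ingresscontroller-nginx-ingress \
  --containers=ingresscontroller-nginx-ingress \
  --requests=memory=128Mi --limits=memory=128Mi

# Watch for OOMKilled restarts and pods stuck Not Ready.
kubectl -n ingress get pods -w

# Compare the logs of the killed container and its replacement for the bind() error.
kubectl -n ingress logs <pod-name> --previous
kubectl -n ingress logs <pod-name>
```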
Hi @MarkTopping, I hope you're doing well. We do see that in your deployment manifest you have [...]. In our testing we saw that, for example, files in [...]. As @haywoodsh mentioned, we have provided an update in release 3.3.0. |
Thanks for your responses @shaun-nx and @haywoodsh; and my apologies for not responding sooner @haywoodsh - honestly it slipped my mind.

Good news that you've found a potential fix. We've rolled out version 3.3.0 into all non-prod environments and I expect we'll progress into production next week.

At my end, since I made the changes to a) increase the resources, and b) reduce the number of worker processes, we've not seen the issue recur even on the old version (3.1.1). Alas, it's going to be very difficult to know whether or not your fix actually remediates the issue without reverting our other changes and trying to force an OOMKill. If I find myself with time to spare I'll give it a go.

Thank you once again for your attention |
Hi @MarkTopping, I've opened a new issue which specifically talks about the behaviour of NGINX when it exits this way: #4604. We'd like to close this issue for now, as the change in 3.3.0 provides measures to prevent NGINX from experiencing an OOM kill in the way you described, and use the above issue to track progress on the specific behaviour of NGINX. |
Untested, but hopefully remediated as per shaun-nx's comments. Thank you |
@MarkTopping this bug has been fixed now in #7121 |
Thank you very much for notifying me! |
Describe the bug
When ingress controller instances are OOM killed, the containers fail (indefinitely) to restart. It seems they are unable to bind to a port, and one assumes that's because the port has not been released by the container that was OOM killed.
Manual intervention in the form of restarting the Pods is required in order to bring the Ingress Controller instances back online.
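To make the workaround concrete, the manual intervention amounts to recreating the affected Pods; a sketch, using the names and labels from the manifest earlier in the thread:

```sh
# Delete the failing pods so the ReplicaSet recreates them...
kubectl -n ingress delete pod -l app=ingresscontroller-nginx-ingress

# ...or roll the whole deployment.
kubectl -n ingress rollout restart deployment/ingresscontroller-nginx-ingress
```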
For added context only... In our case, we think the OOM kill was the result of an 'nginx config reload', which I believe was triggered on the back of an unrelated deployment auto-scaling and the Endpoints therefore being updated. I've checked and I don't think any new Ingress resources were created at the time, but obviously if they had been, that would be another reason for a reload.
From our end we are taking remedial steps by increasing the memory available to the Nginx IC pods and by decreasing the number of worker processes. But this only reduces the likelihood of the OOM Kill occurring....
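For reference, the worker process count can be tuned through the controller's ConfigMap (the one referenced by the -nginx-configmaps argument in the manifest above); a minimal sketch, where the value "2" is illustrative rather than the actual setting used:

```yaml
# Sketch: reducing NGINX worker processes via the ingress controller ConfigMap.
# "worker-processes" defaults to "auto" (one worker per CPU core).
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingresscontroller-nginx-ingress
  namespace: ingress
data:
  worker-processes: "2"
```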
I'm raising this issue because it should be possible for the nginx containers to restart without manual intervention if they are subject to an OOM Kill event.
To Reproduce
Steps to reproduce the behavior:
1. In our case we had 4 instances of nginx running, each with a request/limit of 500Mi; using a lower number of instances and less memory would make it easier to replicate the issue.
2. See the Nginx pods become 'Not Ready'.
3. View the logs for any failing Pod.
4. See the error: bind() to unix:/var/lib/nginx/nginx-config-version.sock failed (98: Address already in use)
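Since /var/lib/nginx is an emptyDir volume (see the manifest above) that presumably survives the container being OOM killed, one way to confirm the suspected cause is to check whether the old socket file is still present in the replacement container; a sketch, assuming the container stays up long enough to exec into:

```sh
# Check for a stale nginx-config-version.sock left over from the killed container.
kubectl -n ingress exec <failing-pod-name> -- ls -l /var/lib/nginx/
```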
Expected behavior
I expect that when the nginx ingress controller container is restarted after being OOM killed, it should be able to bind to the port(s) it requires and start up successfully, without the need for a human to restart the Pod(s).
Your environment