Ingress nginx scaling to max due to memory #12167

Open
sivamalla42 opened this issue Oct 12, 2024 · 22 comments
Labels
kind/support Categorizes issue or PR as a support question. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-priority needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@sivamalla42

Hi All,

We are observing strange behaviour with the ingress-nginx pods in production: the pods have started scaling to max due to memory usage.
EKS: 1.29

helm list -n ingress-nginx
NAME           NAMESPACE      REVISION  UPDATED                               STATUS    CHART                APP VERSION
ingress-nginx  ingress-nginx  1         2024-05-01 11:27:32.802401 +0530 IST  deployed  ingress-nginx-4.8.3  1.9.4

We are not sure why this suddenly started, and we have no clue how to fix it.
Even when we increase the number of pods, the memory still gets consumed and the pods scale up again.

image

image

Any help is very much appreciated.

Thanks
Siva

@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority labels Oct 12, 2024
@longwuyuan
Contributor

You can check the logs of the controller pods and hardcode the number of workers

@longwuyuan
Contributor

/kind support

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Oct 12, 2024
@tao12345666333
Member

Can you observe your request traffic? Have you encountered more requests or are there many large requests?

@sivamalla42
Author

The controller pod logs contain the request data, but nothing specific such as failures or OOM errors.

@sivamalla42
Author

sivamalla42 commented Oct 12, 2024

@tao12345666333, we do not observe any abnormal traffic coming into the ingress layer. It looks like regular traffic.

@sivamalla42
Author

You can check the logs of the controller pods and hardcode the number of workers

@longwuyuan, can you please elaborate on what needs to be done to hardcode the number of workers?

@sivamalla42
Author

sivamalla42 commented Oct 12, 2024

Also sharing the network-level in and out metrics.

network

@longwuyuan
Contributor

You can check the logs of the controller pods and hardcode the number of workers

@longwuyuan, can you please elaborate on what needs to be done to hardcode the number of workers?

#8166

@sivamalla42
Author

You can check the logs of the controller pods and hardcode the number of workers

@longwuyuan, can you please elaborate on what needs to be done to hardcode the number of workers?

#8166

@longwuyuan, I see the following in the running ingress pod:
worker_processes 16;

Is this value fine to continue with?

I tried manually reducing worker_processes to 8 on a few nodes and observed that memory consumption appeared to drop.

Please suggest.
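
For reference, "hardcoding the number of workers" is normally done through the controller ConfigMap rather than by editing nginx.conf inside a pod: ingress-nginx has a `worker-processes` ConfigMap option. The sketch below assumes the ConfigMap is `ingress-nginx/ingress-nginx-controller` (the name that shows up in the events later in this thread) and uses 8 workers, the value tried above. The helper names are hypothetical, and the script only builds and prints the command; it does not touch the cluster.

```python
import json
import shlex


def worker_processes_patch(workers: int) -> dict:
    """Build a JSON merge patch that pins the NGINX worker count."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # ConfigMap values are always strings, so the number is stringified.
    return {"data": {"worker-processes": str(workers)}}


def kubectl_patch_command(
    workers: int,
    name: str = "ingress-nginx-controller",  # assumed ConfigMap name
    namespace: str = "ingress-nginx",        # assumed namespace
) -> str:
    """Render a kubectl command that applies the patch (not executed here)."""
    patch = json.dumps(worker_processes_patch(workers))
    parts = [
        "kubectl", "patch", "configmap", name,
        "-n", namespace,
        "--type", "merge",
        "-p", shlex.quote(patch),
    ]
    return " ".join(parts)


if __name__ == "__main__":
    print(kubectl_patch_command(8))
```

On a Helm-managed install like this one, the same key can instead be set under `controller.config` in the chart values so that it survives the next `helm upgrade`. Edits made directly to nginx.conf inside a running pod are likely to be lost as soon as the controller regenerates the file.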

@Gacko
Member

Gacko commented Oct 12, 2024

There are a few things coming into play here.

The static memory consumption of the Ingress NGINX Controller partially depends on your cluster size, so nodes and pods, and amount of Ingress resources.

In the past I observed Ingress NGINX Controller pods to consume up to 4 GB of memory right after startup because the cluster contained both a lot of nodes/pods and around 2,500 Ingress resources.

This memory consumption still does not take actual traffic into account. It is a design flaw of our current implementation: the control plane, which consumes the memory for internal operations, runs in the same container as the data plane, which is actually doing the heavy lifting.

If you now use HPA to scale your deployment and expect it to do so depending on the actual load produced by traffic, you might hit your target average memory utilization just with the static data produced by what your environment looks like (again, the number of nodes, pods and Ingresses influences this).

This can especially become a problem when you start with resource and HPA settings for a smaller setup and then slowly grow to the aforementioned point.
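
To illustrate the point about static memory and HPA, here is a minimal sketch using the replica rule documented for the Horizontal Pod Autoscaler, `desired = ceil(current * currentUtilization / targetUtilization)`, with its default tolerance of 0.1. All numbers (static memory per pod, total traffic memory, request, target, replica bounds) are made-up assumptions, and the model ignores scale-up rate limits and stabilization windows.

```python
import math


def desired_replicas(current, avg_utilization, target_utilization, tolerance=0.1):
    """HPA rule: ceil(current * ratio), with no change inside the tolerance band."""
    ratio = avg_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current
    return math.ceil(current * ratio)


def simulate(static_mib, traffic_mib_total, request_mib, target,
             start, max_replicas, rounds=20):
    """Replica counts over time when every pod carries the same static memory."""
    replicas = start
    history = [replicas]
    for _ in range(rounds):
        # Traffic memory is shared across pods; static memory is paid per pod.
        per_pod_mib = static_mib + traffic_mib_total / replicas
        utilization = per_pod_mib / request_mib
        wanted = desired_replicas(replicas, utilization, target)
        nxt = min(max_replicas, max(1, wanted))
        if nxt == replicas:
            break
        replicas = nxt
        history.append(replicas)
    return history


if __name__ == "__main__":
    # Static memory alone is above the 80% target: adding pods never helps.
    print(simulate(static_mib=1900, traffic_mib_total=2000, request_mib=2048,
                   target=0.8, start=4, max_replicas=48))
    # Small static footprint: the deployment settles well below the maximum.
    print(simulate(static_mib=500, traffic_mib_total=2000, request_mib=2048,
                   target=0.8, start=4, max_replicas=48))
```

In the first case each new pod brings its own static memory, so the average utilization never drops below the target and the replica count runs to the maximum, which matches the symptom described in this issue.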

Is the actual memory consumption this big right after pod startup or does it grow with time? The former would confirm my assumption while the latter could be caused by a memory leak.

For the former you will probably need to tweak your resource requests and/or HPA settings. Sadly we cannot overcome this design flaw at the moment, but we are planning to split the controller into a control plane and a data plane in the future.

For the latter I'd recommend updating to the latest stable release of our controller first, if you are not already on it, and then verifying again.

Regards
Marco

@longwuyuan
Contributor

@sivamalla42 since your graph shows the increase started after 9/24, you have no choice but to first look at all the other helpful graphs and correlate them with the log message timestamps. The idea is to find out whether memory increased because of request handling or not.

@sivamalla42
Author

@Gacko,
We are currently on EKS 1.29.
ingress-nginx chart: ingress-nginx-4.8.3, app version: 1.9.4.
Which version would you suggest upgrading to?

@Gacko
Member

Gacko commented Oct 13, 2024

Hey,

sorry, I missed this information in your initial issue description.

Well, at best you'd upgrade to v1.11.3. But it would be interesting to know whether the memory consumption rises over time or is high from the very beginning.

Regards
Marco

@sivamalla42
Author

@Gacko, the pods consume memory gradually over time. When they are restarted, it takes a while for memory usage to build up again. But when we add more pods, they start consuming memory right away.
We would like to upgrade, but rather than going straight to the latest version (v1.11.3) and running into new issues, we would prefer to upgrade to a later version in v1.10.x. Please suggest which one.

@Gacko
Member

Gacko commented Oct 14, 2024

Hello,

But when we add more pods, they start consuming memory right away.

This sounds like your cluster is simply big, and Ingress NGINX is therefore consuming a comparably large amount of static memory.

v1.10.x is out of support. You can of course just use v1.10.5, but this is up to you. We cannot make recommendations about versions to use other than the latest stable one.

Regards
Marco

@toredash
Contributor

Are you using rate limits?

@sivamalla42
Author

Are you using rate limits?

Nope, we have not set any rate limits.

@sivamalla42
Author

image

I tried to check whether any specific requests or open requests were consuming the memory. But looking at open connections and new connections, the current request volume is much lower than what we observed before this issue started. As this is production, we want to analyze it and find out whether any pattern is causing this.

Also, why did this behaviour start all of a sudden? We are trying to understand whether it is purely down to ingress-nginx or whether our application requests are causing this behaviour at the ingress.

@sivamalla42
Author

sivamalla42 commented Oct 23, 2024

@Gacko,
I was looking at the reloads that happened during today's scaling:

47m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-hskpj      Created container controller
47m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-hskpj      Started container controller
47m         Normal   RELOAD              pod/ingress-nginx-controller-77dfd7769c-hskpj      NGINX reload triggered due to a change in configuration
40m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-lf6f6      Container image "public.ecr.aws/dynatrace/dynatrace-operator:v1.0.1" already present on machine
40m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-lf6f6      Created container install-oneagent
40m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-lf6f6      Started container install-oneagent
40m         Normal   Scheduled           pod/ingress-nginx-controller-77dfd7769c-lf6f6      Successfully assigned ingress-nginx/ingress-nginx-controller-77dfd7769c-lf6f6 to ip-10-200-106-28.ec2.internal
40m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-lf6f6      Container image "registry.k8s.io/ingress-nginx/controller:v1.9.4@sha256:5b161f051d017e55d358435f295f5e9a297e66158f136321d9b04520ec6c48a3" already present on machine
40m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-lf6f6      Created container controller
40m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-lf6f6      Started container controller
40m         Normal   RELOAD              pod/ingress-nginx-controller-77dfd7769c-lf6f6      NGINX reload triggered due to a change in configuration
31m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-m5zcb      Container image "public.ecr.aws/dynatrace/dynatrace-operator:v1.0.1" already present on machine
31m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-m5zcb      Created container install-oneagent
31m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-m5zcb      Started container install-oneagent
31m         Normal   Scheduled           pod/ingress-nginx-controller-77dfd7769c-m5zcb      Successfully assigned ingress-nginx/ingress-nginx-controller-77dfd7769c-m5zcb to ip-10-200-106-14.ec2.internal
31m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-m5zcb      Container image "registry.k8s.io/ingress-nginx/controller:v1.9.4@sha256:5b161f051d017e55d358435f295f5e9a297e66158f136321d9b04520ec6c48a3" already present on machine
31m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-m5zcb      Created container controller
31m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-m5zcb      Started container controller
31m         Normal   RELOAD              pod/ingress-nginx-controller-77dfd7769c-m5zcb      NGINX reload triggered due to a change in configuration
59m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-v2zwh      Container image "public.ecr.aws/dynatrace/dynatrace-operator:v1.0.1" already present on machine
59m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-v2zwh      Created container install-oneagent
59m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-v2zwh      Started container install-oneagent
59m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-v2zwh      Container image "registry.k8s.io/ingress-nginx/controller:v1.9.4@sha256:5b161f051d017e55d358435f295f5e9a297e66158f136321d9b04520ec6c48a3" already present on machine
59m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-v2zwh      Created container controller
59m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-v2zwh      Started container controller
59m         Normal   Scheduled           pod/ingress-nginx-controller-77dfd7769c-v2zwh      Successfully assigned ingress-nginx/ingress-nginx-controller-77dfd7769c-v2zwh to ip-10-200-106-220.ec2.internal
59m         Normal   RELOAD              pod/ingress-nginx-controller-77dfd7769c-v2zwh      NGINX reload triggered due to a change in configuration
18m         Normal   Scheduled           pod/ingress-nginx-controller-77dfd7769c-v9mgq      Successfully assigned ingress-nginx/ingress-nginx-controller-77dfd7769c-v9mgq to ip-10-200-104-222.ec2.internal
18m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-v9mgq      Container image "public.ecr.aws/dynatrace/dynatrace-operator:v1.0.1" already present on machine
18m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-v9mgq      Created container install-oneagent
18m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-v9mgq      Started container install-oneagent
18m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-v9mgq      Container image "registry.k8s.io/ingress-nginx/controller:v1.9.4@sha256:5b161f051d017e55d358435f295f5e9a297e66158f136321d9b04520ec6c48a3" already present on machine
18m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-v9mgq      Created container controller
18m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-v9mgq      Started container controller
18m         Normal   RELOAD              pod/ingress-nginx-controller-77dfd7769c-v9mgq      NGINX reload triggered due to a change in configuration
54m         Normal   Scheduled           pod/ingress-nginx-controller-77dfd7769c-vgwqg      Successfully assigned ingress-nginx/ingress-nginx-controller-77dfd7769c-vgwqg to ip-10-200-104-227.ec2.internal
54m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-vgwqg      Container image "public.ecr.aws/dynatrace/dynatrace-operator:v1.0.1" already present on machine
54m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-vgwqg      Created container install-oneagent
54m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-vgwqg      Started container install-oneagent
54m         Normal   Pulled              pod/ingress-nginx-controller-77dfd7769c-vgwqg      Container image "registry.k8s.io/ingress-nginx/controller:v1.9.4@sha256:5b161f051d017e55d358435f295f5e9a297e66158f136321d9b04520ec6c48a3" already present on machine
54m         Normal   Created             pod/ingress-nginx-controller-77dfd7769c-vgwqg      Created container controller
54m         Normal   Started             pod/ingress-nginx-controller-77dfd7769c-vgwqg      Started container controller
54m         Normal   RELOAD              pod/ingress-nginx-controller-77dfd7769c-vgwqg      NGINX reload triggered due to a change in configuration
31m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     (combined from similar events): Created pod: ingress-nginx-controller-77dfd7769c-86v8b
59m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-cj7vn
59m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-v2zwh
54m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-vgwqg
54m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-8htvc
47m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-hskpj
47m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-fj7lv
40m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-lf6f6
18m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-7ssn4
18m         Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-v9mgq
115s        Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-9lvn9
115s        Normal   SuccessfulCreate    replicaset/ingress-nginx-controller-77dfd7769c     Created pod: ingress-nginx-controller-77dfd7769c-g728k
59m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
59m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 36; reason: memory resource utilization (percentage of request) above target
59m         Normal   ScalingReplicaSet   deployment/ingress-nginx-controller                Scaled up replica set ingress-nginx-controller-77dfd7769c to 36 from 34
59m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
54m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 38; reason: memory resource utilization (percentage of request) above target
54m         Normal   ScalingReplicaSet   deployment/ingress-nginx-controller                Scaled up replica set ingress-nginx-controller-77dfd7769c to 38 from 36
54m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
54m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
48m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 40; reason: memory resource utilization (percentage of request) above target
48m         Normal   ScalingReplicaSet   deployment/ingress-nginx-controller                Scaled up replica set ingress-nginx-controller-77dfd7769c to 40 from 38
47m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
47m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
40m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 42; reason: memory resource utilization (percentage of request) above target
40m         Normal   ScalingReplicaSet   deployment/ingress-nginx-controller                Scaled up replica set ingress-nginx-controller-77dfd7769c to 42 from 40
40m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
40m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
31m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 44; reason: memory resource utilization (percentage of request) above target
31m         Normal   ScalingReplicaSet   deployment/ingress-nginx-controller                Scaled up replica set ingress-nginx-controller-77dfd7769c to 44 from 42
31m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
31m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
18m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 46; reason: memory resource utilization (percentage of request) above target
18m         Normal   ScalingReplicaSet   deployment/ingress-nginx-controller                Scaled up replica set ingress-nginx-controller-77dfd7769c to 46 from 44
18m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
18m         Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
115s        Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 48; reason: memory resource utilization (percentage of request) above target
115s        Normal   ScalingReplicaSet   deployment/ingress-nginx-controller                Scaled up replica set ingress-nginx-controller-77dfd7769c to 48 from 46
115s        Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller
114s        Normal   CREATE              configmap/ingress-nginx-controller                 ConfigMap ingress-nginx/ingress-nginx-controller

Any insights into whether there is a cause we can identify here?
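
One way to turn an event dump like the one above into something that can be lined up against graphs is to pull out just the HPA rescale steps. The sketch below is an illustration only: it embeds the `SuccessfulRescale` lines from the events above (plus one other line to show filtering) and assumes the simple `59m` / `115s` age format seen here. Compound ages such as `2m30s` are not handled, and `parse_rescales` is a hypothetical helper name.

```python
import re

# Lines copied from the events above.
EVENTS = """\
59m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 36; reason: memory resource utilization (percentage of request) above target
59m         Normal   ScalingReplicaSet   deployment/ingress-nginx-controller                Scaled up replica set ingress-nginx-controller-77dfd7769c to 36 from 34
54m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 38; reason: memory resource utilization (percentage of request) above target
48m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 40; reason: memory resource utilization (percentage of request) above target
40m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 42; reason: memory resource utilization (percentage of request) above target
31m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 44; reason: memory resource utilization (percentage of request) above target
18m         Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 46; reason: memory resource utilization (percentage of request) above target
115s        Normal   SuccessfulRescale   horizontalpodautoscaler/ingress-nginx-controller   New size: 48; reason: memory resource utilization (percentage of request) above target
"""

RESCALE = re.compile(
    r"^(\d+)([smh])\s+\S+\s+SuccessfulRescale\s+\S+\s+New size: (\d+); reason: (.+)$"
)
SECONDS_PER_UNIT = {"s": 1, "m": 60, "h": 3600}


def parse_rescales(text):
    """Return (age_seconds, new_size, reason) tuples, oldest event first."""
    rescales = []
    for line in text.splitlines():
        match = RESCALE.match(line.strip())
        if not match:
            continue  # skip everything that is not an HPA rescale event
        age = int(match.group(1)) * SECONDS_PER_UNIT[match.group(2)]
        rescales.append((age, int(match.group(3)), match.group(4)))
    return sorted(rescales, reverse=True)  # largest age = oldest event


if __name__ == "__main__":
    for age, size, reason in parse_rescales(EVENTS):
        print(f"{age:>5}s ago -> {size} replicas ({reason})")
```

Every step in this window has the same reason, memory utilization as a percentage of the request, so the scaling is driven by the memory metric rather than by CPU or a custom metric.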

@Gacko
Member

Gacko commented Oct 23, 2024

Sorry, but I don't understand how this connects to my recent questions. I was asking you to investigate the static resource consumption right after you start a pod, without any load. This gives insight into how much memory the controller uses just for the bare cluster state. If you already exceed or are close to your target average memory utilization when idle, then you will need to increase the memory requests.
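
As a rough sketch of that check: HPA memory utilization is usage divided by the memory request, so an idle measurement translates directly into a minimum request. The numbers below (1900 MiB idle, 2048 MiB request, 80% target) are placeholders rather than values from this cluster, and the `headroom` factor is an arbitrary assumption about how much of the target idle usage should be allowed to consume.

```python
import math


def idle_utilization(idle_mib, request_mib):
    """Utilization as HPA sees it: usage as a fraction of the memory request."""
    return idle_mib / request_mib


def min_request_mib(idle_mib, target_utilization, headroom=0.5):
    """Smallest request (MiB) at which idle usage takes at most
    `headroom` of the target utilization, leaving the rest for traffic."""
    if not 0 < target_utilization <= 1 or not 0 < headroom <= 1:
        raise ValueError("target_utilization and headroom must be in (0, 1]")
    return math.ceil(idle_mib / (target_utilization * headroom))


if __name__ == "__main__":
    idle, request, target = 1900, 2048, 0.8  # placeholder measurements
    print(f"idle utilization now: {idle_utilization(idle, request):.0%}")
    suggested = min_request_mib(idle, target)
    print(f"suggested request: {suggested} MiB "
          f"(idle utilization {idle_utilization(idle, suggested):.0%})")
```

If the idle utilization printed first is already at or above the target, the HPA will scale up no matter how little traffic there is, which is the situation described above.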

As stated before: I know this is not perfect and we are targeting to solve this issue by splitting the controller into control plane and data plane.


This is stale, but we won't close it automatically. Just bear in mind that the maintainers may be busy with other tasks and will reach your issue ASAP. If you have any questions or a request to prioritize this, please reach #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Dec 18, 2024
6 participants