Out of memory issue #526

Closed
sathibault opened this issue Oct 18, 2019 · 6 comments

Comments

@sathibault

I have a consumer that is consuming massive amounts of memory and being killed by the OS.

I've commented actual processing out, so the consumer is empty:

  await consumer.run({
    eachMessage: async record => {
    }
  });

From strace I see the following:

{"level":"debug","logger":"kafkajs","message":"Request Fetch(key: 1, version: 7)","broker":"stage-health-intelligence-kafka-3:9092","clientId":"boundary-service","correlationId":7,"expectResponse":true,"size":320,"timestamp":"2019-10-18T17:36:04.238Z"}
brk(0x57e4000)                          = 0x57e4000
brk(0x5be4000)                          = 0x5be4000
...
{"level":"debug","logger":"kafkajs","message":"Response Fetch(key: 1, version: 7)","broker":"stage-health-intelligence-kafka-2:9092","clientId":"boundary-service","correlationId":1,"size":1049050,"data":"[filtered]","timestamp":"2019-10-18T17:36:06.999Z"}
brk(0x1610dc000)                        = 0x1610dc000
brk(0x1611ce000)                        = 0x1611ce000

These brk calls happen between two successive log messages, so the process size is growing from 0x57e4000 to 0x1610dc000 between them.
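A minimal way to watch this from inside the process, assuming nothing beyond Node's built-in process.memoryUsage(), would be to log memory from the otherwise empty handler and line it up with the Fetch debug entries:

  await consumer.run({
    eachMessage: async record => {
      // Log resident set size and V8 heap stats so growth can be correlated
      // with the Request/Response Fetch entries in the kafkajs debug log.
      const mb = n => (n / 1024 / 1024).toFixed(1);
      const { rss, heapUsed, external } = process.memoryUsage();
      console.log(`rss=${mb(rss)}MB heapUsed=${mb(heapUsed)}MB external=${mb(external)}MB`);
    }
  });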

Any leads on how I can further troubleshoot this?

@sathibault
Author

I'm using kafkajs 1.11.0 with kafkajs-lz4 1.2.1

@JaapRood
Collaborator

Normally I'd say any out-of-memory issue comes from broken back-pressure, but that doesn't hold up where memory grows 100x between 2 messages 😅. I don't really have any useful intuition about where else this might be coming from, but here are some suggestions on how I might try to narrow down the root cause:

  • I'd try to create a memory profile through Chrome DevTools (by running node --inspect-brk). There's a chance the profiling won't work because of the massive jump in memory, but it's worth a try, as it should tell you where memory was allocated (see the sketch after this list).
  • I'd try running against a different topic, possibly without the LZ4 compression, to either rule that out as a factor or highlight it as one.
  • I'd possibly try a different Node (and thus V8) version (which version are you on, actually?). It's rare, but I have seen the occasional V8 bug over the years.
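
For the first point, a minimal sketch of capturing heap snapshots around one of those large fetches (assuming Node >= 11.13, which ships v8.writeHeapSnapshot; the file names and the 30-second delay are placeholders):

  const v8 = require('v8');

  // Snapshot before the consumer starts fetching...
  v8.writeHeapSnapshot('before-run.heapsnapshot');

  await consumer.run({
    eachMessage: async record => {}
  });

  // ...and again once a large Fetch response should have arrived. Load both
  // files in Chrome DevTools (Memory tab) and compare them to see which
  // objects are retaining the allocated memory.
  setTimeout(() => v8.writeHeapSnapshot('after-fetch.heapsnapshot'), 30 * 1000);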

@tulios @Nevon have more intimate knowledge of the internals, so perhaps they can be of more concrete help!

@Nevon
Collaborator

Nevon commented Oct 21, 2019

I would strongly suspect lz4 of being the culprit. We are running the same version in production (without compression) with stable memory consumption:

[Screenshot 2019-10-21: production memory consumption]

While it's not impossible that it's something on our side, perhaps related to some specific Node version, it feels more likely to be lz4, given that it's a native lib with known memory leaks.

It would be good if you could try without compression and see if you have the same issue, just to verify whether or not the leak is coming from us.
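
A minimal sketch of that comparison, assuming you can fill a separate uncompressed test topic (broker, topic name and message count are placeholders; the codec registration follows the kafkajs-lz4 README):

  const { Kafka, CompressionTypes, CompressionCodecs } = require('kafkajs');
  const LZ4 = require('kafkajs-lz4');

  // Comment this registration out for the uncompressed run, so the native
  // lz4 bindings are never loaded at all.
  CompressionCodecs[CompressionTypes.LZ4] = new LZ4().codec;

  const kafka = new Kafka({ clientId: 'leak-test', brokers: ['localhost:9092'] });

  // Fill a test topic with uncompressed messages...
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'leak-test-uncompressed',
    compression: CompressionTypes.None,
    messages: Array.from({ length: 10000 }, (_, i) => ({ value: `msg-${i}` })),
  });

  // ...then point the same empty consumer at it and watch memory.
  const consumer = kafka.consumer({ groupId: 'leak-test' });
  await consumer.connect();
  await consumer.subscribe({ topic: 'leak-test-uncompressed', fromBeginning: true });
  await consumer.run({ eachMessage: async () => {} });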

@sathibault
Author

Thanks. Unfortunately, I'm unable to disable lz4 on the topic and was unable to spend any more time debugging this. I've migrated to node-rdkafka.

@ankon
Contributor

ankon commented Jun 21, 2020

While it's not impossible that it's something on our side, perhaps related to some specific Node version, it feels more likely to be lz4, given that it's a native lib with known memory leaks.

Mostly for the record: I saw this issue today and found the idea of a leak in lz4 very scary -- we're also using lz4-compressed messages, and we have been seeing memory issues.

But after some playing around, it looks like the lz4 leak in the linked issue is actually a problem with the benchmark: there simply weren't any GCs running. See my comment over there for details.

For us this did bring up something else, though: looking at our batch processing logic, I found the same issue the test had! We were starving the event loop because our batch processing was essentially a tight loop of async/await code, and we never gave Node.js time to process anything else until the batch was eventually done. We've now fixed this by artificially introducing event loop runs (essentially we wrap our per-message processing in a setImmediate), and I hope to see memory usage drop now as well.
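
A minimal sketch of that workaround, assuming an eachBatch-style handler (handleRecord stands in for the real per-message processing):

  await consumer.run({
    eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
      for (const message of batch.messages) {
        // Defer each message to a fresh macrotask so timers, I/O callbacks
        // and other queued work get a chance to run between messages instead
        // of being starved by a tight async/await loop.
        await new Promise((resolve, reject) => {
          setImmediate(() => handleRecord(message).then(resolve, reject));
        });
        resolveOffset(message.offset);
        await heartbeat();
      }
    }
  });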

@paras-nference

@sathibault Hey, I'm facing the same memory-leak issue in my Node app and want to find out whether it's coming from Kafka alone. I want to print the process size in the log like you are doing, but I couldn't find how to enable this. Can you please help me with it?
{"level":"debug","logger":"kafkajs", ... ,"size":320, ....}

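For reference, the quoted entries come from KafkaJS's built-in logger (the size field there is the request/response size, not the process size); a minimal sketch of enabling it, assuming the standard logLevel option (broker list is a placeholder):

  const { Kafka, logLevel } = require('kafkajs');

  // With logLevel.DEBUG the client logs every Request/Response Fetch,
  // including the "size" field shown above.
  const kafka = new Kafka({
    clientId: 'boundary-service',
    brokers: ['localhost:9092'],
    logLevel: logLevel.DEBUG,
  });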