Potential memory leak on rbd image copy #904
Comments
Going to give some context for any users that run into similar problem I have. Looking at the code you see how I handle the connection closing (via If I drop the ioctx.Destroy()
conn.Shutdown() It's really weird, it's like the |
@phlogistonjohn go-ceph indeed has a memory leak with how If you return a I have worked around this by just calling the Since the Basically, the connection I return after |
OK, thanks for the update. Without a lot of investigation on my part yet, an issue with Shutdown seems more plausible to me. I'm reopening this issue since it automatically got closed from the other PR. We'll look into it soon.
It's very hard to reproduce. Heaptrack indicates it's only about ~6% of connections. I tried with a custom program that just did these checks endlessly and it never leaked. If I try on the program where I discovered this it takes 10+ hours for the leak to occur. I did my own custom changes to |
@shell-skrimp So, I'm a bit confused now. Which of your findings from above (defer vs direct call etc.) are still valid? Now it sounds here like there is a leak no matter what. To be honest, it even sounds like there might be a race within ceph itself. But I just started to look into this, so it's just a gut feeling.
@ansiwen neither are valid. I thought that direct calling was better than What I did in my testing:
In the meantime I switched to a long-lived ceph connection and that seems to have fixed the issue for now.
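For readers who want to try the same workaround, here is a minimal sketch of the long-lived connection pattern. It is not the code from this thread: `conn`, `sharedConn`, `dial` and `fakeConn` are hypothetical names, and `fakeConn` stands in for a real cluster connection so the example runs without librados. With go-ceph, `dial` would presumably create and connect a `*rados.Conn`, and `Shutdown` would be called once at process exit rather than after every copy.

```go
package main

import (
	"fmt"
	"sync"
)

// conn is a hypothetical stand-in for a cluster connection; the only
// method this sketch needs is the one that tears the connection down.
type conn interface {
	Shutdown()
}

// sharedConn opens one connection lazily and hands the same connection
// to every caller instead of connecting and shutting down per operation.
type sharedConn struct {
	mu   sync.Mutex
	dial func() (conn, error) // hypothetical constructor supplied by the caller
	c    conn
}

// Get returns the existing connection, dialing only if none is open.
func (s *sharedConn) Get() (conn, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.c != nil {
		return s.c, nil
	}
	c, err := s.dial()
	if err != nil {
		return nil, err
	}
	s.c = c
	return c, nil
}

// Close shuts the connection down; intended to run once, at exit.
func (s *sharedConn) Close() {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.c != nil {
		s.c.Shutdown()
		s.c = nil
	}
}

// fakeConn is a placeholder so the sketch runs without a cluster.
type fakeConn struct{ id int }

func (f *fakeConn) Shutdown() {}

func main() {
	dials := 0
	shared := &sharedConn{dial: func() (conn, error) {
		dials++
		return &fakeConn{id: dials}, nil
	}}
	defer shared.Close()

	// Three "copy operations" reuse the same underlying connection.
	for i := 0; i < 3; i++ {
		if _, err := shared.Get(); err != nil {
			fmt.Println("dial failed:", err)
			return
		}
	}
	fmt.Println("connections opened:", dials)
}
```

This only sidesteps the repeated connect/shutdown cycle; it does not explain or fix whatever leaks during shutdown.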
The tool I have written uses <70MB RSS to perform the exact same operations for qcow2 images (basically raw copying them). I can do this in rapid succession and it will never break 100MB. However, if I do the same thing with ceph (copying an image in a pool) with the exact same throughput as the qcow2-based implementation, after all is said and done my tool is using 1127M RSS and it's never freed. The ramp-up to this amount of memory takes a bit; initially it's only using about 40MB more than it would if I wasn't talking to ceph. So I think there is some type of memory leak, I'm just not entirely sure where to find it or how to track it down with go-ceph/cgo.
I have used `memleak-bpfcc -p <pid>` on the process and then started a copy and see the following:
There is also a less significant
So, what I'm trying to understand is if the allocations are reaped at some point or there is indeed a memory leak?
I have also checked the pprof dumps, and from what I can tell I am releasing memory properly (closing connections, ioctx, images, etc.); the process uses about 512KB on the heap after testing. So it appears that all the memory being used is outside the Go runtime, which leads me to believe it's librados/go-ceph (especially since this started occurring once I added ceph support).
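As an illustration of that heap-versus-RSS comparison (a sketch, not the code from this report), the Go runtime's own accounting can be printed next to the resident set size the kernel reports. It assumes Linux, where `/proc/self/status` has a `VmRSS` line; `parseVmRSS` is a hypothetical helper name. A large gap between `Sys` and RSS suggests memory allocated outside the Go runtime, for example by C libraries called through cgo.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"runtime"
	"strconv"
	"strings"
)

// parseVmRSS extracts the VmRSS value (in kB) from the text of a
// /proc/<pid>/status file. Hypothetical helper for this sketch.
func parseVmRSS(status string) (uint64, error) {
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Expected line shape: "VmRSS:   123456 kB"
		if len(fields) >= 2 && fields[0] == "VmRSS:" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("VmRSS not found")
}

func main() {
	// Memory the Go runtime knows about.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("Go heap in use:       %d kB\n", m.HeapInuse/1024)
	fmt.Printf("Go runtime from OS:   %d kB\n", m.Sys/1024)

	// Memory the kernel attributes to the whole process (Linux only).
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	rss, err := parseVmRSS(string(data))
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Printf("process RSS:          %d kB\n", rss)
}
```

Logging these values periodically during a long run shows whether RSS keeps growing while the Go-side numbers stay flat.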
The code I use to do the copy:
To show that it's not anything I'm doing in the runtime:
If this is a memory leak, what more could I do to help get more info?
Using go 1.20 with go-ceph v0.22.0 on a local quincy test cluster.