LXCFS crashes, need help diagnosing #644
Comments
@mihalicyn Since you self-assigned I'm hoping this is something which can be looked at and resolved, as we are now seeing this happen on 2 additional systems. What the systems have in common:
Here is a dump from another machine where this happened, not sure if you can tell if this is an identical crash or different. If it's different, let me know and I'll start collecting crash dumps from the various instances so you can get a complete picture: Crash data
|
Hi @webdock-io Thanks for your report! I'm working on your case right now. Second one is interesting:
It looks like for some reason, Line 1645 in 62c7230
I'm continuing investigation. |
@mihalicyn That's amazing! Much appreciated. I will collect crash dumps from other hosts where this has happened, I think we have seen this across 4 hosts thus far, and post them here when I have the chance - in case they are somehow different from what you've seen here already. Cheers :) |
Here are the remainder of crash dumps we've seen so far, so you can check if this is the same or different. Crash data 1
Crash data 2
Crash data 3
|
Hm, interesting. All 3 cases are a bit different, but root cause is the same. Invalid value of |
@webdock-io can you please show |
Sure - all our instances with the issue are identical to this:
|
Did this problem appear recently? After a kernel upgrade? Or after LXD upgrade? (or both?..) |
Well yes and no... These are not old / upgraded systems, they are all brand new. All our newly deployed systems are on Noble with the latest kernel and the latest stable lxd version. These are also systems where we are running containers inside VMs, which we haven't done before. So this is all new, and it is the first time we run a significant number of containers and load on these new systems. We are not seeing this on any of our old systems. We did do a lot of load testing before deployment and didn't see this happen. But we may have missed it, as we weren't actively looking at lxcfs but rather at system stability. |
@webdock-io thanks for providing extra context on this ;-) Can you try to collect a crash dump from LXCFS? All you need to do is (as root):
and wait until the next crash. Then collect the crash dump file in |
Hello. We have now run the command on all thus-far affected systems. If it happens again in the same location(s) we will have a dump for you |
Small update: I'm really not sure we will see this type of crash again. We identified a config issue on our systems where our ZFS ARC cache limit was set way too low. This was causing abnormally high IOWAIT on our systems which seems to have been the root cause of our problems (we were also seeing system halts/crashes). After we upped our ARC cache limit to something more reasonable, we see really good behavior everywhere and thus far no lxcfs or system crashes. This may point to something on the systems or in the kernel, due to the io bottleneck, getting messed up. It's only been a couple of days since we made this change, so the issue may not be completely solved - but so far it's looking good. We will be keeping the crash dump collection flag turned on and will of course revert here if we see it happening again. Thank you for your time |
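For anyone wanting to check how close the ARC is to its configured limit on a host, here is a minimal Python sketch. It assumes OpenZFS on Linux, where the kernel module exposes counters in `/proc/spl/kstat/zfs/arcstats` as `name type data` rows after two header lines. The helper names are hypothetical and it only reads the `size` and `c_max` counters; it is an illustration, not the check used in this thread.

```python
from pathlib import Path

# Standard location of ARC counters for OpenZFS on Linux.
ARCSTATS = Path("/proc/spl/kstat/zfs/arcstats")


def parse_arcstats(text):
    """Parse the 'name type data' rows of arcstats into a dict of ints."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are kstat headers
        parts = line.split()
        if len(parts) == 3 and parts[2].lstrip("-").isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_fill_ratio(stats):
    """Fraction of the ARC limit (c_max) currently in use (size)."""
    return stats["size"] / stats["c_max"]


if __name__ == "__main__" and ARCSTATS.exists():
    s = parse_arcstats(ARCSTATS.read_text())
    print(f"ARC size {s['size']} of limit {s['c_max']} ({arc_fill_ratio(s):.0%})")
```

A ratio that sits at or near 1.0 under heavy read load would be consistent with the "limit set way too low" situation described in the comment above, though it does not prove it.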
Even when IOWAIT times are high, crashes are still crashes and we should fix them. So you have a workaround, but the issue is still there. I would not ignore it; I would try to revert this workaround and get a reproduction and a crash dump, so we can figure out what is wrong on the LXCFS/libfuse/kernel side that leads to crashes like the ones you've seen before.
So LXCFS was not the only thing crashing? What else was crashing then? |
Hello. Yes I agree, crashes are still crashes. I think it may be difficult to get a reproducer on this, but we will have a system free soon where I can attempt to recreate the issue for you, but no guarantees.
I do not want to confuse the matter. This was more of a statement that we saw other things misbehaving, namely our virtual machines. As mentioned earlier, we are running these workloads inside LXD VMs, so nested LXD if you will. Here we saw the VMs hang: no crash output, nothing in syslog, no explanation that we could find. The VM would just go into a halt state and we would need to restart it to get it operational again. This did not coincide with the lxcfs crashes we saw, so they are not related per se, but we think the root cause was the same. After we made the ARC cache happy and alleviated the bottleneck, we have not seen such a crash in 48+ hours now, where before we saw 1-2 crashes every 24 hours. So we are hopeful we have mitigated that issue as well. Still, a full VM crash is worrying to say the least, and as with the lxcfs crash we haven't really fixed the issue; we've just resolved what was provoking it (we think/hope). |
Hello. We have now seen this happen on a completely new host, which had our "workaround" implemented. So this crash is still happening and is still real. However, we have had to remove your crash dump reporting flag from the hosts where we had it activated, as it was filling up our disk! We are a public cloud hosting lxd container instances, and whatever crashes were happening in customer containers were being dumped to /var/crash. Not only did this create a lot of crash dumps on some hosts; worse, it was quickly filling up the disk on some systems. We caught it in time and averted disaster, but we simply cannot allow this crash dumping to be active on production systems. So, in order to try and procure a crash dump for you, we have dedicated an instance to this purpose where we are stressing it a lot. We hope this will provoke a crash sometime in the next 24-48 hours. If not, then I'm not sure what we can do, as it would seem synthetic workloads will not do the trick... We will see. Anyway, here is the latest crash on an otherwise healthy system, so you can see if it's the same (I'm thinking it is): Crash data
|
@mihalicyn Here is another case for you. I don't want to be reporting the same crash over and over, but I'm reporting this one as it's different somehow. So, we have "detection" of crashes which just consists of us checking if the lxcfs process is running with
In my mind, very odd behavior in this case. We checked on further hosts, and we actually have one more host which has two instances of lxcfs running, but there it seems OK and doing its job. We are not sure what to make of this. Here is the relevant crash dump: Crash data
|
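A "is the lxcfs process running" check of the kind described above can also tell apart the zero, one and several instances cases. The following is a minimal Python sketch, assuming a Linux `/proc` layout where `/proc/<pid>/comm` holds the process name. The function names and the status labels are hypothetical, and this is not the command the reporter actually uses.

```python
import os


def find_processes(name, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # the process exited while we were scanning
    return sorted(pids)


def lxcfs_status(pids):
    """Classify the scan result: no instance, exactly one, or several."""
    if not pids:
        return "crashed"
    if len(pids) > 1:
        return "multiple"
    return "ok"


if __name__ == "__main__" and os.path.isdir("/proc"):
    found = find_processes("lxcfs")
    print(lxcfs_status(found), found)
```

Note that a running process only shows that lxcfs has not exited; it says nothing about whether the FUSE mount is still serving correct data, which is why the "multiple" case is reported separately rather than treated as healthy.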
One last addendum here @mihalicyn: we are also the guys running everything under cgroupv1, because of this: #538. I don't know if that has any bearing on the issue, but we thought it was relevant to mention. Edit: We can give you access to a test environment if you want to inspect some things or test something out. Just send us your public key at [email protected] and we will give you access. Thanks very much for your time. This issue is unfortunately becoming rather painful for us, so we really hope you can find time to check this further. Fingers crossed :) |
@mihalicyn is there a way to limit core dumps to disk so that they only happen if lxcfs crashes and not other stuff? If so, we would be able to enable this infrastructure-wide and catch this for sure... We tried looking around for info on this, but it's unclear how exactly to achieve it. If you know how, that would help. |
We have now absolutely hammered a system for more than 24 hours with high load on CPU, memory, I/O and sirq. We nerfed our ARC cache and had 3 container instances doing a lot of activity. And... nothing. lxcfs runs fine, and there was no crash of our VM. So it seems we are unable to reproduce this synthetically. Maybe we just need a lot more containers... An idea could be to spin up 100+ containers and have them all do... something. @mihalicyn Maybe to help us get a reproducer here, can you tell us if the crashing code path relates to some specific virtualization in lxcfs? For example, is the code tied to CPU, memory or some other specific subsystem? Then we could focus on creating diverse workloads against just that subsystem, instead of hammering everything all at once. Anyway, we await your feedback here :) |
Hi, @webdock-io
You can try something like that:
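The snippet originally posted here is not preserved. As a rough illustration of the idea only, here is a minimal Python sketch of a core dump filter. It assumes the kernel's `core_pattern` pipe mechanism (a pattern starting with `|` sends the core to a program on stdin, with `%e` expanding to the executable name and `%p` to the PID). The install path, the `/var/crash` target and the `core-<name>.<pid>` file naming are assumptions.

```python
#!/usr/bin/python3
# Absolute interpreter path: the kernel runs core_pattern helpers with a minimal environment.
import shutil
import sys
from pathlib import Path

DUMP_DIR = Path("/var/crash")  # assumption: where kept dumps should land
WANTED = {"lxcfs"}             # executable names (%e) worth keeping


def should_keep(exe_name):
    """Only cores from the listed executables are written to disk."""
    return exe_name in WANTED


def save_core(exe_name, pid, stream, dump_dir=DUMP_DIR):
    """Copy the core from `stream` into dump_dir, or discard it. Returns the path or None."""
    if not should_keep(exe_name):
        return None  # exiting without reading stdin simply drops the core
    dump_dir.mkdir(parents=True, exist_ok=True)
    target = dump_dir / f"core-{exe_name}.{pid}"
    with open(target, "wb") as out:
        shutil.copyfileobj(stream, out)
    return target


if __name__ == "__main__" and len(sys.argv) == 3:
    # Invoked by the kernel as: <this script> %e %p, with the core on stdin.
    save_core(sys.argv[1], sys.argv[2], sys.stdin.buffer)
```

If the script were installed at a hypothetical path such as `/usr/local/bin/lxcfs-core-filter`, it would be registered by writing `|/usr/local/bin/lxcfs-core-filter %e %p` to `/proc/sys/kernel/core_pattern` as root. Crashes inside customer containers would then be discarded instead of filling `/var/crash`.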
|
This is a good option. Let's try to get a core dump with the method I've described above (with filtering), and if that doesn't work, then giving me access would be the last resort. |
Ugh, that's painful. At some point cgroupv1 will become deprecated upstream, or even earlier in systemd, and you will have no chance to stay on it. I believe we should sort out the blocker that prevents you from migrating to cgroup-v2. It's extremely important. Let's discuss this in #538 |
That's a good question. Actually, I tend to believe that the crash happens not because of an issue in LXCFS; more likely it's stack corruption somewhere in libfuse or even a kernel problem. To investigate this deeper I need a memory dump of the LXCFS process (ideally, a few different ones). |
@webdock-io I have edited your messages a bit and put all the crash data content inside a special markup:
this makes this issue thread a bit more readable for others :) |
@mihalicyn Thank you for the markup tip
I am very happy you see this as important, because it is! We were starting to have the discussion internally whether we should migrate over to KVM entirely and completely drop container VPS support, as this issue would be a complete blocker for proper container support in the near future. Now that we know that this issue is being worked on, we will hold off on pressing the panic button anytime soon :) We are unfortunately still seeing occasional crashes as you know, and had one last night. Unfortunately we had not yet had the time to implement your new crash dump code. This will be done today on all our hosts, and I am confident we will have one or more crash dumps for you within the next few days. Thank you for your continued assistance; we will report back here once we have more material for you. |
@mihalicyn We have a crash dump for you! However...
This command looks a bit wrong. I tried a few things, but I am unable to encrypt the dump for you. It looks to me like you expect me to find your pubkey somewhere? I really tried my best reading the man page for gpg and looking around to see if I could find your pubkey somewhere, but I was unable to. Please provide more idiot-proof steps for me here, and I'll get this encrypted and placed somewhere you can grab it asap :) Thanks! |
@mihalicyn After much much faffing around I finally found your gpg key at keyserver.ubuntu.com and imported it with Anyway, here is the encrypted crash dump attached. I'll post further crash dumps as they come in :) I really hope this helps you find the issue! Edit: GitHub would not accept the .gpg extension for uploads, so I just renamed the file and reversed the extension so GitHub would eat it. You should rename it to .gz.gpg before decrypting and then decompressing. |
@mihalicyn And here is a new one from another host |
Hi @webdock-io !
Sorry about not providing you with more detailed instructions. Encrypting it was an optional step, but you did it right! Thanks for providing me with your core dumps, I'm on it and investigating this issue. |
Hello @mihalicyn We now have two more crash dumps for you: core-lxcfs.2897.gpg.gz Any progress on this issue so far? |
I think I'm facing this too: I noticed all 4 cores of my host were visible in one of my containers. I restarted my host and all looks good now. I have also put the logging mechanism in place and will let you guys know if I see a crash again. Let me know if you need any information from me. Thanks. |
Hi @webdock-io I posted my analysis (for those who are interested in these details) at the end of this message. First of all, thanks for helping us and providing core dumps. This is crucial. I think that we need to go even further with this and provide you with some special debug builds of LXCFS which can allow us to catch the issue. I'll think about a proper way to do so. Of course, I can't just provide you with a binary and ask you to run it. It's just not how we do things. :) Likely, I'll prepare a special branch of LXCFS and instructions on how to build it on your infrastructure and deploy it. Also, can you please tell me if you have any special or unusual software in containers on your infrastructure that can affect LXCFS? Maybe you are using some monitoring system agent like node_exporter or something like that. I ask because you have a stable reproducer; if I had such a stable reproducer, things would be much, much easier to solve.
Analysis of core-lxcfs.1147289
This specific coredump is extremely useful and interesting. We have an invalid value of
Analysis of core-lxcfs.1499342
glibc detected a double-free during releasing. We can't say anything more from that, because from coredump
Analysis of core-lxcfs.2897
In this case, the problem is here:
Analysis of core-lxcfs.3396
in proc_stat_read():
again, very weird,
|
Hi @mihalicyn Amazing work so far, but concerning that it's outside of lxcfs seemingly and thus potentially has deeper roots. Any wholesale lxcfs debug deployment would only be feasible if we can do so without a reboot of the hosts, as otherwise we'd have to deploy it whenever a host crashes anyway (which is unfortunately not rare these days, explanation to follow) - but let's see what you come up with and how we can best achieve this. Some information for you to go on:
We install lxd using snap, of course. Now, from what we can tell, this crash happens on hosts which are using some amount of disk I/O. Without fully understanding your analysis, it seems the crash is related to disk I/O, or am I mistaken? If it is, then what we're also seeing these days after migrating to the latest kernel/Noble is occasional kernel crashes with no output in syslog or kern log other than null bytes. We have tracked this down to be very strongly correlated with the zfs ARC cache and heavy read workloads. Essentially speaking, as long as the ARC RAM cache is not full and zfs can cache reads in memory, things are fine. As soon as the ARC cache is filled and the heavy read workloads continue, thus causing zfs to hit disk a lot, we get a kernel panic. This is admittedly a more serious issue for us, and we are doing our best to mitigate it by spreading workloads across our infrastructure and allowing zfs to use obscene amounts of ARC cache. I thought I'd mention this issue as maybe it's related. However, we are seeing these kernel crashes on bare metal as well as VMs, and these kernel crashes do not coincide with lxcfs crashes in any meaningful way; the lxcfs crashes also happen on relatively calm systems. So yeah, not directly related but maybe relevant, not sure, as we have a feeling that lxcfs is crashing a lot less now that we've increased our ARC cache across our entire infrastructure. Please let me know if there is any further, more specific information you need. I would not say that we have a reliable reproducer per se, other than "this is happening fairly frequently" :) |
Oh! This is a very, very important fact. Just to ensure that I understood you right: all LXCFS crashes happened only inside the VM instances, and you have not seen any crashes on bare metal hosts yet. Right?
One of the crashes happened on the read of Interesting thing here is that all these crashes are somehow similar and at the same time different. There is nothing in common from the LXCFS perspective. Different files like
This is painful. And, for sure, this is priority 1 to fix. I would suggest enabling kdump to collect kernel crash dumps so you can report these crashes to the Ubuntu kernel maintainers. Please refer to https://ubuntu.com/server/docs/kernel-crash-dump (feel free to ask me if you have any questions).
Hm, just in case: I assume you have tested your RAM modules with memtest86+ and the disks' SMART data is fine too? I would say that this can be a sign of a very serious problem in the kernel. ZFS, while being a piece of art in software engineering, is an out-of-tree Linux kernel module, which brings additional risks because OpenZFS developers have to adapt it to each kernel version, and errors cannot be ruled out there. It can be a misuse of an API because of API changes in the upstream kernel and so on, which can lead to memory leaks, use-after-free errors, OOB reads/writes and so on. As a consequence, it may corrupt your kernel memory state and make your system behave in extremely unexplainable ways. Even things which are not directly connected to ZFS, filesystems or disk I/O can start to break due to memory corruption (when buggy code writes to a memory region which does not belong to it).
Memory corruption is a random thing. For example, one time it can overwrite a memory structure of a "not important" thing (there are no unimportant things in the kernel ;-)) and you won't see any visible effects. The next time, memory corruption can hit a more important memory area, which will be noticed immediately because of a crash of some process, a VM or even the entire system.
And this is a very important observation. It can be that ZFS somehow corrupts memory, and when you decreased the memory corruption rate, LXCFS started to crash less often too. To conclude, I would suggest the following:
May be related to openzfs/zfs#16187 |
Hi @capriciousduck ! Thanks for reporting this!
Please, can you tell me the following info from your system:
also, are you using just LXC+LXCFS or LXD or Incus? |
That's correct. So far lxcfs crashes have only happened inside VMs and not on bare metal.
We looked into this, and the more we looked the more daunting this became. It is by no means as simple as shown in the Ubuntu docs, it seems, and we ended up studying this guide in detail: https://hackmd.io/@0xff07/S1ASmzgun We concluded that the difficulty in setting this up, and the difficulty in reading the dumps (although maybe we wouldn't actually need to read these ourselves, but rather pass them on to more knowledgeable persons), really discouraged us from even trying. Not to mention this requires a reboot before it's even active, which is nothing we can/will do (unless a system has already crashed and we need to reboot anyway...). I guess we could attempt this, so that it is "ready": once a system needs a reboot anyway, from then on it will be active and then... we would see. If you can illuminate this for us / give us some pointers on how to get this set up, I'd be willing to give it a shot.
These are all brand new systems with ECC RAM throughout and new drives with no SMART errors. We did run 24-48 hour high load testing where we stressed memory before deploying into production, which has always been enough/successful in detecting bad RAM in our past experience. Not to mention these issues have happened across a wide range of hosts (and even platforms, Ryzen vs. Intel Xeon), so we are pretty confident this is not hardware related in any way.
Yes I agree from the information you have given us here and from what we've seen, this really smells like memory corruption caused by ZFS.
Maybe you could save us some research time and/or mistakes, if you can show us the steps on how to safely revert the kernel version? We have never done this, we have only ever upgraded the kernel :D
We found this issue as well in our research and felt it was very similar to our case! Although, we have been running with pretty recent kernels and zfs versions newer than the one reported in that issue, in production for a long time, and saw none of these crashes before - so we really don't know if this is the same or not. It's really only when we deployed systems with the latest Noble kernel and the zfs that comes with it, zfs-2.2.2-0ubuntu9, that we started seeing this. |
It is definitely worth it. Having a core dump will allow us to at least identify a call stack and report the issue properly (who knows, maybe it's a known issue). As for reading a dump, don't worry: you can share it with me and I'll help with that.
If you are using Noble, it looks like there is no choice, as 6.8 is the main kernel Noble was released with. In this case you can try to deploy an upstream kernel version + upstream ZFS 2.2.4 (https://openzfs.github.io/openzfs-docs/Developer%20Resources/Custom%20Packages.html#debian-and-ubuntu). The ZFS update from 2.2.2 -> 2.2.4 should be pretty safe, as it's a minor bugfix release. I don't know why the zfs package in Ubuntu Noble is still on 2.2.2... Maybe it's better to first try to build a newer ZFS and deploy it to see if it helps. And if not, then go for an upstream kernel version (for example, https://kernel.ubuntu.com/mainline/v6.8.12/). |
You could try my mainline packages to get a clean upstream kernel (no Ubuntu changes) but still built with a config that's very close to the Ubuntu one. To do so, you'll need to:
|
Thanks so much @mihalicyn that is amazing. We will proceed to at least perform the initial steps before the reboot on all our hosts today, so they are somewhat ready to collect crash dumps. We have a single system which crashed again last night (after having crashed two times in the previous 12 hours, so a system in a pretty flaky state) where we fully enabled this, but there we also took some mitigation steps such as upgrading to the latest kernel (from 6.8.0-31 to 6.8.0-36), moving some high I/O users away and increasing the ARC cache even further. So it's not certain this system will crash again and provide a dump for us. I will of course ping you as soon as such a dump is produced. As to the suggestion from @mihalicyn to try zfs-dkms: we have for other reasons installed zfs-dkms in the past, and now we have some systems on a legacy part of our infrastructure which have this issue: https://bugs.launchpad.net/ubuntu/+source/zfs-linux/+bug/2044630 It's really not recommended to install zfs-dkms, and seemingly if left for "too long" it can cause problems down the line, where removing it is fraught with potential danger and would involve downtime for customers for sure, so we are really hesitant to do something like this. Ideally we'd stay on official Ubuntu kernels and do things "the recommended way". Which leads me to the suggestion from @stgraber. This also strikes me as potentially dangerous in a couple of ways. Firstly, it is unclear to me whether we could easily revert to an official kernel with the built-in zfs support down the line. What if Stephane gets hit by a bus and is no longer able to maintain the Zabbly builds? When would we know for sure that the official Ubuntu kernel is safe to switch back to? I really do appreciate the suggestion @stgraber, but without further insight (and a proper plan) I am currently very hesitant to do something like this.
With that said, I have reached out to you directly by email and I hope we can have a further chat about this in the coming days if you have time. So, for right now what we are doing is preparation and mitigation steps where we:
Anything beyond that will have to be the subject of further analysis, as we can't just willy-nilly go to "experimental" kernels with associated reboots and downtime for customers time and again :( I have a follow-up question for you guys. We are deploying a number of new host machines next week on our infrastructure. This gives us the opportunity to do things differently and hopefully deploy unaffected systems, which could be "safe havens" to which we migrate the customers who are triggering the issue on the unstable part of our infrastructure. Would we benefit from basing these systems on Ubuntu Jammy instead of Noble? Or would we just be getting the latest kernel/zfs modules, the same as for Noble, anyway and be in the same situation? I guess Jammy would give us a better opportunity to run with an older kernel, say the latest kernel version we know was successful in production before. I guess what I am asking is, what would you do in this situation? Any input / knowledge here would be appreciated :) |
@mihalicyn While we wait for a kernel crash dump, which I think is getting very close now as we have an unstable system which as of today is ready to dump, so it's just a matter of time before we have one (2-3 days max I'd say) - do you want further lxcfs crash dumps? We have 4 new crash dumps for lxcfs we can send through, if you find it relevant. |
Hello. We have now captured a crash dump. This happened on a host where we were evaluating the zabbly mainline kernel in the hopes that this would solve our issue, so we purposefully moved problematic workloads there. However, in less than 24 hours it crashed. From my limited understanding of the dmesg, this does indeed look related to zfs, specifically arc_prune. The kernel/software this was running on is:
HOWEVER there is a big caveat here - after the dump happened we got a report that a RAM module had failed and had been offlined. So it seems this particular system has at least a single bad ECC RAM module. For that reason, we cannot conclusively say that this kernel crash is the same as we've seen before or not. Here is the dump, encrypted with your key: https://krellide.webdock.io/crashdumps/crashdump.zip.gpg I have shared this with Stephane Graber as well. Hopefully you guys can ascertain next steps here. Our next test is that tomorrow we will bring 6 virtual machines up to this new kernel and zfs, which all had crashed lxcfs and thus are in a bad state. Since these systems have already proven prone to crashing lxcfs, we believe this "small scale" test will tell us within a few days, max a week, whether a kernel upgrade solves the lxcfs crashing issue at least. |
Update here. Upgrading machines to the latest zabbly kernel and zfs does not solve the crashing lxcfs issue :( We have now captured a crash of lxcfs on:
I thought it might be interesting for you to see a crash dump on this very much different kernel from the previous crash dumps you have seen, so you can find it here encrypted the same as before: |
Hey @webdock-io
unfortunately, this crashdump is not complete and I can't open it:
Please check why you are getting incomplete crash dumps.
BUT, even though you have an incomplete crash dump, you still have a very, very informative dmesg message:
I would try to report this crash to ZFS developers and attach info about which ZFS version you had.
You are absolutely right. I'll check this lxcfs coredump today. |
This looks related to openzfs/zfs#16324 |
Hey @webdock-io, could you please give me a binary of a ZFS kernel module from your node? You need something like It's worth mentioning that it only makes sense if you still have all the versions of kernel/zfs as at the time of crash. If not - then let's just wait for the next crash. |
btw, you can test the kdump mechanism with Because each crash is an invaluable piece of information for us (and also a pain for a system administrator ;-) ). So it makes sense to ensure that we have the crash dump mechanism fully working before waiting for the next reproduction and hoping that we will be able to extract any useful information. |
Hi @mihalicyn Here is the requested kernel module which was running at the time of the lxcfs crash: Unless you are referring to the kernel crash? In which case, it should be the same. As to the incomplete crash dump: the machine was only in a crashed state for a minute or two before we rebooted it. I suspect this is why the dump is incomplete, although I don't know how long a dump usually takes to complete. There is plenty of space on disk. Maybe the memory setting is flawed, although we are using the default, which to my knowledge uses a percentage of RAM, and we have a lot of RAM on that machine. Anyway, we will check this. |
Actually, this module is for another kernel:
while in your crash report you have:
I would suggest checking how crash dump collection works in practice:
Actually, at step 6 you'll likely receive an error and crash will ask you to provide symbols (vmlinux). You need to install a if you see any error messages or anything else along the way then we should debug what's wrong with kdump and file a bug against Ubuntu. |
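The mismatch noted above (a module built for one kernel, a crash report from another) can be checked mechanically before sharing files. Here is a minimal Python sketch, assuming the module's `vermagic` string starts with the kernel release it was built for and that `modinfo -F vermagic zfs` is available on the host. The helper names are hypothetical, and the release strings in the usage note are only examples based on the versions mentioned earlier in the thread.

```python
import platform
import shutil
import subprocess


def module_kernel_release(vermagic):
    """The first field of a vermagic string is the kernel release the module targets."""
    fields = vermagic.split()
    return fields[0] if fields else None


def module_matches_kernel(vermagic, kernel_release):
    """True when the module was built for exactly this kernel release."""
    return module_kernel_release(vermagic) == kernel_release


if __name__ == "__main__" and shutil.which("modinfo"):
    # Compare the installed zfs module against the currently running kernel.
    result = subprocess.run(["modinfo", "-F", "vermagic", "zfs"],
                            capture_output=True, text=True)
    if result.returncode == 0:
        print(module_matches_kernel(result.stdout, platform.release()))
```

For a crash that happened on an earlier boot, the second argument would be the release string taken from the crash report rather than `platform.release()`, for example `module_matches_kernel(vermagic, "6.8.0-36-generic")`.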
@webdock-io Just seeing this issue again. Have you noticed any LXCFS crash since we've put the various tweaks in place? From what I can recall we basically did:
|
No, we are stable now. What made the biggest difference was going back to a 6.6 series kernel, which seems to have solved all issues. Second, the per-instance LXCFS feature in Incus is amazing and really helps with this, but with that said, we've had no complaints of crashed LXCFS since the kernel downgrade. As an aside, the latest ZFS-DKMS release 2.2.5-1 has yielded some amazing performance improvements as well; we are seeing 30%+ better ARC cache performance on that version. We are currently happy campers over here. That said, every new kernel release we tried until just a couple of weeks ago had serious issues with ZFS, which is of course worrying looking further ahead. |
Ok, all of that looks like a serious problem with integration between ZFS and Linux kernels 6.7+ then. Am I right in my understanding? |
Yeah, openzfs/zfs#16324 |
Yeah, that's the one I pointed out initially. It looks very, very bad and serious. Maybe I need to find some time to investigate that myself... |
btw a fix for openzfs/zfs#16324 has been merged in master |
I can build a new zfs version. But in the meantime, do we have anyone with a reproducer who can test? |
Ubuntu Noble
LXD v5.21.1 LTS + whatever version of lxcfs is packaged with this
We have a system with about 86 containers and no significant activity, a bit of load but nothing major. Our fleet consists of 60+ hosts which are pretty much identical to this system and run some thousands of containers, yet only on this host does lxcfs crash every couple of days. We have checked all logs we can think of, but all we can find is the crash dump itself in the syslog, as seen below.
Any help diagnosing these crashes is much appreciated, we suspect it's tied to some "unique" customer workload as it's only happening here. It's really rather painful having to reboot the entire host and bring down all instances with it, whenever this happens.
Crash data