leaks when loading tree sequence in SLiM #2868
I have reproduced the leak with valgrind. Investigating.
So valgrind here is giving:
Which indicates we're trying to write uninit'd memory. Which is odd. Could it be that you're dumping a free'd ts? I haven't dug into the SLiM code yet. Time to sleep!
Hmm, odd. There's no sign of a problem like that; all we have is the memory leak, there's never a crash or any other bad behavior. The model seems to work fine, and it can reload the tree sequence over and over without problems. If it were relying upon uninitialized memory, I wouldn't expect that to be the case. SLiM passes all its unit tests, and lots of people are using it without problems; it just leaks. Very strange bug, so far.
Could you post a version that does the same kind of thing but runs much faster? Valgrind makes a process run much slower.
Probably yes; give me a couple of minutes. I apologize that this is disturbing your sleep; me too. :-O
OK, I think this runs much faster and still exhibits the leak:
For a quick one using scripts from pyslim:
and the output is (lengthy and) attached.
So does this mean that it is definitively a regression, then? Or has this test not been run under valgrind before? (It would be nice if there were a way to check for leaks with CI; I've not looked into that.)
Hm, well I've just realized we haven't updated the tskit code in a bit (since C_1.0, and now we're at C_1.1.2). I've done a quick update to the code: MesserLab/SLiM#406 (but haven't yet checked if it works...).
Yes, worthwhile for sure. I have a vague feeling that we had a patch living in SLiM; did you do a diff with the old sources before you copied in the new? Anyway, I can do a test tomorrow to see if this helps the problem, but if @benjeffery doesn't remember fixing this bug, my guess is that it won't...
Well, running my test above with the tskit-updated SLiM results in fewer bytes "definitely lost" and a lot fewer entries in the valgrind log. Here's the result (attached).
If this doesn't fix it, perhaps the thing to do is to extract the tsk_X calls out into a simple C program and verify there?
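Something along these lines could be a starting point; it's only a sketch (untested here), `saved.trees` is a placeholder for whatever file the SLiM model writes out, and it assumes the tskit C headers and library are available:

```c
#include <stdio.h>
#include <stdlib.h>
#include <tskit/tables.h>

/* Repeatedly load and free a table collection, mimicking SLiM's
 * save-then-reload pattern, so it can be run under valgrind or leaks. */
int
main(void)
{
    tsk_table_collection_t tables;
    int ret;
    int j;

    for (j = 0; j < 10; j++) {
        /* Load (re)initializes the collection from the file. */
        ret = tsk_table_collection_load(&tables, "saved.trees", 0);
        if (ret != 0) {
            fprintf(stderr, "load error: %s\n", tsk_strerror(ret));
            exit(EXIT_FAILURE);
        }
        /* Free everything again, as SLiM does before each reload. */
        tsk_table_collection_free(&tables);
    }
    return 0;
}
```

If a loop like that comes back clean under valgrind or leaks, the problem presumably lives in how the surrounding code uses the table collection rather than in tskit itself.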
I don't know why this issue got closed, WTF. Oh, @petrelharp, you need to watch your wording: you marked that PR as fixing this issue! :-> Reopening. For the record: there was indeed a patch in SLiM's copy of tskit, but it got resolved by the update, so all is well.
Looks to me like this has resolved the vast majority of the problem. We still get a 400-byte leak with every reload; I'm not sure where it's coming from exactly; one tool indicates the line
Oh,
Thanks for the update @petrelharp!
I also did a quick test with my original SLiM script, using SLiM from the last commit from @petrelharp (MesserLab/SLiM@dc0e413). Thank you @petrelharp @bhaller
I don't remember fixing anything like this, but the tskit C API has pretty good leak checking as part of CI, so I'd be surprised if we have an issue there. I'll try again with the quicker code and the updated code.
The quicker code is still taking over 30min to run under valgrind, and the pyslim "restart_WF" script is giving me:
So I can't really dig into this without a working minimal script.
@benjeffery Hmm, "unrecognized named argument subpopMap" seems to indicate that you're running an old version of SLiM, I think? As for the script still taking more than 30min to run, yes – sorry, I'm an idiot. I assumed you were working in the debugger in the same way as me, observing leaks across one reload operation at a time, but of course you're not, you're running the script to completion under valgrind – and the script I supplied runs forever. It just keeps doing reloads over and over, ad infinitum, because that's how I was testing. (It's "faster" in the sense that the time to the first reload and the time between reloads is much shorter, which for some reason is what I thought you meant.) So, mea culpa, I apologize for wasting 30 minutes of your life. :-> Here is a script that does two reloads and then ends:
This should run to completion under valgrind in perhaps a minute or so; without valgrind it takes only ten seconds or so. Again, my apologies.
I've just pushed a fix for the remaining 400-byte-per-reload leak, which was due to not calling free() on a malloc'd buffer.
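For context, that kind of fix usually looks like the following; a generic illustration only, with hypothetical names, not the actual SLiM code:

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical example of the bug class: a temporary malloc'd buffer that
 * was used and then forgotten. The fix is simply the free() at the end. */
static void
copy_record(const char *record, size_t len)
{
    char *buffer = malloc(len + 1);
    if (buffer == NULL) {
        return;
    }
    memcpy(buffer, record, len);
    buffer[len] = '\0';
    /* ... use buffer ... */
    free(buffer);   /* previously missing: without this, every call leaks */
}
```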
Thanks, I had a fresh clone, but somehow master was out-of-date. No idea why!
Ok, I'm going to do a little bit of digging, but glad this is fixed now.
Hmm, having been through the diff I don't see anything. All the kastore memory ownership changes were already in 1.0 and are pretty heavily tested for leaks etc. I don't think it's worth digging any further here unless we get any other leak reports. Thanks for raising the issue.
Interesting. Thanks for looking into it. Must be something about SLiM's particular usage pattern, I guess. It would be nice to have leak-testing as part of SLiM's CI; right now the leak-testing I do is just ad hoc. Something to ponder.
The tskit ones are defined here:
Phew! Glad that worked to fix things.
Hi folks (ping @petrelharp). I need your help debugging a problem; I'm a bit mystified. Sorry for the long report.
It's a leak that occurs when SLiM is running a specific kind of model: one that saves out a tree sequence and then later reloads that same tree sequence, potentially reloading the same .trees file multiple times. With each reload, including the very first time, a bunch of memory is getting leaked. I can watch this happening live in the debugger, as I'll describe below. First of all, here's a SLiM model that makes it happen:
It's more crufty than it needs to be, but the details of it are unimportant. The only important bits are that it saves out a tree sequence (marked with `catn("WRITING...")`), and then later loads that tree sequence back in (marked with `catn("RESTARTING...")` in two places). The leak occurs across the call to `sim.readFromPopulationFile()`, 100% reproducibly. I think I have checked for a leak from this specific usage pattern in the past, and it did not leak; perhaps I'm misremembering, or my test case back then was somehow different, but I suspect a regression somewhere.

So, what does `sim.readFromPopulationFile()` do? After percolating through the innards of the Eidos interpreter's dispatcher, etc., we end up at some simple C++ code:

The `...` bits are just cruft, definitely uninvolved, but I can come up with links to the lines in GitHub if you want to see the complete code. So, in a nutshell, first we free the tree sequence, then we load the replacement back in.
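In tskit C API terms, the pattern being described is roughly the following; this is a sketch for orientation only, not SLiM's actual code, with `tables_` and `reload_tree_sequence()` as stand-ins for the member variable and call path discussed below:

```c
#include <stdio.h>
#include <tskit/tables.h>

/* Stand-in for SLiM's persistent table collection (the tables_ member). */
static tsk_table_collection_t tables_;

/* Sketch of the sequence described above: free the old contents, then load
 * the saved .trees file back into the very same collection. Assumes tables_
 * was previously loaded (as in SLiM's usage). */
static int
reload_tree_sequence(const char *path)
{
    int ret;

    tsk_table_collection_free(&tables_);   /* what FreeTreeSequence() boils down to */
    ret = tsk_table_collection_load(&tables_, path, 0);
    if (ret != 0) {
        fprintf(stderr, "load error: %s\n", tsk_strerror(ret));
    }
    return ret;
}
```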
`FreeTreeSequence()` is just this:

I've stepped through this in the debugger; `tsk_table_collection_free()` is definitely called. I walked through it and watched the tables getting freed one by one, and I looked at the table collection data structure afterwards and saw null pointers as appropriate; it definitely got freed. And note that this table collection, `tables_`, is the same one that `tsk_table_collection_load()` loads into; it's the permanent table collection kept by SLiM's `Species` class. So there shouldn't be any live data left in the table collection for us to overwrite (which would be the obvious way that a leak would occur in this scenario).

OK, so here's the kicker. Step back out to `_InitializePopulationFromTskitBinaryFile()`, and up to the point where `tsk_table_collection_load()` is about to be called. Now, on macOS there's a command-line tool called `leaks` that scans a running process for leaks and prints out a summary, and (it's wonderful) you can run it on a process that is stopped in the debugger. So, run `leaks` at that moment; it reports zero leaks. Step over `tsk_table_collection_load()`, and now there are a whole bunch of leaks reported. 100% reproducible.

At a finer level of detail, one can step inside `tsk_table_collection_load()` and watch the leaks happen one by one. Inside `read_table_cols()`, there is a call to `kastore_containss()`. I can measure leaks before that call, and it says there are X leaks (based on how much leaking has already occurred during that debugger session). Step over that call and measure leaks again, and now there are X+1 leaks. Each new call to `kastore_containss()` leaks one more malloced block (with a little variation in the exact timing, discussed below). These appear to be the full-size table columns getting leaked; the blocks are as big as 9 MB for the run I'm looking at now. And if you suspect that the `leaks` tool is lying to me (I did :->), note that a completely different macOS debugging tool, Instruments, also has a leak checker, and it reports the leaks as coming from `kastore_containss()` too.

So, `kastore_containss()` percolates down to `kastore_get()`. That executes without any leak being evidenced, but the leak appears in the output from `leaks` soon thereafter, at what seems to be a somewhat unpredictable time. I suspect that a pointer to the leaked block remains on the stack for a little while, in some local variable that has fallen out of scope but has not yet been overwritten, and then when that stack memory happens to get overwritten the leak registers. I suspect that the culprit is somehow related to these lines in `kastore_get()`:

These lines do execute; the entry in `item->array`, which pointed to the new block just read in, gets set to NULL. I'm not sure what that code is doing exactly. I see that `tsk_table_collection_loadf_inited()` forces the `KAS_GET_TAKES_OWNERSHIP` flag on. But I'm not 100% sure this is the culprit; it's just the most likely-looking code I can see in what I've stepped through.
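If I'm reading it right, the ownership rule that flag implies is something like the following sketch, in plain kastore terms; this is going from memory on the API, and it assumes the flag is the one supplied via `kastore_open()`'s flags argument, so treat it as approximate:

```c
#include <stdio.h>
#include <stdlib.h>
#include <kastore.h>

/* Sketch of the ownership rule under discussion: with KAS_GET_TAKES_OWNERSHIP
 * set, a buffer handed back by kastore_gets() belongs to the caller, who must
 * free() it; kastore drops its own reference (the item->array = NULL seen
 * above), so kastore_close() will no longer free it on the caller's behalf. */
static int
read_column(const char *path, const char *key)
{
    kastore_t store;
    void *array = NULL;
    size_t array_len;
    int type;
    int ret;

    ret = kastore_open(&store, path, "r", KAS_GET_TAKES_OWNERSHIP);
    if (ret != 0) {
        fprintf(stderr, "kastore_open failed: %d\n", ret);
        return ret;
    }
    ret = kastore_gets(&store, key, &array, &array_len, &type);
    if (ret == 0) {
        /* ... use array, array_len, type ... */
        free(array);   /* caller-owned now: omitting this free() is a leak */
    }
    kastore_close(&store);
    return ret;
}
```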
KAS_GET_TAKES_OWNERSHIP
does exactly, that flag is not in SLiM's control, but is forced on by tskit. I'm drawing a blank as to how to pursue this further. Maybe the way we're using the table collection is just wrong; if so, please enlighten me :->. I'm happy to zoom, screen-share, and show you all this live in the debugger. Perhaps if you have SLiM installed you can reproduce it yourself, using the model given above, with whatever leaks tool you have on your platform.In summary: HELP!