-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deferred reference counts. #677
Comments
Shouldn't we need to consider the ideas the Facebook team (e.g. @colesbury, @DinoV) has for deferred reference counting too? |
I agree with everything and all the naming scheme except calling newref "dup" and decref "close". I don't think it's worth it to add additional mental overhead at the moment to CPython. CPython core devs know when it's right to decref, and when it's right to incref. The new names throw them off for something that currently have nearly exact same semantics as incref/decref for no additional clarity/benefit. |
Dup and close are not new and decref. Given we are changing all uses of references in the interpreter, it would be a shame to lose the opportunity to make this improvement. I think it is a bit presumptive to speak for all core devs, but the number of refcounting bugs that have needed fixing over the years shows that at least some core devs (myself included) make mistakes with refcounting. |
Which part of the docs is that? The docs mention:
|
When this was discussed in person I had not read or understood the proposal. Having now read it through, I am very worried about this, in terms of how disturbing it will be to the rest of the CPython code base (object implementations, stdlib extension modules) as well as 3rd party extensions (not counting the Limited API). IIUC every DECREF in the world will change its behavior so that if the refcount reaches zero the object is added to the ZCT, because we can never be sure that there aren't also stack references to an object. This feels like it will break too many cases where code makes (not clearly stated) assumptions about object/resource lifetimes. Many cases will be hard to find or not under our control (3rd party extensions). Doing GC more frequently would increase the frequency of stop the world events, right? That sounds like it could be potentially detrimental to free-threading performance. I wonder if one potential improvement would be if the ZCT entries were to be considered heap references, so that an object that only exists in the ZCT (plus potential stack references) would have a reference count of one, not zero. This would do nothing about lifetimes, but at least it would make it impossible to observe live objects with a zero refcount. For the rest of the algorithms around the ZCT it would make no difference. Basically, DECREF(x) would become if (x->ob_refcnt == 1) {
...add x to ZCT without changing its refcount...
} else {
assert(x->ob_refcnt > 1);
if (!_Py_IsImmortal(x)) {
x->ob_refcnt -= 1;
}
} This does nothing to absolve my worries about lifetimes or free-threading performance, though. |
Yes, but that isn't necessarily a problem.
Just to be clear, it is
A program cannot depend on unreachable objects, as they are unreachable. So object lifetimes isn't an issue. Resource lifetimes do matter, though. That is why we want to base the frequency of collections on the size of allocations, not their number. And don't forget that reference cycles already introduce arbitrary delays in freeing resources.
That make sense. We would want to change |
All the more reason to implement incremental collection for free-threading. |
A few examples of how removing the refcounts improves the JIT stencils: _LOAD_FAST_0From 7 to 3 instructions Main:
Deferred RC:
_STORE_FAST_0:From 17 to 3 instructions! Main:
Deferred RC:
_LOAD_ATTR_INSTANCE_VALUE_0:From 24 to 6 instructions Main:
Deferred RC:
|
Wait where are the instructions that test for the tag in the above JIT stencils? Or are you suggesting removing all decref/incref in these instructions? Or maybe these just for illustrative purposes? The alternative route we can do is refcount analysis like what Cinder does in their JIT. We can create stencil variants automatically that do not decref/incref inputs. For example, if something has a refcount of 1, we don't incref it again, and similarly don't decref it until there's something that makes it 0. |
Nevermind, after reading your scheme more closely, you're not using the tag for deferred stuff. |
Fourth option for Generators and coroutines:
|
And combined with stack caching, |
A couple more points:
|
when python call c code, how and where to save object pointers which are defined in c code? |
Another question is how to handle deferred RC in multi-thread case without GIL? |
CPython uses a combination of naive reference counting and a stack machine.
This results in a lot of reference counting operations that have a major impact on the performance of the JIT and interpreter.
We can reduce the cost of reference counting hugely by using deferred reference counting.
With deferred reference counting, we do not explicitly count references from the active part of the frame stack (in both local variables and the evaluation stack). Instead, we only count those references during GC.
Something like 70% of reference count operations occur in the interpreter. This fraction will vary, but if we start moving code from C to Python for performance and maintainability reasons, it will likely increase.
In order to implement deferred reference counting, we will need new C types for the stack references, some work on this has already been done.
Reference counts
Currently we maintain an exact reference count for each object in that object's header. This is the counted reference count:
counted RC == true RC
With deferred reference counts some references will not be counted in the object's header, but will be deferred:
counted RC + deferred RC == true RC
Note that, since no reference count can be negative, the true reference count can only be zero when both the deferred RC and the counted RC are zero.
This mean that we need a way to keep a handle to all objects whose counted RC reaches zero, so that we can collect them when we know that all deferred RCs are zero.
We will probably implement this using what is known as a Zero Count Table (ZCT). Any object whose counted RC reaches zero, or is assigned to a stack variable when created will be added to the ZCT.
During GC, when all deferred RCs are zero, any objects in the ZCT with a RC of zero can be reclaimed.
Prompt reclamation.
One important advantage of reference counting is prompt reclamation; there is not a long pause between an object becoming unreachable and it getting reclaimed.
In practice, it is not prompt reclamation of objects that matters but prompt reclamation of resources, primarily memory.
Rather than tracking the number of objects allocated to determine when GC should occur, we should track the total memory allocated.
If we are allocating multi-megabyte objects, GC should occur very frequently.
The algorithm
Allocation
Upon allocation of an object:
GC
A GC collection should do the following:
Details
Handling references
We will need to distinguish between heap references and stack references.
Stack references to an object contribute to the deferred RC of that object.
Heap references to an object contribute to the counted RC of that object.
For all code except the GC, we can use the C compiler to help us out.
We can define two opaque structs for stack and heap references
Because we cannot cast a
PyHRef
to aPySRef
or vice-versa, we must do through function calls(provided we are disciplined enough to not extract the bits directly).
The function calls can correctly adjust the reference counts, putting objects in the ZCT if necessary.
In debug mode we can use a bit as a dynamic marker to double-check that we are doing things correctly.
All interpreter frames, including generator and coroutine frames, must contain all deferred RCs or all immediate RCs. Executing frames must contain deferred RCs. The frame will need a bit to indicate whether references are deferred or immediate.
Lowering the cost of stack scanning
At every GC pause, which might be frequent, we need to make sure that the deferred reference count of all objects is zero.
To do that we need to scan the entirety of all stacks. That could be expensive.
To avoid having to scan the entire stack, we treat the lower part (furthest way from the current frame) as heap references and the top part (including the current frame) as stack references. The current frame must always be deferred, so we insert a special frame between the heap and stack parts to convert when a
RETURN
orYIELD
would otherwise start executing a frame of heap references.As we can rely on the special frame to convert from heap to stack references lazily, we don't need to perform that conversion in the GC.
The GC does need to convert stack to heap references, so at the end of a GC collection, all thread stacks will have a special frame as their current frame
Structures
In order to get the most help from the C compiler converting plain
PyObject *
pointers into references suitable for supporting deferred reference counting and tagged ints, we need to define some opaque structs:We should use the dup/close model of HPy for references, as it allows automatic finding of refcount errors which will be a boon for development.
The API
All the
Steal
variants are defined as above, but may be implemented more efficiently.Generators and coroutines
Generators and coroutines are executable objects containing references to other objects, like frames, but they are not on the frame stack; they are on the heap.
Here are three ways we can handle this:
This keeps execution and the compiler simple as generator frames are no different to normal frame. We do, however, need to track all generators so that the GC can convert the references when needed. With many live coroutines, this could get expensive.
This requires us to duplicate all instructions, which would be infeasible, or have the compiler insert additional reference count instructions where needed. With reasonable static analysis, this could perform reasonably well, but adds a lot of complexity to the compiler
STORE_FAST
, but other instructions are unchanged. The compiler will need to either, make sure the stack is empty when suspended, or make sure that all values on the stack are also present in a (counted) local variable. The latter is probably more efficient.Of these, I think 3. is the best. It is reasonably efficient, and can be implemented and tested prior to the rest of the work on deferred references.
Tagging
Since references on the stack can be either stack references or heap references, it will be relatively easy to get the type wrong.
To help prevent that we can tag references.
In theory no tags should be necessary, and we could just do this for debug builds. However, if we are to support tagged ints, we will need tagging anyway, so we might as well use it for all code.
For consistency with tagged ints, we want 0 to mean "not a heap reference", i.e. stack reference (or tagged int), leading to the following tagging scheme:
With tagged ints, we get this scheme:
The text was updated successfully, but these errors were encountered: