AddFinalizer clears Status subobject leading to delayed object readiness #791

gravufo · 2024-12-06T21:04:42Z

What happened?

We create managed resources using only the Observe management policy and the provider will initially set the Synced condition to True but will not set the Ready condition at all.
We can see in the provider logs that the object seems to be successfully reconciled and is requeued according to the configured sync interval.
Once the sync interval is hit, the resource becomes ready.

How can we reproduce it?

Create any MR in the cluster with only the Observe management policy. We can easily reproduce this with provider-upjet-azure in any resource, for instance VirtualNetwork with the following spec:

apiVersion: network.azure.upbound.io/v1beta2
kind: VirtualNetwork
metadata:
  annotations:
    crossplane.io/external-name: vnet-test-crossplane
  name: vnet-test-crossplane
spec:
  deletionPolicy: Delete
  forProvider:
    resourceGroupName: rg-test-crossplane
  initProvider: {}
  managementPolicies:
  - Observe
  providerConfigRef:
    name: default

You will see the object become Synced almost immediately, but the Ready condition will stay nil. Also, the whole atProvider struct stays nil. Provider logs will show the following:

2024-12-06T20:57:33Z    DEBUG    provider-azure    Reconciling    {"controller": "managed/network.azure.upbound.io/v1beta1, kind=virtualnetwork", "request": {"name":"vnet-test-crossplane"}}
2024-12-06T20:57:33Z    DEBUG    provider-azure    Connecting to the service provider    {"uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "name": "vnet-test-crossplane", "gvk": "network.azure.upbound.io/v1beta1, Kind=VirtualNetwork"}
2024-12-06T20:57:33Z    DEBUG    provider-azure    Instance state not found in cache, reconstructing...    {"uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "name": "vnet-test-crossplane", "gvk": "network.azure.upbound.io/v1beta1, Kind=VirtualNetwork"}
2024-12-06T20:57:33Z    DEBUG    provider-azure    Observing the external resource    {"uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "name": "vnet-test-crossplane", "gvk": "network.azure.upbound.io/v1beta1, Kind=VirtualNetwork"}
2024-12-06T20:57:34Z    DEBUG    provider-azure    Diff detected    {"uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "name": "vnet-test-crossplane", "gvk": "network.azure.upbound.io/v1beta1, Kind=VirtualNetwork", "instanceDiff": "*terraform.InstanceDiff{mu:sync.Mutex{state:0, sema:0x0}, Attributes:map[string]*terraform.ResourceAttrDiff{\"location\":*terraform.ResourceAttrDiff{Old:\"canadacentral\", New:\"\", NewComputed:false, NewRemoved:true, NewExtra:interface {}(nil), RequiresNew:true, Sensitive:false, Type:0x0}}, Destroy:false, DestroyDeposed:false, DestroyTainted:false, RawConfig:cty.NilVal, RawState:cty.NilVal, RawPlan:cty.NilVal, Meta:map[string]interface {}(nil)}"}
2024-12-06T20:57:34Z    DEBUG    provider-azure    Skipping update due to managementPolicies. Reconciliation succeeded    {"controller": "managed/network.azure.upbound.io/v1beta1, kind=virtualnetwork", "request": {"name":"vnet-test-crossplane"}, "uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "version": "1176637016", "external-name": "vnet-test-crossplane", "requeue-after": "2024-12-07T02:49:57Z"}

You can either wait for the requeue to trigger or you can restart the provider.
After that, the Ready condition will be set to true.

What environment did it happen in?

Crossplane version: 1.18.0

Additional information

After step-by-step debugging, we have found that the bug seems to come from the AddFinalizer function in the APIFinalizer struct. This struct is used by provider-upjet-azure and probably all upjet based providers (although I didn't validate).
The AddFinalizer function calls the kube client's Update function which will apply an object in the kube API, but also return the updated object returned by the API to the caller. This, in turn, nullifies any struct that is not applied by the Update function such as the Status sub-struct.

Here are the relevant parts of the code:

AddFinalizer function
- This ends up calling Into from the kube client
Reconciler's call to AddFinalizer

The text was updated successfully, but these errors were encountered:

gravufo · 2024-12-06T21:56:53Z

I see 2 ways of fixing this:

In reconciler.go, we can save the status sub-struct and set it back after the finalizer call
In the AddFinalizer function, we can do the same

The benefit of 1 is that it applies to everyone, no matter the finalizer implementation they used while 2 only fixes that specific implementation.

We should probably also see if there are other places where this could happen.

mergenci · 2024-12-07T06:04:43Z

This, in turn, nullifies any struct that is not applied by the Update function such as the Status sub-struct.

I remember that the quoted behavior requires providers to reconcile multiple times in order to apply a set of changes. The comment in the late initialization logic acknowledges the behavior:

crossplane-runtime/pkg/reconciler/managed/reconciler.go

Lines 1276 to 1290 in 19d95a6

    
           if observation.ResourceLateInitialized && policy.ShouldLateInitialize() { 
        
           	// Note that this update may reset any pending updates to the status of 
        
           	// the managed resource from when it was observed above. This is because 
        
           	// the API server replies to the update with its unchanged view of the 
        
           	// resource's status, which is subsequently deserialized into managed. 
        
           	// This is usually tolerable because the update will implicitly requeue 
        
           	// an immediate reconcile which should re-observe the external resource 
        
           	// and persist its status. 
        
           	if err := r.client.Update(ctx, managed); err != nil { 
        
           		log.Debug(errUpdateManaged, "error", err) 
        
           		record.Event(managed, event.Warning(reasonCannotUpdateManaged, err)) 
        
           		managed.SetConditions(xpv1.ReconcileError(errors.Wrap(err, errUpdateManaged))) 
        
           		return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus) 
        
           	} 
        
           }

gravufo · 2024-12-07T15:34:38Z

You're right, that comment explains exactly the behaviour I'm seeing for the update call. However, it says that the update will implicitly requeue an immediate reconcile, but that is not what I'm seeing here, at least when the object is in Observe-only mode and when the update is a call to add the finalizer.
Either a behaviour changed in Kube client/kube controller runtime or something with Observe-only codepath makes it behave differently.
Having step-by-step debugged it, I can tell for sure that there was only 1 reconcile and the object was simply requeued for time which in my case is 6 hours. The object stayed at Synced = true and Ready = nil with a nil atProvider struct. The only ways to get the object ready are to force a reconcile somehow, either by restarting the provider or by making a dummy modification on the object (adding an annotation or whatever). You can also wait , but that is not an acceptable solution in my case.

I think that for efficiency it would be better to handle these cases in a single reconcile to reduce load on providers pods, provider's external API and kubernetes API. On thousands of objects, this can make a significant impact. Also, assuming the requeue was actually done immediately, it would still potentially be behind hundreds of other objects in the queue. Again, for the observe-only use-case where a single reconcile would be enough, I feel like we can optimize it.

Are you against saving the Status sub-struct and restoring it after the update call? If so, what alternative do you suggest?

bobh66 · 2024-12-07T17:17:42Z

I would imagine that the comment assumes the typical scenario of full control by Crossplane which includes a subsequent Create call which will trigger the update, and in the Observe case there is no subsequent operation to trigger the status update.

mergenci · 2024-12-29T13:57:16Z

Are you against saving the Status sub-struct and restoring it after the update call? If so, what alternative do you suggest?

I should have clarified that I don't have any opinions on the matter — I don't know enough about the managed reconciler design. I just wanted to provide reference for the described behavior.

gravufo · 2024-12-30T22:55:30Z

I understand. Either way, this is a real bug and I want to propose a solution. Is there any maintainer that could help push this forward by suggesting an alternative or approving my suggestion so I can implement and propose a PR?

turkenh · 2025-01-07T15:59:10Z

Thanks for the thorough investigation and detailed report @gravufo, I see the problem and agree with the pain it creates.

It is interesting adding a finalizer doesn't cause object to be re-queued. So, if I watch a resource with "kubectl get foo" and in another terminal I add a finalizer to it, no new watch event would be thrown? 🤔

In reconciler.go, we can save the status sub-struct and set it back after the finalizer call

This seems like a reasonable solution, and I can't think of a better way to handle this. So, feel free to come up with a PR.

gravufo · 2025-01-10T22:35:24Z

According to kedacore/keda#437 (comment) it would seem that adding finalizers should not update the generation, thus not triggering a reconcile. It makes sense.
However, it seems like sometimes it works if the update actually adds fields that were not initially defined (if they are not pointers in the schema), thus triggering a reconcile in those cases.
This matches the behaviour we are seeing where it works in some types of MRs but not others.
I will go ahead and create a PR for this.

gravufo · 2025-01-10T23:41:46Z

@turkenh It's not quite easy to modify the status sub-object since there are no helper methods for that and it feels awkward having to use runtime.DefaultUnstructuredConverter.ToUnstructured. Do you have any suggestion on how to approach this?

The other option I can see is to move the AddFinalizer block before the Observe call. I'm not sure why it's so low in the reconcile loop, I feel like it should basically be the first thing being done. Unless we are trying to ensure we can handle the object before setting it?

gravufo added the bug Something isn't working label Dec 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AddFinalizer clears Status subobject leading to delayed object readiness #791

AddFinalizer clears Status subobject leading to delayed object readiness #791

gravufo commented Dec 6, 2024 •

edited

Loading

gravufo commented Dec 6, 2024

mergenci commented Dec 7, 2024

gravufo commented Dec 7, 2024

bobh66 commented Dec 7, 2024 •

edited

Loading

mergenci commented Dec 29, 2024

gravufo commented Dec 30, 2024

turkenh commented Jan 7, 2025 •

edited

Loading

gravufo commented Jan 10, 2025

gravufo commented Jan 10, 2025

AddFinalizer clears Status subobject leading to delayed object readiness #791

AddFinalizer clears Status subobject leading to delayed object readiness #791

Comments

gravufo commented Dec 6, 2024 • edited Loading

What happened?

How can we reproduce it?

What environment did it happen in?

Additional information

gravufo commented Dec 6, 2024

mergenci commented Dec 7, 2024

gravufo commented Dec 7, 2024

bobh66 commented Dec 7, 2024 • edited Loading

mergenci commented Dec 29, 2024

gravufo commented Dec 30, 2024

turkenh commented Jan 7, 2025 • edited Loading

gravufo commented Jan 10, 2025

gravufo commented Jan 10, 2025

gravufo commented Dec 6, 2024 •

edited

Loading

bobh66 commented Dec 7, 2024 •

edited

Loading

turkenh commented Jan 7, 2025 •

edited

Loading