Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AddFinalizer clears Status subobject leading to delayed object readiness #791

Open
gravufo opened this issue Dec 6, 2024 · 9 comments
Open
Labels
bug Something isn't working

Comments

@gravufo
Copy link

gravufo commented Dec 6, 2024

What happened?

We create managed resources using only the Observe management policy and the provider will initially set the Synced condition to True but will not set the Ready condition at all.
We can see in the provider logs that the object seems to be successfully reconciled and is requeued according to the configured sync interval.
Once the sync interval is hit, the resource becomes ready.

How can we reproduce it?

  1. Create any MR in the cluster with only the Observe management policy. We can easily reproduce this with provider-upjet-azure in any resource, for instance VirtualNetwork with the following spec:
apiVersion: network.azure.upbound.io/v1beta2
kind: VirtualNetwork
metadata:
  annotations:
    crossplane.io/external-name: vnet-test-crossplane
  name: vnet-test-crossplane
spec:
  deletionPolicy: Delete
  forProvider:
    resourceGroupName: rg-test-crossplane
  initProvider: {}
  managementPolicies:
  - Observe
  providerConfigRef:
    name: default
  1. You will see the object become Synced almost immediately, but the Ready condition will stay nil. Also, the whole atProvider struct stays nil. Provider logs will show the following:
2024-12-06T20:57:33Z    DEBUG    provider-azure    Reconciling    {"controller": "managed/network.azure.upbound.io/v1beta1, kind=virtualnetwork", "request": {"name":"vnet-test-crossplane"}}
2024-12-06T20:57:33Z    DEBUG    provider-azure    Connecting to the service provider    {"uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "name": "vnet-test-crossplane", "gvk": "network.azure.upbound.io/v1beta1, Kind=VirtualNetwork"}
2024-12-06T20:57:33Z    DEBUG    provider-azure    Instance state not found in cache, reconstructing...    {"uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "name": "vnet-test-crossplane", "gvk": "network.azure.upbound.io/v1beta1, Kind=VirtualNetwork"}
2024-12-06T20:57:33Z    DEBUG    provider-azure    Observing the external resource    {"uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "name": "vnet-test-crossplane", "gvk": "network.azure.upbound.io/v1beta1, Kind=VirtualNetwork"}
2024-12-06T20:57:34Z    DEBUG    provider-azure    Diff detected    {"uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "name": "vnet-test-crossplane", "gvk": "network.azure.upbound.io/v1beta1, Kind=VirtualNetwork", "instanceDiff": "*terraform.InstanceDiff{mu:sync.Mutex{state:0, sema:0x0}, Attributes:map[string]*terraform.ResourceAttrDiff{\"location\":*terraform.ResourceAttrDiff{Old:\"canadacentral\", New:\"\", NewComputed:false, NewRemoved:true, NewExtra:interface {}(nil), RequiresNew:true, Sensitive:false, Type:0x0}}, Destroy:false, DestroyDeposed:false, DestroyTainted:false, RawConfig:cty.NilVal, RawState:cty.NilVal, RawPlan:cty.NilVal, Meta:map[string]interface {}(nil)}"}
2024-12-06T20:57:34Z    DEBUG    provider-azure    Skipping update due to managementPolicies. Reconciliation succeeded    {"controller": "managed/network.azure.upbound.io/v1beta1, kind=virtualnetwork", "request": {"name":"vnet-test-crossplane"}, "uid": "bb757cb5-8934-471d-981b-7862513ac9b1", "version": "1176637016", "external-name": "vnet-test-crossplane", "requeue-after": "2024-12-07T02:49:57Z"} 
  1. You can either wait for the requeue to trigger or you can restart the provider.
  2. After that, the Ready condition will be set to true.

What environment did it happen in?

Crossplane version: 1.18.0

Additional information

After step-by-step debugging, we have found that the bug seems to come from the AddFinalizer function in the APIFinalizer struct. This struct is used by provider-upjet-azure and probably all upjet based providers (although I didn't validate).
The AddFinalizer function calls the kube client's Update function which will apply an object in the kube API, but also return the updated object returned by the API to the caller. This, in turn, nullifies any struct that is not applied by the Update function such as the Status sub-struct.

Here are the relevant parts of the code:

@gravufo gravufo added the bug Something isn't working label Dec 6, 2024
@gravufo
Copy link
Author

gravufo commented Dec 6, 2024

I see 2 ways of fixing this:

  1. In reconciler.go, we can save the status sub-struct and set it back after the finalizer call
  2. In the AddFinalizer function, we can do the same

The benefit of 1 is that it applies to everyone, no matter the finalizer implementation they used while 2 only fixes that specific implementation.

We should probably also see if there are other places where this could happen.

@mergenci
Copy link
Member

mergenci commented Dec 7, 2024

This, in turn, nullifies any struct that is not applied by the Update function such as the Status sub-struct.

I remember that the quoted behavior requires providers to reconcile multiple times in order to apply a set of changes. The comment in the late initialization logic acknowledges the behavior:

if observation.ResourceLateInitialized && policy.ShouldLateInitialize() {
// Note that this update may reset any pending updates to the status of
// the managed resource from when it was observed above. This is because
// the API server replies to the update with its unchanged view of the
// resource's status, which is subsequently deserialized into managed.
// This is usually tolerable because the update will implicitly requeue
// an immediate reconcile which should re-observe the external resource
// and persist its status.
if err := r.client.Update(ctx, managed); err != nil {
log.Debug(errUpdateManaged, "error", err)
record.Event(managed, event.Warning(reasonCannotUpdateManaged, err))
managed.SetConditions(xpv1.ReconcileError(errors.Wrap(err, errUpdateManaged)))
return reconcile.Result{Requeue: true}, errors.Wrap(r.client.Status().Update(ctx, managed), errUpdateManagedStatus)
}
}

@gravufo
Copy link
Author

gravufo commented Dec 7, 2024

You're right, that comment explains exactly the behaviour I'm seeing for the update call. However, it says that the update will implicitly requeue an immediate reconcile, but that is not what I'm seeing here, at least when the object is in Observe-only mode and when the update is a call to add the finalizer.
Either a behaviour changed in Kube client/kube controller runtime or something with Observe-only codepath makes it behave differently.
Having step-by-step debugged it, I can tell for sure that there was only 1 reconcile and the object was simply requeued for time which in my case is 6 hours. The object stayed at Synced = true and Ready = nil with a nil atProvider struct. The only ways to get the object ready are to force a reconcile somehow, either by restarting the provider or by making a dummy modification on the object (adding an annotation or whatever). You can also wait , but that is not an acceptable solution in my case.

I think that for efficiency it would be better to handle these cases in a single reconcile to reduce load on providers pods, provider's external API and kubernetes API. On thousands of objects, this can make a significant impact. Also, assuming the requeue was actually done immediately, it would still potentially be behind hundreds of other objects in the queue. Again, for the observe-only use-case where a single reconcile would be enough, I feel like we can optimize it.

Are you against saving the Status sub-struct and restoring it after the update call? If so, what alternative do you suggest?

@bobh66
Copy link
Contributor

bobh66 commented Dec 7, 2024

I would imagine that the comment assumes the typical scenario of full control by Crossplane which includes a subsequent Create call which will trigger the update, and in the Observe case there is no subsequent operation to trigger the status update.

@mergenci
Copy link
Member

Are you against saving the Status sub-struct and restoring it after the update call? If so, what alternative do you suggest?

I should have clarified that I don't have any opinions on the matter — I don't know enough about the managed reconciler design. I just wanted to provide reference for the described behavior.

@gravufo
Copy link
Author

gravufo commented Dec 30, 2024

I understand. Either way, this is a real bug and I want to propose a solution. Is there any maintainer that could help push this forward by suggesting an alternative or approving my suggestion so I can implement and propose a PR?

@turkenh
Copy link
Member

turkenh commented Jan 7, 2025

Thanks for the thorough investigation and detailed report @gravufo, I see the problem and agree with the pain it creates.

It is interesting adding a finalizer doesn't cause object to be re-queued. So, if I watch a resource with "kubectl get foo" and in another terminal I add a finalizer to it, no new watch event would be thrown? 🤔

  1. In reconciler.go, we can save the status sub-struct and set it back after the finalizer call

This seems like a reasonable solution, and I can't think of a better way to handle this. So, feel free to come up with a PR.

@gravufo
Copy link
Author

gravufo commented Jan 10, 2025

According to kedacore/keda#437 (comment) it would seem that adding finalizers should not update the generation, thus not triggering a reconcile. It makes sense.
However, it seems like sometimes it works if the update actually adds fields that were not initially defined (if they are not pointers in the schema), thus triggering a reconcile in those cases.
This matches the behaviour we are seeing where it works in some types of MRs but not others.
I will go ahead and create a PR for this.

@gravufo
Copy link
Author

gravufo commented Jan 10, 2025

@turkenh It's not quite easy to modify the status sub-object since there are no helper methods for that and it feels awkward having to use runtime.DefaultUnstructuredConverter.ToUnstructured. Do you have any suggestion on how to approach this?

The other option I can see is to move the AddFinalizer block before the Observe call. I'm not sure why it's so low in the reconcile loop, I feel like it should basically be the first thing being done. Unless we are trying to ensure we can handle the object before setting it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants