Passing negative-1 MaxWaitTime hangs DataMovementStatusRequest indefinitely #190

Open
mcfadden8 opened this issue Aug 1, 2024 · 3 comments

Comments

@mcfadden8

The documentation says: "", but the data movement status request call never returns.

2024-08-01 13:19:49:780 AXL rzadams1075: @ nnfdm_start:177 nnfdm::CreateRequest(src=/mnt/nnf/3c1bc64d-4355-48fa-898f-4af6c60d04b1-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0000-00000.silo, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/1/xxxx00000/xxxx-0000-00000.silo)
2024-08-01 13:19:49:804 AXL rzadams1075: @ nnfdm_start:177 nnfdm::CreateRequest(src=/mnt/nnf/3c1bc64d-4355-48fa-898f-4af6c60d04b1-0/martymcf/scr.defjobid/scr.dataset.1/xxxx00000.root, dst=/p/lustre1/martymcf/BDH/lustre-scr_sync_storage/1/xxxx00000.root)
2024-08-01 13:19:49:820 AXL rzadams1075: @ nnfdm_wait:352 0
2024-08-01 13:19:49:820 AXL rzadams1075: @ nnfdm_stat:65 /mnt/nnf/3c1bc64d-4355-48fa-898f-4af6c60d04b1-0/martymcf/scr.defjobid/scr.dataset.1/xxxx-0000-00000.silo

The same call will work if I pass 1 second and continue to poll for between 5 and 10 seconds.

@bdevcich
Contributor

bdevcich commented Aug 2, 2024

Does it hang even when the NnfDataMovement resource in Kubernetes shows that it's finished? Can you check that once you make it hang?

This part of the API has always bothered me because I think a good API should always respond as quickly as possible to the client to minimize wait time and also confirm that nothing is wrong. It's like asking someone a question and they never respond.

Is this something that you use a lot?

@mcfadden8
Author

How do I check that? Do you happen to have a test for this? Under what circumstances does it work?

I was only attempting to use it because the documentation said that I could. I reverted back to polling with a one-second timer. But we have use cases where users just want to wait until the copy is done before proceeding.

@bdevcich
Contributor

bdevcich commented Aug 5, 2024

> How do I check that? Do you happen to have a test for this? Under what circumstances does it work?

As it's running (and presumably hanging), you can query the NnfDataMovement resource in k8s. You won't be able to do this in your application unless the compute nodes have k8s access, but you could do it from somewhere that does. This is basically what the DataMovementStatusRequest is doing for you:

kubectl get -n <rabbit-hostname> nnfdatamovements <request UID>

So if compute-node-1 was attached to rabbit-node-1 and the DataMovementCreateRequest returned a UID of nnf-dm-node-5vghx, you can do this to query it:

$ kubectl get nnfdatamovement -n rabbit-node-1 nnf-dm-node-5vghx
NAME                STATE      STATUS    ERROR   AGE
nnf-dm-node-5vghx   Finished   Success           4m54s

A MaxWaitTime of -1 is not going to respond until that nnfdatamovement is done. So if it's a large request, it's going to appear to hang, since the response won't come until it's finished. I'm hoping that's what's happening here. If the nnfdatamovement resource is showing Finished and the request still isn't responding, then we have an issue.
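
For anyone who wants to script that check from a machine with k8s access, here is a minimal Python sketch. It only wraps the `kubectl get` command shown above and reads the STATE and STATUS columns from the default table output. The function names `parse_state` and `get_state` are made up for illustration, and the parsing assumes the column layout shown in the example output.

```python
import subprocess


def parse_state(kubectl_output):
    """Return (STATE, STATUS) from the default table output of
    `kubectl get nnfdatamovement`.

    Assumes the header and a single resource row, as in the example above.
    STATE and STATUS come before the (possibly empty) ERROR column, so a
    whitespace split still lines them up with the header.
    """
    lines = [ln for ln in kubectl_output.splitlines() if ln.strip()]
    header = lines[0].split()
    row = lines[1].split()
    return row[header.index("STATE")], row[header.index("STATUS")]


def get_state(namespace, uid):
    """Run kubectl for one NnfDataMovement resource and parse its state.

    namespace is the rabbit hostname; uid is the UID returned by the
    create request. Requires kubectl and cluster access on this machine.
    """
    result = subprocess.run(
        ["kubectl", "get", "nnfdatamovement", "-n", namespace, uid],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_state(result.stdout)
```

If `get_state("rabbit-node-1", "nnf-dm-node-5vghx")` reports Finished while the status request with a MaxWaitTime of -1 is still blocked, that would point to the problem case described above.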

> I reverted back to polling with a one-second timer. But we have use cases where users just want to wait until the copy is done before proceeding.

I think this is the best way to do this. It ensures that the server is responding and isn't hung.
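
A rough sketch of that polling pattern in Python, for the "just wait until the copy is done" use case. The `get_status` callable is hypothetical: it stands in for whatever wrapper issues the status request with a bounded max wait (in seconds) and returns a state string. Treating "Finished" as the terminal state is an assumption borrowed from the kubectl output above, and the optional overall deadline is an addition for illustration, not part of the API.

```python
import time


def wait_until_finished(get_status, poll_seconds=1.0, deadline_seconds=None,
                        clock=time.monotonic, sleep=time.sleep):
    """Poll with a bounded wait until the data movement reports completion.

    get_status(max_wait) is a hypothetical callable that issues one status
    request with the given max wait time and returns a state string.
    Raises TimeoutError if deadline_seconds elapses first.
    """
    start = clock()
    while True:
        began = clock()
        state = get_status(poll_seconds)
        if state == "Finished":  # assumed terminal state name
            return state
        now = clock()
        if deadline_seconds is not None and now - start >= deadline_seconds:
            raise TimeoutError(f"still {state!r} after {now - start:.1f}s")
        # If the request came back early, sleep out the rest of the
        # interval so the loop does not spin against the server.
        remaining = poll_seconds - (now - began)
        if remaining > 0:
            sleep(remaining)
```

Each iteration gets a response within roughly one poll interval, so a hung server shows up as a request that fails to return rather than as an indefinite block.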
