Data Movement failure output can be too large and prevents updates to the resource #230

Open
bdevcich opened this issue Jan 16, 2025 · 0 comments
Labels
bug Something isn't working

We had a case where Data Movement took 5h45m before running out of capacity. In this case, the capacity issue was expected because the source file was much larger than the requested rabbit capacity.

#DW jobdw type=lustre name=iotest-stagein capacity=4800GB
#DW copy_in source=/p/lustre4/devcich1/stage_in/ssf destination=$DW_JOB_iotest-stagein

The transfer rate was very slow (837.661 MiB/s), so the run produced a large number of progress messages (via dcp --progress 1). When the data movement failed on reaching capacity, the accumulated output was so large that the k8s resource could not be updated to indicate the failure: the controller tried to update the error message with the entire output.

These are the last few lines of output from dcp. About 20630 lines precede them, since dcp --progress 1 prints progress every second so that the data movement controller can parse those lines to determine progress.

<snipped>
[2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20631.621 secs (837.662 MiB/s) 14418 secs left ...
[2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20632.588 secs (837.661 MiB/s) 14418 secs left ...
ABORT: rank X on HOST: Failed to write file /mnt/nnf/96ccf8e2-af13-451d-b9a4-e2f3fe74b77f-0/testfile errno=28 (No space left on device) @ /deps/mpifileutils/src/common/mfu_io.c:1055
[nnf-dm-controller-manager-86b974b9c4-wfwwz:00069] [[58362,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
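
Since each progress line supersedes the one before it, a parser only needs the most recent line to report progress. The following is a minimal Python sketch of that idea, based only on the line format shown above. It is not the nnf-dm implementation (the controller is written in Go), and the names `PROGRESS_RE` and `last_progress` are hypothetical.

```python
import re

# Matches the dcp progress lines shown above, e.g.
# "Copied 16.482 TiB (59%) in 20632.588 secs (837.661 MiB/s) 14418 secs left ..."
# Assumption: the format is inferred from this issue's sample output only.
PROGRESS_RE = re.compile(
    r"Copied (?P<amount>[\d.]+) (?P<unit>\w+) \((?P<percent>\d+)%\) "
    r"in (?P<elapsed>[\d.]+) secs \((?P<rate>[\d.]+) (?P<rate_unit>[^)\s]+)\)"
)


def last_progress(lines):
    """Return the fields of the most recent progress line, or None if there is none."""
    # Walk backwards so trailing non-progress lines (such as ABORT) are skipped.
    for line in reversed(lines):
        match = PROGRESS_RE.search(line)
        if match:
            return {
                "percent": int(match["percent"]),
                "elapsed_secs": float(match["elapsed"]),
                "rate": float(match["rate"]),
                "rate_unit": match["rate_unit"],
            }
    return None
```

With a parser of this shape, earlier progress lines carry no extra information once a newer one has been read, so they would not need to be retained for progress reporting.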

After this error message, the controller logs errors of its own because it tries to set the resource's error message to this giant wall of output, and the update is rejected as too large (limit 3145728 bytes, i.e. 3 MiB).

2025-01-16T04:08:34.949Z    ERROR    Fatal error    {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement", "NnfDataMovement": {"name":"fluxjob-219291797245925376-1","namespace":"nnf-dm-system"}, "namespace": "nnf-dm-system", "name": "fluxjob-219291797245925376-1", "reconcileID": "acf27b14-ae5c-4b71-9658-5bd1e7bfe51a", "error": "internal error: exit status 255"}
github.com/DataWorkflowServices/dws/api/v1alpha2.(*ResourceError).SetResourceErrorAndLog
    /workspace/vendor/github.com/DataWorkflowServices/dws/api/v1alpha2/resource_error.go:190
github.com/NearNodeFlash/nnf-dm/internal/controller.(*DataMovementReconciler).Reconcile.func2
    /workspace/internal/controller/datamovement_controller.go:349
2025-01-16T04:08:34.986Z    ERROR    failed to update dm status with completion    {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement", "NnfDataMovement": {"name":"fluxjob-219291797245925376-1","namespace":"nnf-dm-system"}, "namespace": "nnf-dm-system", "name": "fluxjob-219291797245925376-1", "reconcileID": "acf27b14-ae5c-4b71-9658-5bd1e7bfe51a", "error": "Request entity too large: limit is 3145728"}
github.com/NearNodeFlash/nnf-dm/internal/controller.(*DataMovementReconciler).Reconcile.func2
    /workspace/internal/controller/datamovement_controller.go:382

This results in the workflow getting stuck in DataIn because the NnfDataMovement resource is stuck in Running even though it has finished and failed.
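
One possible mitigation is to bound the size of the output before it is stored in the error message, keeping only the tail, since the ABORT line appears at the end. The following is a hedged Python sketch of that idea, not the project's fix (the controller is written in Go). `tail_for_status`, `MARKER`, and the 4096-byte budget are hypothetical choices; only the 3145728-byte limit comes from the log above.

```python
# Hypothetical budget for the stored message, far below the
# 3145728-byte request limit reported in the controller log.
MAX_MESSAGE_BYTES = 4096

# Hypothetical marker showing that earlier output was dropped.
MARKER = "<output truncated>"


def tail_for_status(output, max_bytes=MAX_MESSAGE_BYTES):
    """Return output unchanged if it fits, else its trailing whole lines.

    Assumes max_bytes is larger than the marker itself.
    """
    if len(output.encode("utf-8")) <= max_bytes:
        return output

    # Reserve room for the marker and the newline that follows it.
    budget = max_bytes - len(MARKER.encode("utf-8")) - 1
    kept = []
    used = 0
    # Walk backwards so the final lines (where the error is) are kept first.
    for line in reversed(output.splitlines()):
        size = len(line.encode("utf-8")) + 1  # +1 for the joining newline
        if used + size > budget:
            break
        kept.append(line)
        used += size

    kept.append(MARKER)
    kept.reverse()
    return "\n".join(kept)
```

Truncating on whole lines keeps the final error readable, and counting encoded bytes rather than characters matches how a request size limit would be measured.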

Additionally, flux does not seem to handle this case when canceling the workflow. The flux job is removed but the workflow gets orphaned. It does not appear to transition the workflow to Teardown.

@bdevcich bdevcich added the bug Something isn't working label Jan 16, 2025