Data Movement failure output can be too large and prevents updates to the resource #230

bdevcich · 2025-01-16T15:45:25Z

We had a case where Data Movement took 5h45m before running out of capacity. In this case, the capacity issue was expected because the source file was much larger than the requested rabbit capacity.

#DW jobdw type=lustre name=iotest-stagein capacity=4800GB
#DW copy_in source=/p/lustre4/devcich1/stage_in/ssf destination=$DW_JOB_iotest-stagein

The transfer rate was very slow (837.661 MiB/s), so dcp --progress 1 produced a large number of progress messages. When the data movement failed on reaching capacity, the accumulated output was so large that the k8s resource could not be updated to indicate the failure: the controller tried to set the error message to the entire output.

These are the last few lines of output from dcp. About 20630 lines precede them, since dcp --progress 1 prints progress every second. This is done so that the data movement controller can parse those lines to determine progress.

<snipped>
[2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20631.621 secs (837.662 MiB/s) 14418 secs left ...
[2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20632.588 secs (837.661 MiB/s) 14418 secs left ...
ABORT: rank X on HOST: Failed to write file /mnt/nnf/96ccf8e2-af13-451d-b9a4-e2f3fe74b77f-0/testfile errno=28 (No space left on device) @ /deps/mpifileutils/src/common/mfu_io.c:1055
[nnf-dm-controller-manager-86b974b9c4-wfwwz:00069] [[58362,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
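
As an illustration of the parsing mentioned above, here is a minimal Go sketch that pulls the percentage out of a progress line in the format shown. It is not the nnf-dm implementation; the helper name parseProgressPercent and the regular expression are assumptions based only on the sample lines in this issue.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// progressRe matches the "Copied <amount> <unit> (<N>%)" part of a dcp
// progress line. The pattern is an assumption derived from the sample
// output in this issue, not from the dcp source.
var progressRe = regexp.MustCompile(`Copied\s+\S+\s+\S+\s+\((\d+)%\)`)

// parseProgressPercent (hypothetical helper) returns the percent complete
// reported in a single line of output, or false if the line is not a
// progress line.
func parseProgressPercent(line string) (int, bool) {
	m := progressRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	pct, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return pct, true
}

func main() {
	line := "[2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20632.588 secs (837.661 MiB/s) 14418 secs left ..."
	if pct, ok := parseProgressPercent(line); ok {
		fmt.Printf("progress: %d%%\n", pct) // prints "progress: 59%"
	}
}
```

Only the most recent progress line is needed to report progress, so the earlier ones do not have to be retained once parsed.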

After this error message, the controller logs errors of its own because it tries to set the error message using this giant wall of output.

2025-01-16T04:08:34.949Z    ERROR    Fatal error    {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement", "NnfDataMovement": {"name":"fluxjob-219291797245925376-1","namespace":"nnf-dm-system"}, "namespace": "nnf-dm-system", "name": "fluxjob-219291797245925376-1", "reconcileID": "acf27b14-ae5c-4b71-9658-5bd1e7bfe51a", "error": "internal error: exit status 255"}
github.com/DataWorkflowServices/dws/api/v1alpha2.(*ResourceError).SetResourceErrorAndLog
    /workspace/vendor/github.com/DataWorkflowServices/dws/api/v1alpha2/resource_error.go:190
github.com/NearNodeFlash/nnf-dm/internal/controller.(*DataMovementReconciler).Reconcile.func2
    /workspace/internal/controller/datamovement_controller.go:349
2025-01-16T04:08:34.986Z    ERROR    failed to update dm status with completion    {"controller": "nnfdatamovement", "controllerGroup": "nnf.cray.hpe.com", "controllerKind": "NnfDataMovement", "NnfDataMovement": {"name":"fluxjob-219291797245925376-1","namespace":"nnf-dm-system"}, "namespace": "nnf-dm-system", "name": "fluxjob-219291797245925376-1", "reconcileID": "acf27b14-ae5c-4b71-9658-5bd1e7bfe51a", "error": "Request entity too large: limit is 3145728"}
github.com/NearNodeFlash/nnf-dm/internal/controller.(*DataMovementReconciler).Reconcile.func2
    /workspace/internal/controller/datamovement_controller.go:382

This results in the workflow getting stuck in DataIn because the NnfDataMovement resource is stuck in Running even though it has finished and failed.

Additionally, flux does not seem to handle this case when canceling the workflow. The flux job is removed but the workflow gets orphaned. It does not appear to transition the workflow to Teardown.


github-project-automation bot added this to Issues Dashboard Jan 16, 2025

github-project-automation bot moved this to 📋 Open in Issues Dashboard Jan 16, 2025

bdevcich added the bug label Jan 16, 2025
