We had a case where Data Movement took 5h45m before running out of capacity. In this case, the capacity issue was expected because the source file was much larger than the requested rabbit capacity.
The transfer rate was very slow (837.661 MiB/s). This resulted in lots of progress output messages (via `dcp --progress 1`). When the data movement failed once it reached capacity, the output was so large that the k8s resource could not be updated to indicate that the data movement had failed: the controller tried to set the error message to that huge amount of output.
These are the last few lines of output from `dcp`. About 20,630 lines precede them, since `dcp --progress 1` prints progress every second; this is done so that the data movement controller can parse those lines to determine progress.
```
<snipped>
[2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20631.621 secs (837.662 MiB/s) 14418 secs left ...
[2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20632.588 secs (837.661 MiB/s) 14418 secs left ...
ABORT: rank X on HOST: Failed to write file /mnt/nnf/96ccf8e2-af13-451d-b9a4-e2f3fe74b77f-0/testfile errno=28 (No space left on device) @ /deps/mpifileutils/src/common/mfu_io.c:1055
[nnf-dm-controller-manager-86b974b9c4-wfwwz:00069] [[58362,0],0] ORTE_ERROR_LOG: Data unpack failed in file util/show_help.c at line 501
```
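For context, extracting progress from these lines only needs a regex over each line of output. This is a minimal sketch of that idea, not the actual nnf-dm implementation; the pattern and function names here are hypothetical:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// progressRe matches dcp --progress output such as:
//   [2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20632.588 secs (837.661 MiB/s) 14418 secs left ...
// capturing the percent-complete value. Illustrative pattern only.
var progressRe = regexp.MustCompile(`Copied\s+\S+\s+\S+\s+\((\d+)%\)`)

// parseProgress returns the percentage from a single dcp output line,
// with ok=false for lines that are not progress updates.
func parseProgress(line string) (percent int, ok bool) {
	m := progressRe.FindStringSubmatch(line)
	if m == nil {
		return 0, false
	}
	p, err := strconv.Atoi(m[1])
	return p, err == nil
}

func main() {
	line := "[2025-01-16T04:08:33] Copied 16.482 TiB (59%) in 20632.588 secs (837.661 MiB/s) 14418 secs left ..."
	if p, ok := parseProgress(line); ok {
		fmt.Printf("progress: %d%%\n", p) // prints: progress: 59%
	}
}
```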
After the ABORT message above, we then see errors from the controller because it tries to set the error message using this giant wall of output.
This results in the workflow getting stuck in DataIn because the `NnfDataMovement` resource is stuck in Running even though it has finished and failed.
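One possible mitigation (a sketch only; the cap value and helper name are assumptions, not taken from nnf-dm) is to truncate the captured output before writing it into the resource status, keeping the tail where dcp's ABORT/error lines appear rather than the thousands of per-second progress lines:

```go
package main

import (
	"fmt"
	"strings"
)

// maxMessageLen is an illustrative cap chosen to stay far below the
// default etcd request size limit (~1.5 MiB); the real controller may
// need a different value.
const maxMessageLen = 1024

// truncateMessage keeps only the tail of an oversized message. For dcp
// output the final ABORT/error lines are the useful part; the preceding
// progress lines are noise.
func truncateMessage(msg string) string {
	if len(msg) <= maxMessageLen {
		return msg
	}
	return "...(truncated)...\n" + msg[len(msg)-maxMessageLen:]
}

func main() {
	// Stand-in for ~20k progress lines followed by the ABORT message.
	output := strings.Repeat("[...] Copied ... secs left ...\n", 20630) +
		"ABORT: rank X on HOST: Failed to write file ... errno=28 (No space left on device)"
	short := truncateMessage(output)
	fmt.Printf("%d -> %d bytes\n", len(output), len(short))
}
```

Keeping the tail rather than the head preserves the actual failure reason while letting the status update succeed.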
Additionally, Flux does not seem to handle this case when canceling the workflow: the Flux job is removed, but the workflow gets orphaned and does not appear to transition to Teardown.