[Bug] Inconsistent usage of `file_client` in `CheckpointHook`. The previous best checkpoints are not removed for file backends other than `disk`. #1084

tyomj · 2023-04-18T12:30:01Z

Prerequisite

I have searched Issues and Discussions but cannot get the expected help.
The bug has not been fixed in the latest version(https://github.com/open-mmlab/mmengine).

Environment

OrderedDict([('sys.platform', 'linux'), ('Python', '3.8.16 (default, Mar 23 2023, 17:40:49) [GCC 10.2.1 20210110]'), ('CUDA available', True), ('numpy_random_seed', 2147483648), ('GPU 0', 'Tesla V100-SXM2-16GB'), ('CUDA_HOME', None), ('GCC', 'gcc (Debian 10.2.1-6) 10.2.1 20210110'), ('PyTorch', '2.0.0+cu117'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.7\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86\n - CuDNN 8.5\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.7, CUDNN_VERSION=8.5.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.15.1+cu117'), ('OpenCV', '4.7.0'), ('MMEngine', '0.7.2')])

Reproduces the problem - code sample

Config example
The example uses aws backend implemeted in this PR. One may test with Petrel backend as well to reproduce.

# Client args
# ...
beargs = dict(
    backend="aws",
    path_mapping=dict(
        {
            "./work_dirs": f"aws://bucket_name/{run_id}",
            "/app/mmdetection3d/work_dirs": f"aws://bucket_name/{run_id}",
        }
    ),
)
# ...
# Define CheckpointHook
default_hooks = dict(
    checkpoint=dict(
        type="CheckpointHook",
        save_best='some_key',
        max_keep_ckpts=1,
        backend_args=beargs
    ),
)

Reproduces the problem - command or script

python tools/train.py ../config.py

Reproduces the problem - error message

Unexpected behaviour.

Additional information

Expected behavior:
The system should save only one best checkpoint at a time, and if a new checkpoint becomes better than the previous one, the old checkpoint should be removed.
Current behavior:
The system saves a new checkpoint in addition to the previous one, which results in multiple checkpoints being saved.
The reason:
The issue arises due to the wrong simultaneous definition of backend_args and file_client_args. If backend_args is defined, the file_client is set to default to disk. The file_backend is used throughout the code, except in _save_best_checkpoint. Hence, when the system tries to locate the previous checkpoint at this location, it searches on the local drive.
A proposed solution:
The simultaneous definition of backend_args and file_client_args at this location should be changed here
to

        if self.file_client_args is None:
            self.file_backend = get_file_backend(
                self.out_dir, backend_args=self.backend_args)
            self.file_client = self.file_backend
        else:
            self.file_client = FileClient.infer_client(self.file_client_args,
                                                   self.out_dir)
            self.file_backend = self.file_client

Additionally, self.file_client should be replaced with self.file_backend at this location here to ensure that the system locates the previous checkpoint correctly.

The text was updated successfully, but these errors were encountered:

HAOCHENYE · 2023-04-19T06:50:07Z

Thanks for your feedback, I've tried to fix this Bug in #1086, it could be helpful if you can review this PR 😄

tyomj added the bug Something isn't working label Apr 18, 2023

tyomj changed the title ~~[Bug] Inconsistent usage of file_client in CheckpointHook~~ [Bug] Inconsistent usage of file_client in CheckpointHook. The previous best checkpoints are not removed for file backends other than disk. Apr 18, 2023

HAOCHENYE mentioned this issue Apr 19, 2023

[Fix] Failed to remove the previous best checkpoints #1086

Merged

zhouzaida closed this as completed in #1086 Apr 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] Inconsistent usage of `file_client` in `CheckpointHook`. The previous best checkpoints are not removed for file backends other than `disk`. #1084

[Bug] Inconsistent usage of `file_client` in `CheckpointHook`. The previous best checkpoints are not removed for file backends other than `disk`. #1084

tyomj commented Apr 18, 2023

HAOCHENYE commented Apr 19, 2023

[Bug] Inconsistent usage of file_client in CheckpointHook. The previous best checkpoints are not removed for file backends other than disk. #1084

[Bug] Inconsistent usage of file_client in CheckpointHook. The previous best checkpoints are not removed for file backends other than disk. #1084

Comments

tyomj commented Apr 18, 2023

Prerequisite

Environment

Reproduces the problem - code sample

Reproduces the problem - command or script

Reproduces the problem - error message

Additional information

HAOCHENYE commented Apr 19, 2023

[Bug] Inconsistent usage of `file_client` in `CheckpointHook`. The previous best checkpoints are not removed for file backends other than `disk`. #1084

[Bug] Inconsistent usage of `file_client` in `CheckpointHook`. The previous best checkpoints are not removed for file backends other than `disk`. #1084