Auto-resume for deep learning training is not working #1474
-
I'm trying to run a training job and make it resume automatically whenever it is preempted or it encounters an issue. |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment
-
Polyaxon provides several strategies to restart, restrat with copy mode, and resumes jobs. The auto-resume behavior is enabled by default Note that resuming a job can only work if your code supports loading the last checkpoint. Here's a quick debugging logic to check that the resuming process works as expected:
def main():
tracking.init()
checkpoint_path = tracking.get_outputs_path("checkpoint.json")
checkpoint_path_exists = os.path.exists(checkpoint_path)
print("[CHECKPOINT] path found: {}".format(checkpoint_path_exists))
if checkpoint_path_exists:
with open(checkpoint_path, "r") as checkpoint_file:
checkpoint = json.loads(checkpoint_file.read())
print("[CHECKPOINT] last content: {}".format(checkpoint))
else:
print("[CHECKPOINT] init ...")
checkpoint = {
"last_time": time.time(),
"last_index": 0,
"array": [],
}
for i in range(checkpoint["last_index"] + 1, 300):
print("[CHECKPOINT] step {}".format(i))
tracking.log_progress((i + 1)/300)
tracking.log_metric(name="index", value=i, step=i)
checkpoint["array"].append(i)
checkpoint["last_index"] = i
checkpoint["last_time"] = time.time()
if i in [10, 50]:
print("[CHECKPOINT] Saving last content ...")
with open(checkpoint_path, "w") as checkpoint_file:
checkpoint_file.write(json.dumps(checkpoint))
raise ValueError("Error was raised at {}".format(i))
time.sleep(1)
version: 1.1
kind: component
termination:
maxRetries: 3
run:
kind: job
container:
image: polyaxon/polyaxon-examples:artifacts
workingDir: "{{ globals.run_artifacts_path }}/uploads"
command: ["/bin/bash", -c]
args: ["pip install -U polyaxon --no-cache && python3 main.py"] |
Beta Was this translation helpful? Give feedback.
Polyaxon provides several strategies to restart, restrat with copy mode, and resumes jobs. The auto-resume behavior is enabled by default
Note that resuming a job can only work if your code supports loading the last checkpoint.
Here's a quick debugging logic to check that the resuming process works as expected:
main.py