Auto-resume for deep learning training is not working #1474

polyaxon-team · 2022-04-07T11:05:01Z

From Slack

I'm trying to run a training job and make it resume automatically whenever it is preempted or encounters an issue.
For this, I'm using the "termination" section with the "maxRetries" field to restart the job.
After a problem occurs, the job is restarted automatically and, judging by the logs, picks up from where the problem happened. However, nothing is saved to the artifacts, and calls to tracking.log_metric don't seem to have any effect. According to the logs, the job then continues until it reaches the end. But instead of finishing, it keeps restarting (from the point where the problem occurred) until all the "maxRetries" are used up, and then fails with the warning "Underlying job has an issue" on the status page.
Any idea what could cause this, and whether there is anything I can do to avoid it?

polyaxon-team · 2022-04-07T11:11:51Z

Polyaxon provides several strategies to restart, restart with copy mode, and resume jobs. The auto-resume behavior is enabled by default.

Note that resuming a job can only work if your code supports loading the last checkpoint.

Here's a quick debugging script to check that the resuming process works as expected:

main.py

import json
import os
import time

from polyaxon import tracking


def main():
    tracking.init()
    # The checkpoint lives in the run's outputs path so it survives a retry.
    checkpoint_path = tracking.get_outputs_path("checkpoint.json")
    checkpoint_path_exists = os.path.exists(checkpoint_path)
    print("[CHECKPOINT] path found: {}".format(checkpoint_path_exists))
    if checkpoint_path_exists:
        # Resume: load the state saved before the last failure.
        with open(checkpoint_path, "r") as checkpoint_file:
            checkpoint = json.loads(checkpoint_file.read())
            print("[CHECKPOINT] last content: {}".format(checkpoint))
    else:
        # First attempt: start from an empty state.
        print("[CHECKPOINT] init ...")
        checkpoint = {
            "last_time": time.time(),
            "last_index": 0,
            "array": [],
        }
    # Continue from the step after the last one recorded in the checkpoint.
    for i in range(checkpoint["last_index"] + 1, 300):
        print("[CHECKPOINT] step {}".format(i))
        tracking.log_progress((i + 1) / 300)
        tracking.log_metric(name="index", value=i, step=i)
        checkpoint["array"].append(i)
        checkpoint["last_index"] = i
        checkpoint["last_time"] = time.time()
        if i in [10, 50]:
            # Save the state, then fail on purpose to trigger a retry.
            print("[CHECKPOINT] Saving last content ...")
            with open(checkpoint_path, "w") as checkpoint_file:
                checkpoint_file.write(json.dumps(checkpoint))
            raise ValueError("Error was raised at {}".format(i))
        time.sleep(1)


if __name__ == "__main__":
    main()

polyaxonfile.yaml

version: 1.1
kind: component
termination:
  maxRetries: 3
run:
  kind: job
  container:
    image: polyaxon/polyaxon-examples:artifacts
    workingDir: "{{ globals.run_artifacts_path }}/uploads"
    command: ["/bin/bash", -c]
    args: ["pip install -U polyaxon --no-cache && python3 main.py"]

This logs a dummy metric that resumes from the last checkpoint, and (apart from the warning regression that I mentioned) the job succeeds after two failures. In the first chart, where the x-axis is time, you can see the gaps left by the restarts.
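
For a real training loop, the same load-or-init pattern could look roughly like the sketch below. It is not part of Polyaxon: `load_checkpoint`, `save_checkpoint`, `run`, and the `save_every` and `fail_at` parameters are hypothetical names chosen for illustration, and the simulated failure stands in for a preemption. The one addition over the debugging script is that the checkpoint is written to a temporary file and then renamed, so a crash in the middle of a save should not leave a truncated file behind.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path, "r") as checkpoint_file:
            return json.load(checkpoint_file)
    return {"last_index": 0, "array": []}


def save_checkpoint(path, state):
    """Write to a temporary file first, then rename it over the checkpoint."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp_file:
        json.dump(state, tmp_file)
    os.replace(tmp_path, path)


def run(path, total_steps, save_every, fail_at=None):
    """Resume from the checkpoint at `path`; optionally fail at step `fail_at`."""
    state = load_checkpoint(path)
    for i in range(state["last_index"] + 1, total_steps):
        state["array"].append(i)  # stand-in for one training step
        state["last_index"] = i
        if i % save_every == 0:
            save_checkpoint(path, state)
        if i == fail_at:
            # Hypothetical failure used only to exercise the resume path.
            raise RuntimeError("simulated failure at step {}".format(i))
    save_checkpoint(path, state)
    return state
```

On Polyaxon, `path` would be the value returned by tracking.get_outputs_path("checkpoint.json"), as in main.py above. Steps completed after the last save are lost on failure and are simply redone on the next attempt, so `save_every` trades checkpoint overhead against repeated work.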