Auto-resume for deep learning training is not working #1474

Polyaxon provides several strategies to restart jobs, restart them with copy mode, and resume them. The auto-resume behavior is enabled by default.

Note that resuming a job can only work if your code supports loading the last checkpoint.

Here's a quick debugging snippet to check that the resuming process works as expected:

  • main.py
import json
import os

from polyaxon import tracking


def main():
    tracking.init()
    checkpoint_path = tracking.get_outputs_path("checkpoint.json")
    checkpoint_path_exists = os.path.exists(checkpoint_path)
    print("[CHECKPOINT] path found: {}".format(checkpoint_path_exists))
    if checkpoint_path_exists:
        with open(checkpoint_path, "r") as checkpoint_file:
            checkpoint = json.loads(checkpoint_file.r…
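The snippet above is cut off in this copy of the thread. As a rough, self-contained sketch of the same load-or-start-fresh pattern, the example below uses only the standard library. The function names (`load_checkpoint`, `save_checkpoint`, `train`), the `{"step": ...}` checkpoint layout, and the `stop_after` parameter are all hypothetical and not part of Polyaxon; in a real job the checkpoint path would presumably come from `tracking.get_outputs_path(...)` as in the snippet above.

```python
import json
import os


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "r") as checkpoint_file:
        return json.load(checkpoint_file)


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a kill mid-write leaves no partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as checkpoint_file:
        json.dump(state, checkpoint_file)
    os.replace(tmp_path, path)


def train(checkpoint_path, total_steps, stop_after=None):
    """Run (or resume) a dummy loop and return the last completed step."""
    checkpoint = load_checkpoint(checkpoint_path)
    print("[CHECKPOINT] path found: {}".format(checkpoint is not None))
    # Resume right after the last saved step, or start from zero.
    start = checkpoint["step"] + 1 if checkpoint else 0
    last_step = start - 1
    for step in range(start, total_steps):
        # Real training work would go here (hypothetical placeholder).
        save_checkpoint(checkpoint_path, {"step": step})
        last_step = step
        if stop_after is not None and step == stop_after:
            break  # simulate the job being stopped or preempted
    return last_step
```

Calling `train(path, 10, stop_after=3)` and then `train(path, 10)` with the same path should show the second call picking up at step 4 rather than step 0, which is the behavior to look for in the job logs after a resume.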

Answer selected by polyaxon-team
Category
Q&A