No error or crash, just a never-ending pause #20523

Open
KeesariVigneshwarReddy opened this issue Dec 29, 2024 · 1 comment
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x

@KeesariVigneshwarReddy

Bug description

Notebook - https://www.kaggle.com/code/vigneshwar472/baseline-residualunetse3d
GitHub repo (ResidualUNetSE3D implementation) - https://github.com/wolny/pytorch-3dunet/tree/master

Issue - Training starts on 1x P100 GPU, but it does not start on 2x T4 GPUs.

I want to use 2 GPUs simultaneously for training (ddp_notebook strategy), but training never starts and neither of the 2 GPUs shows any utilization.

I have no idea why it is not working.

See the error messages and logs section below.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

Go to the Kaggle notebook: https://www.kaggle.com/code/vigneshwar472/baseline-residualunetse3d-train

Click "Copy & Edit"
Run All
You will encounter a never-ending pause

Error messages and logs

When used with 2x T4 GPUs

from torch.utils.data import DataLoader
from lightning.pytorch import Trainer            # Lightning 2.x import path
from lightning.pytorch.loggers import CSVLogger

# ResidualUNetSE3D is the pytorch-3dunet implementation linked above;
# CZIILightningModule and `folds` (a list of (train_dataset, val_dataset)
# pairs) are defined in earlier notebook cells.

n = len(folds)

for i in range(n):
    print(f'fold {i} started....')
    model = ResidualUNetSE3D(in_channels=1, out_channels=6)
    lm = CZIILightningModule(model=model)
    logger = CSVLogger(save_dir='/kaggle/working/training_results', name=f'fold_{i}')
    trainer = Trainer(accelerator='gpu',
                      strategy='ddp_notebook',   # hangs here on 2x T4; works with a single GPU
                      devices=2,
                      precision='32',
                      gradient_clip_val=None,
                      logger=logger,
                      max_epochs=15,
                      enable_checkpointing=True,
                      enable_progress_bar=True,
                      enable_model_summary=False,
                      inference_mode=True,
                      default_root_dir='/kaggle/working/training_results',
                      num_sanity_val_steps=0)
    trainer.fit(model=lm,
                train_dataloaders=DataLoader(folds[i][0], batch_size=1, num_workers=4, shuffle=True),
                val_dataloaders=DataLoader(folds[i][1], batch_size=1, num_workers=4, shuffle=False))
    del model, lm, logger, trainer
    print(f'fold {i} completed....')
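
A minimal diagnostic sketch, assuming the hang happens while NCCL is setting up the process group across the two T4s (the environment variable names are standard NCCL/PyTorch ones, but whether they reveal anything in this case is an assumption). Set them in the first notebook cell, before the Trainer is created, so the logs show where initialization stalls:

import os

# Diagnostic settings only -- they do not change training behaviour.
os.environ["NCCL_DEBUG"] = "INFO"                 # print NCCL init/rendezvous progress
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"      # limit the log to init and transport
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed consistency checks
# If the log shows both ranks stuck exchanging P2P info, disabling peer-to-peer
# transfers between the T4s is one thing to test (an assumption, not a confirmed fix):
# os.environ["NCCL_P2P_DISABLE"] = "1"

If NCCL never prints anything at all, the hang is more likely in process spawning (ddp_notebook forks the extra rank inside the notebook kernel) than in communication.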

[Screenshot 2024-12-29 092937: output when running on 2x T4 GPUs]

When used with 1x P100 GPU


[Screenshot 2024-12-29 093405: output when running on 1x P100 GPU]

Environment

Please go to the Kaggle notebook and run it

More info

No response

@KeesariVigneshwarReddy added the bug and needs triage labels on Dec 29, 2024
@KeesariVigneshwarReddy (Author)

Maybe there is some serious issue with strategy='ddp_notebook'
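
A minimal repro sketch to test that guess, assuming Lightning 2.x import paths and its built-in BoringModel/RandomDataset demo classes: if this also hangs on 2x T4 with strategy='ddp_notebook', the problem is in the strategy / NCCL setup rather than in the model or the dataloaders.

from torch.utils.data import DataLoader
from lightning.pytorch import Trainer
from lightning.pytorch.demos.boring_classes import BoringModel, RandomDataset

# Dummy model + random data: isolates strategy='ddp_notebook' from the real workload.
model = BoringModel()
train_loader = DataLoader(RandomDataset(32, 64), batch_size=2)

trainer = Trainer(accelerator='gpu',
                  devices=2,
                  strategy='ddp_notebook',
                  max_epochs=1,
                  limit_train_batches=5,
                  logger=False,
                  enable_checkpointing=False)
trainer.fit(model, train_loader)

If this toy run completes, the next suspect would be something in the real dataloaders or model setup that only one rank reaches.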
