No error or crash, just a never-ending pause #20523

Open
KeesariVigneshwarReddy opened this issue Dec 29, 2024 · 1 comment
Labels: bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.4.x

@KeesariVigneshwarReddy

Bug description

Notebook - https://www.kaggle.com/code/vigneshwar472/baseline-residualunetse3d
GitHub repo (ResidualUNetSE3D implementation) - https://github.com/wolny/pytorch-3dunet/tree/master

Issue - Training starts on 1x P100 GPU, but it does not start on 2x T4 GPUs.

I want to use 2 GPUs simultaneously for training (ddp_notebook strategy), but training never starts and neither of the 2 GPUs shows any utilization.

I have no idea why it is not working.

See the error messages and logs section below.

What version are you seeing the problem on?

v2.4

How to reproduce the bug

Go to the Kaggle notebook: https://www.kaggle.com/code/vigneshwar472/baseline-residualunetse3d-train

Click "Copy & Edit"
Run All
You will encounter a never-ending pause

Error messages and logs

When used with 2x T4 GPUs

from torch.utils.data import DataLoader
from lightning.pytorch import Trainer            # Lightning 2.x import path
from lightning.pytorch.loggers import CSVLogger

# ResidualUNetSE3D is the pytorch-3dunet implementation linked above;
# CZIILightningModule and `folds` (a list of (train_dataset, val_dataset)
# pairs) are defined in earlier notebook cells.

n = len(folds)

for i in range(n):
    print(f'fold {i} started....')
    model = ResidualUNetSE3D(in_channels=1, out_channels=6)
    lm = CZIILightningModule(model=model)
    logger = CSVLogger(save_dir='/kaggle/working/training_results', name=f'fold_{i}')
    trainer = Trainer(accelerator='gpu',
                      strategy='ddp_notebook',   # hangs here on 2x T4; works with a single GPU
                      devices=2,
                      precision='32',
                      gradient_clip_val=None,
                      logger=logger,
                      max_epochs=15,
                      enable_checkpointing=True,
                      enable_progress_bar=True,
                      enable_model_summary=False,
                      inference_mode=True,
                      default_root_dir='/kaggle/working/training_results',
                      num_sanity_val_steps=0)
    trainer.fit(model=lm,
                train_dataloaders=DataLoader(folds[i][0], batch_size=1, num_workers=4, shuffle=True),
                val_dataloaders=DataLoader(folds[i][1], batch_size=1, num_workers=4, shuffle=False))
    del model, lm, logger, trainer
    print(f'fold {i} completed....')
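
A minimal diagnostic sketch, assuming the hang happens while NCCL is setting up the process group across the two T4s (the environment variable names are standard NCCL/PyTorch ones, but whether they reveal anything in this case is an assumption). Set them in the first notebook cell, before the Trainer is created, so the logs show where initialization stalls:

import os

# Diagnostic settings only -- they do not change training behaviour.
os.environ["NCCL_DEBUG"] = "INFO"                 # print NCCL init/rendezvous progress
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"      # limit the log to init and transport
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra torch.distributed consistency checks
# If the log shows both ranks stuck exchanging P2P info, disabling peer-to-peer
# transfers between the T4s is one thing to test (an assumption, not a confirmed fix):
# os.environ["NCCL_P2P_DISABLE"] = "1"

If NCCL never prints anything at all, the hang is more likely in process spawning (ddp_notebook forks the extra rank inside the notebook kernel) than in communication.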

[Screenshot 2024-12-29 092937: output when running on 2x T4 GPUs]

When used with 1x P100 GPU


[Screenshot 2024-12-29 093405: output when running on 1x P100 GPU]

Environment

Please go to the Kaggle notebook and run it

More info

No response

@KeesariVigneshwarReddy added the bug and needs triage labels on Dec 29, 2024
@KeesariVigneshwarReddy (Author)

Maybe there is some serious issue with strategy='ddp_notebook'
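
A minimal repro sketch to test that guess, assuming Lightning 2.x import paths and its built-in BoringModel/RandomDataset demo classes: if this also hangs on 2x T4 with strategy='ddp_notebook', the problem is in the strategy / NCCL setup rather than in the model or the dataloaders.

from torch.utils.data import DataLoader
from lightning.pytorch import Trainer
from lightning.pytorch.demos.boring_classes import BoringModel, RandomDataset

# Dummy model + random data: isolates strategy='ddp_notebook' from the real workload.
model = BoringModel()
train_loader = DataLoader(RandomDataset(32, 64), batch_size=2)

trainer = Trainer(accelerator='gpu',
                  devices=2,
                  strategy='ddp_notebook',
                  max_epochs=1,
                  limit_train_batches=5,
                  logger=False,
                  enable_checkpointing=False)
trainer.fit(model, train_loader)

If this toy run completes, the next suspect would be something in the real dataloaders or model setup that only one rank reaches.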
