num_labeled in DistributedDataParallel #5

Open
LiheYoung opened this issue Mar 27, 2020 · 9 comments

@LiheYoung

When using DistributedDataParallel, if N labeled training images and K GPUs are used, should we set num_labeled = N / K instead of N, since np.random.shuffle(idx) generates different indices in different processes?
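
To make the concern concrete, here is a minimal standalone sketch (not the repository's code; the names just mirror the snippet quoted later in this thread). It simulates K processes each drawing its own labeled split with an independent RNG, the way unseeded DDP workers would, and counts how many distinct labeled images end up being used in total.

import numpy as np

num_classes = 10
num_labeled = 250                                   # N, as passed via --num-labeled
label_per_class = num_labeled // num_classes
labels = np.random.randint(0, num_classes, 50000)   # stand-in for the CIFAR-10 label array

def labeled_split(rng):
    """Mimics the per-class labeled selection done independently in each process."""
    labeled_idx = []
    for i in range(num_classes):
        idx = np.where(labels == i)[0]
        rng.shuffle(idx)
        labeled_idx.extend(idx[:label_per_class])
    return set(labeled_idx)

K = 4  # number of GPUs, i.e. DDP processes
splits = [labeled_split(np.random.RandomState(rank)) for rank in range(K)]  # different RNG state per process
print(len(set().union(*splits)))  # roughly K * num_labeled = 1000 distinct labeled images, not 250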

@bkj

bkj commented Sep 17, 2020

I think @LiheYoung is correct -- w/ DistributedDataParallel you launch one copy of the program per GPU. If you don't set the seed, then np.random samples the labeled set differently in each copy, and w/ 4 GPUs you end up w/ roughly 4 * N labeled samples instead of the N you're expecting.

This should be easy to test by running w/o a seed and w/ a seed -- if I'm right, using a seed will reduce the performance substantially. Running this experiment now; I'll report results here.

@bkj

bkj commented Sep 18, 2020

In my experiment, performance w/o the seed is substantially better than w/ a seed. I only ran once, so perhaps this is random variation, but I'm guessing this is due to the issue @LiheYoung and I pointed out above.

[Screenshot: performance curves for the two runs.]

Red line is w/o seed, blue w/ seed.

Edit: This is for

python -m torch.distributed.launch --nproc_per_node 4 train.py \
    --dataset        cifar10 \
    --num-labeled    250 \
    --arch           wideresnet \
    --batch-size     16 \
    --lr             0.03

@chongruo

chongruo commented Oct 12, 2020

I think @LiheYoung is correct -- w/ DistributedDataParallel you launch one copy of the program per GPU. If you don't set the seed, then np.random samples the labeled set differently in each copy, and w/ 4 GPUs you end up w/ roughly 4 * N labeled samples instead of the N you're expecting.
This should be easy to test by running w/o a seed and w/ a seed -- if I'm right, using a seed will reduce the performance substantially. Running this experiment now; I'll report results here.

That's right. I should have mentioned that you should not use a seed.

Sorry, I'm a little confused. Should we set the seed?

I think @LiheYoung is correct. With K GPUs, the actual number of labeled samples is K*N rather than N. So should we set num_labeled to N/K, or set the same seed on all GPUs?

@chongruo

for i in range(num_classes):
    idx = np.where(labels == i)[0]
    np.random.shuffle(idx)
    labeled_idx.extend(idx[:label_per_class])
    unlabeled_idx.extend(idx[:])

For each GPU, the corresponding process creates its own CIFAR dataset. Since we don't set a fixed seed, idx is shuffled (line 104) differently on each GPU, which results in more labeled samples than intended.
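
One possible fix, sketched here under the assumption that only the split needs to be deterministic (this is not the repository's actual code; the function and argument names are illustrative): give the labeled/unlabeled split its own RNG seeded identically on every rank, so all processes select the same labeled indices while augmentation and batch shuffling can stay unseeded.

import numpy as np

def split_with_shared_seed(labels, num_classes, label_per_class, split_seed=42):
    """Labeled/unlabeled split that is identical on every rank because the
    shuffle uses a dedicated RNG with a seed shared by all processes."""
    rng = np.random.RandomState(split_seed)   # same seed everywhere -> same shuffle order
    labeled_idx, unlabeled_idx = [], []
    for i in range(num_classes):
        idx = np.where(labels == i)[0]
        rng.shuffle(idx)                       # identical on all ranks
        labeled_idx.extend(idx[:label_per_class])
        unlabeled_idx.extend(idx)              # all images stay in the unlabeled pool
    return np.array(labeled_idx), np.array(unlabeled_idx)

Seeding just the split keeps the effective labeled-set size at N without forcing the whole training run to be deterministic.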

@zhifanwu

I think @LiheYoung is correct -- w/ DistributedDataParallel you launch one copy of the program per GPU. If you don't set the seed, then np.random samples the labeled set differently in each copy, and w/ 4 GPUs you end up w/ roughly 4 * N labeled samples instead of the N you're expecting.
This should be easy to test by running w/o a seed and w/ a seed -- if I'm right, using a seed will reduce the performance substantially. Running this experiment now; I'll report results here.

That's right. I should have mentioned that you should not use a seed.

Sorry, I'm a little confused. Should we set the seed?

I think @LiheYoung is correct. With K GPUs, the actual number of labeled samples is K*N rather than N. So should we set num_labeled to N/K, or set the same seed on all GPUs?

That's right. If you print the indices, you will see that K different sets are generated, so the labeled data is actually K times what you set. So we should either set num_labeled to N/K or set the same seed on all GPUs.
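
An alternative sketch, assuming torch.distributed is already initialized and a PyTorch version that provides broadcast_object_list (>= 1.7; the function name below is illustrative, not from this repository): build the split on rank 0 and broadcast it, so every rank trains on exactly the same num_labeled images regardless of seeding.

import numpy as np
import torch.distributed as dist

def broadcast_split(labels, num_classes, label_per_class):
    """Builds the labeled/unlabeled split on rank 0 and broadcasts it,
    so every process trains on exactly the same labeled images."""
    if dist.get_rank() == 0:
        labeled_idx, unlabeled_idx = [], []
        for i in range(num_classes):
            idx = np.where(labels == i)[0]
            np.random.shuffle(idx)
            labeled_idx.extend(idx[:label_per_class])
            unlabeled_idx.extend(idx)
        payload = [labeled_idx, unlabeled_idx]
    else:
        payload = [None, None]                  # placeholders, filled by the broadcast
    dist.broadcast_object_list(payload, src=0)  # ranks != 0 receive rank 0's split
    return np.array(payload[0]), np.array(payload[1])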

@zhifanwu

zhifanwu commented Nov 21, 2020

I think @LiheYoung is correct -- w/ DistributedDataParallel you launch one copy of the program per GPU. If you don't set the seed, then np.random samples the labeled set differently in each copy, and w/ 4 GPUs you end up w/ roughly 4 * N labeled samples instead of the N you're expecting.
This should be easy to test by running w/o a seed and w/ a seed -- if I'm right, using a seed will reduce the performance substantially. Running this experiment now; I'll report results here.

That's right. I should have mentioned that you should not use a seed.

I think there is a bug in the DDP training implementation -- please see the discussion above.

@kekmodel
Owner

kekmodel commented Dec 11, 2020

Will it be solved by using a seed?

@moucheng2017

for i in range(num_classes):
    idx = np.where(labels == i)[0]
    np.random.shuffle(idx)
    labeled_idx.extend(idx[:label_per_class])
    unlabeled_idx.extend(idx[:])

For each GPU, the corresponding process creates its own CIFAR dataset. Since we don't set a fixed seed, idx is shuffled (line 104) differently on each GPU, which results in more labeled samples than intended.

Is it necessary to use 4 GPUs to reproduce the results with 40 labels?

@moucheng2017

In my experiment, performance w/o the seed is substantially better than w/ a seed. I only ran once, so perhaps this is random variation, but I'm guessing this is due to the issue @LiheYoung and I pointed out above.

[Screenshot: performance curves for the two runs.]

Red line is w/o seed, blue w/ seed.

Edit: This is for

python -m torch.distributed.launch --nproc_per_node 4 train.py \
    --dataset        cifar10 \
    --num-labeled    250 \
    --arch           wideresnet \
    --batch-size     16 \
    --lr             0.03

Hey! I am wondering: could you reproduce the results with 1 GPU?
