Training in C++? #8

Open
meet-minimalist opened this issue Jan 16, 2021 · 5 comments
Labels
question Further information is requested

Comments

@meet-minimalist

Hey, you have done amazing work on training NNs in C++.
I would like to know why you started training models in C++ instead of Python. Once the model definition has been written with PyTorch in Python and the data pipeline has been set up, all the remaining computation is done on the GPU, so there won't be a drastic performance gain from migrating from Python to C++. Please share some of your thoughts on this.

@koba-jon
Owner

koba-jon commented Jan 16, 2021

I was interested in implementing NN programs in C++, and I wanted to improve my C++ coding ability, so I decided to write this code.
However, I have also investigated how much the speed differs between Python and C++.

I found a strange result: there are cases in which training runs faster in Python than in C++.
Here, the batch size, image size, and almost all other components are matched between Python and C++.
The NN model used for training and testing in C++:
https://github.com/koba-jon/pytorch_cpp/tree/master/Dimensionality_Reduction/AE2d
My article for details (Japanese only):
https://qiita.com/koba-jon/items/274e5e4970da72216f73

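| | CPU only (Core i7-8700): Python | CPU only (Core i7-8700): C++ | With GPU (GeForce GTX 1070): Python | With GPU (GeForce GTX 1070): C++, nondeterministic (cudnn) | With GPU (GeForce GTX 1070): C++, deterministic (cudnn) |
|---|---|---|---|---|---|
| Training time [time/epoch] | 1h04m49s | 1h03m00s | 5m53s | 7m42s | 17m36s |
| GPU memory [MiB] | 2 | 9 | 933 | 913 | 2941 |
| Testing speed [seconds/data] | 0.01189 | 0.01477 | 0.00102 | 0.00101 | 0.00101 |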

Training in C++ is slow when the GPU is used.
The delay has been traced to the "forward" and "backward" parts, so it is not caused by the code I wrote.
Training is still faster than in the CPU-only case, so CUDA does appear to be in use.
However, I guess that PyTorch's GPU code paths may differ between Python and C++.

I have heard reports that the speed in C++ is faster when training only fully connected layers.
I plan to investigate this matter around March.

@meet-minimalist
Author

Huge thanks for these interesting insights. I suspect the high training time in C++ could be caused by an under-optimized data pipeline, or by under-optimized, serialized CPU-to-GPU and GPU-to-CPU data copies instead of parallel asynchronous copies. Still, further investigation might help. I was also curious about this because many people used to train models in C++, and I wonder what on earth forced them to do this. :-P

@koba-jon
Owner

koba-jon commented Mar 29, 2021

This is a follow-up report.

I benchmarked again using the following three kinds of neural networks in PyTorch v1.8.0.
I measured the training speed in iterations per second.

My article for details (Japanese only): https://qiita.com/koba-jon/items/59a64c6ec38ac7286d6b
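
As a rough illustration of what an iterations-per-second measurement involves, here is a minimal Python sketch. It is not the benchmark code behind the numbers in this thread: the function name `iterations_per_second`, the `sync` hook, and the dummy workload are all assumptions made for illustration. For GPU runs, a synchronization call such as `torch.cuda.synchronize` would typically be passed as `sync`, because CUDA kernels are launched asynchronously and the timer would otherwise stop too early.

```python
import time
from typing import Callable, Optional


def iterations_per_second(
    step: Callable[[], None],
    n_iters: int = 100,
    warmup: int = 10,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Return the average number of `step` calls completed per second.

    `step` is a hypothetical stand-in for one training iteration
    (forward, backward, optimizer update). `sync` is an optional hook
    that waits for outstanding asynchronous work to finish.
    """
    # Warm-up iterations are run but not timed.
    for _ in range(warmup):
        step()
    if sync is not None:
        sync()

    start = time.perf_counter()
    for _ in range(n_iters):
        step()
    if sync is not None:
        sync()  # make sure all timed work has really finished
    elapsed = time.perf_counter() - start
    return n_iters / elapsed


def dummy_step() -> None:
    # Placeholder workload standing in for a real training iteration.
    sum(i * i for i in range(10_000))


rate = iterations_per_second(dummy_step, n_iters=50, warmup=5)
print(f"{rate:.1f} iteration/s")
```

Excluding warm-up iterations matters most on the GPU, where the first few iterations can include one-time setup costs that would distort a short measurement.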

  1. Only Fully Connected Layers
    Model: AE1d

| | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cudnn: deterministic | GPU (GeForce GTX 1070), cudnn: non-deterministic |
|---|---|---|---|
| Python [iteration/s] | 86.83 | 97.69 | 97.69 |
| C++ [iteration/s] | 312.6 | 312.6 | 312.6 |
| Speed Up (Python -> C++) | ×3.6 | ×3.2 | ×3.2 |

  2. Only Convolutional Layers
    Model: Discriminator

| | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cudnn: deterministic | GPU (GeForce GTX 1070), cudnn: non-deterministic |
|---|---|---|---|
| Python [iteration/s] | 5.24 | 27.59 | 39.08 |
| C++ [iteration/s] | 4.51 | 26.8 | 36.08 |
| Speed Up (Python -> C++) | ×0.86 | ×0.97 | ×0.92 |

  3. Convolutional and Transposed Convolutional Layers
    Model: AE2d

| | CPU (Core i7-8700) | GPU (GeForce GTX 1070), cudnn: deterministic | GPU (GeForce GTX 1070), cudnn: non-deterministic |
|---|---|---|---|
| Python [iteration/s] | 1.14 | 9.56 | 14.39 |
| C++ [iteration/s] | 1.05 | 9.16 | 13.44 |
| Speed Up (Python -> C++) | ×0.92 | ×0.96 | ×0.93 |
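
The "Speed Up" rows can be reproduced from the iteration/s figures as the ratio of C++ throughput to Python throughput. The sketch below recomputes them from the reported numbers; the dictionary layout, the names `RESULTS`, `speed_up`, and `speed_up_row`, and the column order (CPU, GPU deterministic, GPU non-deterministic) are assumptions made for illustration.

```python
# Iteration/s figures copied from the three benchmark tables in this comment.
# Assumed column order: CPU, GPU (cudnn deterministic), GPU (cudnn non-deterministic).
COLUMNS = ("CPU", "GPU deterministic", "GPU non-deterministic")
RESULTS = {
    "AE1d": {"python": (86.83, 97.69, 97.69), "cpp": (312.6, 312.6, 312.6)},
    "Discriminator": {"python": (5.24, 27.59, 39.08), "cpp": (4.51, 26.8, 36.08)},
    "AE2d": {"python": (1.14, 9.56, 14.39), "cpp": (1.05, 9.16, 13.44)},
}


def speed_up(python_rate: float, cpp_rate: float) -> float:
    """Ratio of C++ to Python throughput; a value above 1 means C++ is faster."""
    return cpp_rate / python_rate


def speed_up_row(model: str) -> list:
    """Speed-up for every column of one model's table."""
    rates = RESULTS[model]
    return [speed_up(p, c) for p, c in zip(rates["python"], rates["cpp"])]


for model in RESULTS:
    cells = ", ".join(
        f"{col}: x{ratio:.2f}" for col, ratio in zip(COLUMNS, speed_up_row(model))
    )
    print(f"{model}: {cells}")
```

A ratio below 1, as in every column for "Discriminator" and "AE2d", means the C++ run completed fewer iterations per second than the Python run.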

As shown above, "AE2d" in C++ is much faster than it was before.
However, it still could not beat the execution speed of Python.

Looking at the details, the convolutional layer and the transposed convolutional layer seem to be the slow parts.
However, with only fully connected layers, execution in C++ is much faster than in Python.
This alone may make training in C++ worthwhile.

I look forward to future improvements in the PyTorch C++ API for models "2" and "3".

@meet-minimalist
Author

Thanks a lot for such detailed experiments. One more thing I recently discovered and would like to share: when transferring training data from RAM to GPU memory, people generally use pinned memory, a designated area of RAM from which copies into GPU memory are faster. I have seen this while working with TensorRT-related operations in C++, where the input tensor is allocated in pinned memory and, once the data is there, a memcpy call copies it from pinned memory to the GPU for further computation. This may give you a further boost in C++ timings.

PS: Please write a paper or Medium article about your findings alongside the qiita.com blog, because the world needs to know about them. Keep it up.

@koba-jon
Owner

Thank you for sharing this information.
I will follow the page below and try to improve the "dataloader" class.
https://pytorch.org/docs/stable/data.html#memory-pinning
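
For reference, the memory pinning described on that page looks roughly like the following on the Python side. This is a minimal sketch, not code from this repository: the random stand-in dataset, tensor shapes, and batch size are assumptions, and pinning is only enabled when a CUDA device is available. Whether the repository's C++ "dataloader" class can be adapted in the same way is not shown here.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Use the GPU when one is present; pinned memory only helps host-to-GPU copies.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_pinned = device.type == "cuda"

# Random stand-in data (hypothetical shapes: 64 RGB images of 32x32, 10 classes).
dataset = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))

# pin_memory=True asks the loader to place each batch in page-locked host memory.
loader = DataLoader(dataset, batch_size=16, shuffle=True, pin_memory=use_pinned)

for images, labels in loader:
    # non_blocking=True lets the copy from pinned memory overlap with other work.
    images = images.to(device, non_blocking=use_pinned)
    labels = labels.to(device, non_blocking=use_pinned)
    # The forward and backward passes of the model would go here.
```

Any speed gain depends on how much of each iteration is spent on host-to-GPU transfer, so the effect would need to be measured for each model.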

Please look forward to the follow-up report.

@koba-jon koba-jon added the question Further information is requested label Nov 18, 2021