Convert DG-RePlAce algorithm to Kokkos #5352
clang-tidy made some suggestions
There were too many comments to post at once. Showing the first 25 out of 52. Check the log or trigger a new build to see more.
// OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
///////////////////////////////////////////////////////////////////////////////

#include "gpl2/MakeDgReplace.h"
warning: 'gpl2/MakeDgReplace.h' file not found [clang-diagnostic-error]
#include "gpl2/MakeDgReplace.h"
^
//
///////////////////////////////////////////////////////////////////////////////

#include <Kokkos_Core.hpp>
warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]
#include <Kokkos_Core.hpp>
^
//
//
///////////////////////////////////////////////////////////////////////////////
#include <Kokkos_Core.hpp>
warning: 'Kokkos_Core.hpp' file not found [clang-diagnostic-error]
#include <Kokkos_Core.hpp>
^
///////////////////////////////////////////////////////////////////////////////
#include <Kokkos_Core.hpp>

void dct_2d_fft(const int M,
warning: parameter 'M' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- void dct_2d_fft(const int M,
+ void dct_2d_fft(int M,
#include <Kokkos_Core.hpp>

void dct_2d_fft(const int M,
const int N,
warning: parameter 'N' is const-qualified in the function declaration; const-qualification of parameters only has an effect in function definitions [readability-avoid-const-params-in-decls]
- const int N,
+ int N,
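The two suggestions above rest on a language rule: top-level const on a by-value parameter is not part of the function's type, so a declaration without it still matches a definition that keeps it. A minimal sketch, using a hypothetical function `binArea` rather than the real `dct_2d_fft` signature (which is only partially shown here):

```cpp
#include <type_traits>

// Hypothetical example function, not part of the PR.
// Declaration (as it would appear in a header): no const on by-value params.
int binArea(int m, int n);

// Definition: const only stops the body from modifying its local copies.
int binArea(const int m, const int n)
{
  return m * n;
}

// Both spellings name the same function type.
static_assert(std::is_same<int(int, int), int(const int, const int)>::value,
              "top-level const on by-value parameters is ignored in the type");
```

Callers see the same interface either way, which is why the check asks for the qualifier to be dropped from the declaration only.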
binCntY_ = 512;
}

binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]
src/gpl2/src/placerBase.cpp:40:
- #include <cstdio>
+ #include <cmath>
+ #include <cstdio>
- binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
+ binSizeX_ = std::ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
}

binSizeX_ = ceil(static_cast<float>((ux_ - lx_)) / binCntX_);
binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);
warning: call to 'ceil' promotes float to double [performance-type-promotion-in-math-fn]
- binSizeY_ = ceil(static_cast<float>((uy_ - ly_)) / binCntY_);
+ binSizeY_ = std::ceil(static_cast<float>((uy_ - ly_)) / binCntY_);
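The `std::` prefix matters because `<cmath>` provides a `float` overload of `std::ceil`, so a `float` argument is not widened to `double`. A small sketch of the same computation shape; the helper name `binSize`, its parameters, and the `int` return type are assumptions made for illustration, not code from the PR:

```cpp
#include <cmath>
#include <type_traits>

// std::ceil has a float overload, so a float argument yields a float result.
static_assert(std::is_same<decltype(std::ceil(1.5f)), float>::value,
              "std::ceil(float) returns float");

// Hypothetical helper mirroring the shape of the binSizeX_ computation.
int binSize(int lo, int hi, int binCnt)
{
  return static_cast<int>(std::ceil(static_cast<float>(hi - lo) / binCnt));
}
```

The cast to `float` happens before the division, so the quotient is fractional and the ceiling rounds a partial bin up to a full one.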
#include <string>
#include <vector>

#include "db_sta/dbNetwork.hh"
warning: 'db_sta/dbNetwork.hh' file not found [clang-diagnostic-error]
#include "db_sta/dbNetwork.hh"
^
int64_t nesterovInstsArea() const
{
return stdInstsArea_
+ static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));
warning: call to 'round' promotes float to double [performance-type-promotion-in-math-fn]
src/gpl2/src/placerBase.h:38:
- #include <memory>
+ #include <cmath>
+ #include <memory>
- + static_cast<int64_t>(round(macroInstsArea_ * targetDensity_));
+ + static_cast<int64_t>(std::round(macroInstsArea_ * targetDensity_));
///////////////////////////////////////////////////////////////
// Instance
Instance::Instance()
: inst_(nullptr),
warning: member initializer for 'inst_' is redundant [modernize-use-default-member-init]
- : inst_(nullptr),
+ : ,
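The warning says the constructor repeats a value the member already receives by default. A sketch of the pattern, assuming (the excerpt does not confirm it) that `inst_` carries a default member initializer in the class body; `InstanceSketch` and `Inst` are hypothetical stand-ins for the real types:

```cpp
struct Inst;  // hypothetical opaque type standing in for the database instance

class InstanceSketch
{
 public:
  // No mem-initializer list needed: inst_ takes its in-class default.
  InstanceSketch() = default;

  const Inst* inst() const { return inst_; }

 private:
  Inst* inst_ = nullptr;  // default member initializer
};
```

With the default stated once in the class body, the redundant entry should be removed from the constructor's initializer list entirely, not left as an empty slot.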
Earlier the runtime difference was reported to be minimal, but 0:57.70 vs 1:33.49 is more substantial. Is this expected?
Earlier measurements were done when some parts were still using native CUDA, and using a different design ( I'd expect it to be possible to achieve a similar runtime using Kokkos. These results might suggest that there are some unnecessary memory copies between host and device, but this needs to be investigated further.
Please try to get a more precise measure of the runtime difference, as this is important in deciding whether Kokkos is a good alternative to direct CUDA coding. Do all the various versions produce the same result? That is also important.
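One way to make such a comparison less sensitive to a single noisy run is to repeat the measurement and report the spread. The following is only a sketch: `timeRuns` and `RunStats` are hypothetical names, and the callable stands in for whatever invokes the placer.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <vector>

struct RunStats
{
  double min_s;
  double max_s;
  double mean_s;
};

// Runs `work` `runs` times (runs must be > 0) and reports wall-clock statistics.
RunStats timeRuns(const std::function<void()>& work, int runs)
{
  std::vector<double> secs;
  secs.reserve(runs);
  for (int i = 0; i < runs; ++i) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    secs.push_back(std::chrono::duration<double>(stop - start).count());
  }
  double sum = 0.0;
  for (double s : secs) {
    sum += s;
  }
  return {*std::min_element(secs.begin(), secs.end()),
          *std::max_element(secs.begin(), secs.end()),
          sum / runs};
}
```

For GPU work the callable has to block until the device has finished (in Kokkos, by calling `Kokkos::fence()` before returning); otherwise the clock stops while kernels are still running.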
What was the thinking behind making kokkos a dependency but kokkos-fft a submodule? It seems like they could both be build dependencies (and added to the DependencyInstaller with an option).
I think I would say direct CUDA coding isn't really a viable option. I would be personally opposed to its inclusion. I think Kokkos or something like it is the only viable path forward. The runtime differences don't look significant if you compare it to the overall speedup achieved. We're going for a pragmatic path forward, and to me this meets my bar for the goals we set out.
Agree that this is important to check. We may need to order the floats to get identical/sufficiently similar results.
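Ordering matters because floating-point addition is not associative: the same values summed in a different order can round differently, and parallel reductions generally do not fix the order. A minimal sketch of the effect and of one possible remedy (sorting before summing); the function names are hypothetical, and whether this is the cause of the differing results in this PR is not established:

```cpp
#include <algorithm>
#include <vector>

// Sums the values in exactly the order given.
float sumInOrder(const std::vector<float>& values)
{
  float sum = 0.0f;
  for (float v : values) {
    sum += v;
  }
  return sum;
}

// Sorts a copy first so the accumulation order no longer depends on
// the order in which the values happened to arrive.
float sumSorted(std::vector<float> values)
{
  std::sort(values.begin(), values.end());
  return sumInOrder(values);
}
```

With the inputs {1e8f, 1.0f, -1e8f} and {1e8f, -1e8f, 1.0f}, `sumInOrder` gives different answers because 1.0f is absorbed when added to 1e8f, while `sumSorted` gives the same answer for both.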
You personally pushed for the inclusion of gpuSolver.cu and said it was valuable as a template for future development. Shall we delete it? I was never in favor. A 50% overhead is worth exploring to at least understand, if not eliminate.
I think that seems like the right move at this point. With more time and context I don't think it's viable for us to maintain two codebases.
+1. I just want to point out that if this is the fastest we can go, that seems fast enough to me.
No, they don't, and it was quite surprising, as I expected the original code and Kokkos with the CUDA backend to produce the same result. NVCC should do the pre-processing and compilation of device code, produce a CUDA binary, and leave the host code to the host compiler. We checked that when I suspect that this issue isn't only related to Eigen: when I disabled initial placement, the runtimes of the Kokkos and original code were almost the same, but the results were still different (I haven't investigated the reason for this).
kokkos-fft is a header-only interface library that translates FFT calls to the proper backend by detecting the backends enabled in Kokkos, but I agree: if preferred, both kokkos and kokkos-fft could be dependencies.
I think this overhead is due to the different initial placement; when initial placement is disabled, the runtime is very similar:
I also did precise measurements using an RTX 3080, an 8 vCPU i9-12900 @ 2.42 GHz, and 32 GB of RAM, with 10 runs using
Thanks for the analysis. It would be good to get to the bottom of the difference as it will make regression testing hard otherwise. Is |
Arguments that are passed to |
Another possibility is that it is invoking a different g++ binary from another path.
Converted to a draft due to no progress.
Signed-off-by: ZhiangWang033 <[email protected]>
Co-authored-by: Kamil Rakoczy <[email protected]> Co-authored-by: Jan Bylicki <[email protected]> Signed-off-by: Krzysztof Bieganski <[email protected]> Signed-off-by: Kamil Rakoczy <[email protected]> Signed-off-by: Jan Bylicki <[email protected]>
04d428f to 925dd93
I've rebased this branch onto latest
I've found that not to be the case. Earlier, I recreated the same condition (where Eigen was running slowly) using
To prioritize merging of GPU-accelerated placement, the focus was to get the branch issue-free before optimizing. In my testing, Kokkos-based algorithm on Future / subsequent work:
I added a configuration option to |
I would prefer to see kokkos as part of the dependency installer rather than as a submodule. There should be no need to compile it for each workspace on a machine.
With the current setup, it would be possible to support both compilation schemes, with the priority set towards the |
If someone wants to put a local copy in-tree, that's fine, but I'd like to avoid having a submodule.
I'll add support for |
Signed-off-by: Jan Bylicki <[email protected]>
Signed-off-by: Jan Bylicki <[email protected]>
072e3b1 to 2dcac77
This MR converts the DG-RePlAce algorithm, originally written for CUDA, to Kokkos.
Kokkos provides an abstraction for writing parallel code that can be translated to several backends, including CUDA, OpenMP, and C++ threads.
Tested on a single run with an RTX 3090 and an i7-8700 CPU @ 3.20GHz using the ariane133 design.