How to launch the benchmarks? #120
This works on the computer I run them on. The particular error you cited comes from trying to compile code wrapping MKL, so I'll replace that part with MKL_jll, since I believe that should work for everyone. Next come the compilers; I guess I may as well do the same for those.
Hi Chris, first let me use this occasion to express again how impressive I find your work on LoopVectorization.jl! I am currently trying to write a package called BTL.jl that aims to perform benchmarks for a large number of micro-kernels (dot, axpy, gemm, copy, permutation, ...) implemented in different languages (Julia, C, C++, Fortran, ...), with different implementations for each language, different compilers (and different options), on different machines. I used to do this from C++, but I think Julia is a perfect candidate for the task. For example, one would have access to the generated assembly and source code for each performance curve. All the measurement data will be stored on a database server, and I have started to write a JS client to browse this database. The project has multiple objectives.
I use this kind of performance analysis to teach HPC (and, incidentally, why Julia is so powerful). I used to do it internally at my previous company, but now is a good time to go open source :) I have experienced some problems with BenchmarkTools, which consumes more and more memory as the different benchmarks are performed... I haven't understood exactly why (although François explained it to me twice ;). So, looking at your superb performance graphs, I thought that I could learn how to do it properly in Julia ;). I guessed that you use Distributed in order to speed up the measurements, but it may also be a proper way to release the memory allocated by @belapsed. For Intel's compilers, I think that you can obtain free licenses as an open-source developer; let me know if you have problems getting licenses. An easy installation would be key to gathering data from many users and improving the machine dimension. I thought that a Docker distribution might be appropriate for this purpose (so users don't have to install all the prerequisites manually).
Thanks. It is my intention for LoopVectorization's macrokernels to be optimal, or very nearly so -- and by extension, the microkernels as well. In particular, I'm curious about prefetching at the moment. In rare cases, LoopVectorization will insert prefetches. The prefetch distance has been tuned to my own computer, but I probably need a much larger sample to pick a good way of doing this. BTL.jl sounds like an interesting project.
Huh, I never noticed that. I did not deliberately add any workarounds. Is there a leak, or why doesn't the GC take care of it?
My benchmark script is a bit of a mess, so I wouldn't follow it too closely ;).
The computer has 18 cores, so I run them on 17. Maybe the extra noise of going with 18 would be fairly negligible. Cache conflicts might be an issue -- but I assume in "real world" use cases where you care about performance, you'll run into those sorts of problems anyway. Also, you may want to set
That'd be a good idea.
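The setting being suggested above was cut off in this capture; one plausible candidate (purely an assumption, not necessarily what was meant) is pinning BLAS to a single thread, so that 17 benchmark workers don't oversubscribe the 18 cores:

```julia
using LinearAlgebra  # stdlib; exposes the BLAS submodule

# Hypothetical example: give each benchmark worker a full core to itself by
# preventing OpenBLAS from spawning its own thread pool.
BLAS.set_num_threads(1)
println(BLAS.get_num_threads())  # prints 1
```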
Well, you may try the following MWE and check the memory consumption (even after the go() function returns).
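The MWE itself did not survive this capture; below is a Base-only sketch of the same kind of check (the original reportedly used BenchmarkTools' @belapsed, so the function and sizes here are my own assumptions):

```julia
# Hypothetical stand-in for the lost MWE: time work on large arrays in a loop,
# then inspect resident memory after go() returns. With BenchmarkTools'
# @belapsed in place of the plain @elapsed below, resident memory reportedly
# keeps growing even after go() has returned.
function go()
    for n in (10^5, 10^6, 10^7)
        x = rand(n)
        t = minimum(@elapsed(sum(x)) for _ in 1:10)
        println("n = $n: $t s")
    end
end

go()
println("resident memory after go(): ", Sys.maxrss() ÷ 2^20, " MiB")
```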
I guess you are right. But the Docker image could include all the other libraries and open-source compilers (e.g. Eigen). BTW, I think there is room for improvement in interfacing Eigen, using aligned Eigen Maps over Julia's pointers; I should try to make a PR.
I can verify -- memory consumption was high. Repeating the benchmarks, I also noticed that by the time they finished, each process consumed about 0.9% of the 128 GiB of RAM, which means they consumed about 20 GiB in total. Maybe the script should kill workers and add replacements between benchmarks? You should be able to activate and instantiate the environment. Regarding Eigen alignment, improvements are appreciated. It was only recently that I got it to start actually using AVX512, which dramatically boosted performance in some benchmarks. But on alignment, beware that small Julia arrays are not necessarily aligned:

```julia
julia> reinterpret(Int, pointer(rand(2,2))) % 64
0

julia> reinterpret(Int, pointer(rand(2,2))) % 64
48

julia> reinterpret(Int, pointer(rand(2,2))) % 64
48

julia> reinterpret(Int, pointer(rand(2,2))) % 64
16
```

Regarding performance, my understanding was largely based on this blog post.
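The "activate and instantiate" step mentioned above presumably refers to Pkg's environment workflow; a minimal sketch (the "benchmark" path is an assumption about the repo layout):

```julia
using Pkg  # stdlib

# Hypothetical: activate the benchmark subproject's environment and install
# the exact dependency versions recorded in its Manifest.
Pkg.activate("benchmark")  # path is an assumption; point it at the real env
Pkg.instantiate()
```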
Thanks, I think that you are right about alignment having a negligible impact for Eigen. I have also noticed that Eigen does not perform very well with icc and clang, but does very well with gcc. I "succeeded" on a first launch of the benchmark, but it dies after the first table...
Wow, my Eigen benchmark results are probably all wrong. I pushed the fix. You should delete "libetest.so", because this file is supposed to be compiled by g++, but the bug replaced the g++ version with the clang one. Then you should be able to run the benchmarks.
Hi again, I am probably doing something wrong: I did a pull this morning, but I get the same error.
Do you have LoopVectorization checked out for development? Or did you (EDIT: It looked for the shared library in You can look inside
See here for the bad code: it checked for the wrong file. The master branch, on the other hand, correctly creates the file.
Thanks. Concerning BTL.jl, the memory consumption of BenchmarkTools is related to the size of the arguments; I have problems because I use very large arrays. As a workaround, I can simply make a switch between ... I will send you the JLD2 of the results. Also, one should advise that the benchmarks be carried out during winter, in order to make a useful use of the generated heat ;)
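The two alternatives being switched between were cut off above; one plausible pair (purely my assumption) is BenchmarkTools' @belapsed for small inputs versus a plain @elapsed loop for very large arrays, trading statistical rigor for bounded memory:

```julia
# Hypothetical workaround: a minimum-of-N @elapsed timing that retains no
# samples, unlike BenchmarkTools, so huge inputs don't inflate memory use.
function cheap_belapsed(f, x; samples::Int = 5)
    f(x)  # warm-up call to trigger compilation
    minimum(@elapsed(f(x)) for _ in 1:samples)
end

println(cheap_belapsed(sum, rand(10^6)))
```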
I issued a new release yesterday that should improve AVX2 performance, mostly in the matrix multiplication benchmarks.
I don't see an issue with BenchmarkTools. Am I missing one? If not, you should open one with that minimal example.
CpuId.jl will only work on Intel and AMD, but that covers most.
Ha.
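CpuId.jl reads the x86 cpuid instruction directly, which is why it is Intel/AMD-only. As a rough, vendor-agnostic fallback (this is Base's API, not CpuId.jl's), some of the same information is available from Sys:

```julia
# Base-only sketch (not CpuId.jl): coarse host-CPU information.
println(Sys.CPU_NAME)                 # LLVM's name for the host microarchitecture
println(Sys.CPU_THREADS)              # number of logical threads
println(first(Sys.cpu_info()).model)  # OS-reported model string
```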
Hi Chris, I think that this issue corresponds to the problem. BTW, this memory issue is also a problem here, because the LoopVectorization benchmarks also exhausted my machine's memory (nearly 16 GB at the end). I should be able to send you a JLD2 for my machine tomorrow.
Hi again, I have made a simple experiment that shows how to release the memory allocated (and not deallocated) by BenchmarkTools measurements. Doing so, I learned a little more about Distributed.jl. I need two files. The main one should be included by the user.
And a second one (here) named
You can check that the memory usage does not increase during the loop over the functions.
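The two files themselves are not reproduced in the thread as captured here; below is a self-contained sketch of the approach being described (all names, including the toy benchmark, are my own):

```julia
using Distributed  # stdlib

# Sketch of the memory-release trick described above: run each measurement on
# a throwaway worker, then remove the worker so everything it allocated is
# returned to the OS, regardless of what the benchmarking tool retained.
function on_fresh_worker(f, args...)
    pid = only(addprocs(1))            # spawn one fresh worker process
    try
        return remotecall_fetch(f, pid, args...)
    finally
        rmprocs(pid)                   # kill the worker; its memory is freed
    end
end

# Toy "benchmark": time summing a large vector on the worker.
t = on_fresh_worker(n -> @elapsed(sum(rand(n))), 10^6)
println("elapsed on worker: $t s")
```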
Thank you, I'll update the benchmarks to follow that approach soon.
The benchmarks now start and remove workers for each benchmark. That should stop the memory from growing over time.
Great! I think that you forgot to commit setup_worker.jl.
Oops, just added it!
Sorry if I missed something obvious, but I get
That's what I get for not rerunning all of them locally. I fixed that in the last commit.
Thanks! It seems to work now. I happen to have an issue with Pango, but it is not important and probably related to my local environment. I think that it may be safer to save the results (JLD2) before plotting them. I will send you the results when it is done.
Hi,
I have downloaded the LoopVectorization source and all the tests pass.
I would like to launch the benchmarks (include("benchmark/driver.jl")), but I fail to adapt directcalljit.f90, which contains hard-coded paths, and I got an error:

```
Bad # preprocessor line #if defined MKL_DIRECT_CALL_SEQ_JIT
```

What is the correct way to launch the benchmarks?