AVX2 optimization of group lookup. #167
base: master
Conversation
Wow, this is great, I really appreciate it @bashimao! I did not investigate the AVX2 code, but the change looks very safe as it is conditioned by a compilation flag. Did you see better performance when using AVX2 versus SSE2?
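For readers wondering what "conditioned by a compilation flag" looks like in practice, here is a minimal sketch of that kind of gating, where the group width is chosen at compile time from the instruction sets the compiler was told to target. The macro checks are standard GCC/Clang predefines, but the constant name is illustrative and not parallel-hashmap's actual identifier.

```cpp
#include <cstddef>

// Illustrative only: pick a group width based on compiler flags.
// __AVX2__ is defined by GCC/Clang when compiling with -mavx2;
// __SSE2__ is defined for SSE2 targets (the default on x86-64).
// kGroupWidth is a hypothetical name, not phmap's real identifier.
#if defined(__AVX2__)
constexpr std::size_t kGroupWidth = 32;  // 32 control bytes probed per group
#elif defined(__SSE2__)
constexpr std::size_t kGroupWidth = 16;  // 16 control bytes probed per group
#else
constexpr std::size_t kGroupWidth = 8;   // portable scalar fallback
#endif
```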
@bashimao I'll check this evening after work how much this improves performance.
@bashimao I did some performance testing and unfortunately I cannot see any performance improvement with the AVX2 implementation. Maybe it would show only when the hash map is close to full (just before resizing). I'll do some more testing, but I'd like to know what you saw in your own testing.
Hi, sorry for the delay. I am currently travelling overseas. I'll get back to you later this week with some benchmark results.
Well, that is now a big question, and very hard to answer. I admit I may have been chasing a ghost here. I had a use-case where it was faster, but only marginally. And here you can see why: this is mostly your random insertion benchmark, but for reproducibility reasons this one only benchmarks the non-parallel map.

As can be seen, we already achieve quite decent performance without SSE. SSE is the quickest, but not always. AVX2 occasionally overtakes, but only if you do not keep max occupancy at 87.5%. The reason is that the AVX2 implementation reads 32 slots per group (see the probe sketch below this comment). If we limit occupancy to 7/8 (87.5%), each group we read contains 4 blanks on average, so the likelihood that we need to read another group is low. But the same is true with SSE, where we read 16 slots per group with an average of 2 blanks, so there is no real gain for the majority of reads. I think the number of bytes read doesn't really matter, because the cache line size of this CPU is 64 bytes anyway. However, some of the AVX2 instructions have a slightly higher latency than their SSE counterparts, which I assume is the reason why it is slower; IMHO that has to do with the larger register file in AVX2. In any case, the equivalent AVX2 instructions will either be as fast as their SSE counterparts or a little bit slower.

As you can see, the AVX2 implementation has the potential to overtake the SSE version, but only if we reduce to an average of 3 blanks per group, or even 2 blanks per group. As the SSE implementation shows, having 2 blanks per group is simply enough: after all, we only need to find a single blank before we can stop searching. However, the side-by-side comparison breaks down there, because I need to increase the max occupancy of the hash table to achieve a lower blanks-per-group ratio. In a way that is nice, because we make more efficient use of our memory cells with the AVX2 version. But because that also means the table is expanded at different times, the comparison breaks down: at certain sizes the AVX2 implementation is certainly faster, but that is incidental, because the table has a different fill state at those times.

I think merging this is fine (with some minor adjustments), but I would still select SSE by default. It is at least always faster than the non-optimized version. Two things that I would want to explore next:
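To make the 32-slot vs. 16-slot distinction above concrete, here is a minimal sketch of a SwissTable-style group probe at both widths, assuming one-byte control words; the intrinsics are standard AVX2/SSE2, but the functions are illustrative and not the actual code in this PR:

```cpp
#include <immintrin.h>   // AVX2 and SSE2 intrinsics
#include <cstdint>

// Illustrative sketch only, not the PR's actual code: match a hash
// fragment h2 against a group of one-byte control words and return
// one bit per slot that matched.

// 32-slot AVX2 group probe.
inline uint32_t match_h2_avx2(const int8_t* ctrl, int8_t h2) {
    const __m256i group = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(ctrl));
    const __m256i eq    = _mm256_cmpeq_epi8(group, _mm256_set1_epi8(h2));
    return static_cast<uint32_t>(_mm256_movemask_epi8(eq));  // 1 bit per slot
}

// 16-slot SSE2 group probe, for comparison.
inline uint32_t match_h2_sse2(const int8_t* ctrl, int8_t h2) {
    const __m128i group = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
    const __m128i eq    = _mm_cmpeq_epi8(group, _mm_set1_epi8(h2));
    return static_cast<uint32_t>(_mm_movemask_epi8(eq));
}
```

At 7/8 max occupancy this lines up with the arithmetic in the comment above: the 16-slot probe sees roughly 16/8 = 2 blanks per group and the 32-slot probe roughly 32/8 = 4, so either width almost always finishes in the first group it reads.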
Actually, this could be the root cause of the issue. With 64-byte cache lines, an unaligned 32-byte read will often hit 2 cache lines. With a 16-byte lookup like SSE, you may - more often than not - get by with reading only a single cache line. I need to investigate. Maybe it is possible to align the group lookup. https://lemire.me/blog/2022/06/06/data-structure-size-and-cache-line-accesses/
Hey @bashimao, thanks for following up on this and for the hard work. I think the group lookup starts at the index determined by the hash, so I'm not sure how it can be aligned, and you are right that maybe the cost of accessing more memory for every lookup is higher than the benefit for the rare occasions where the match is not found in the first 16 slots. I'm really not very knowledgeable on SSE/AVX2 so I'm afraid I don't have any specific input, but kudos for the work and good luck finding a faster implementation.
Actually, you can probably align the lookup and discard the matches before the starting index, so you hit only one cache line.
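A rough sketch of that suggestion, assuming 32-byte groups of one-byte control words and a control array that is itself 32-byte aligned; the helper below is hypothetical and not code from this PR:

```cpp
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Illustrative only: probe the 32-byte-aligned group containing `index`,
// then mask off match bits for slots before `index`, so the caller only
// sees candidates at or after the original starting position.
// Requires `ctrl` to be 32-byte aligned for the aligned load below.
inline uint32_t match_h2_aligned(const int8_t* ctrl, std::size_t index, int8_t h2) {
    const std::size_t group_start = index & ~std::size_t(31); // round down to a 32-byte boundary
    const unsigned    offset      = unsigned(index & 31);     // position of `index` inside the group
    const __m256i group = _mm256_load_si256(                  // aligned load: touches a single cache line
        reinterpret_cast<const __m256i*>(ctrl + group_start));
    const __m256i eq    = _mm256_cmpeq_epi8(group, _mm256_set1_epi8(h2));
    const uint32_t mask = static_cast<uint32_t>(_mm256_movemask_epi8(eq));
    return mask & (~0u << offset);                             // discard matches before `index`
}
```

Whether discarding the low bits this way preserves the table's probing semantics (wrap-around within the group, tombstones, and so on) is exactly what the investigation above would need to confirm.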
Thanks for sharing your concern. However, it turned out not to be that tricky, and I managed to implement a memory-aligned version. Timings didn't change on the AMD Epyc CPU. Assuming I did it correctly, that could suggest that cache-line access doesn't matter. I will try to compile and run on an Intel machine (with considerably less cache) next. In any case, I will learn something in the process. My private laptop ("Intel Tiger Lake") curiously also supports AVX-512 and the backported instructions. I will need to set up an environment though, so don't expect progress overnight.
Good luck!
Thanks for your help earlier today. We have been using your flat_hash_map in Merlin HugeCTR for quite some time now. I have implemented AVX2-accelerated groups for parallel-hashmap, which I hope you can absorb into the next release version. Looking forward to your review. To replicate, compile with
cmake -DPHMAP_BUILD_TESTS=ON -DPHMAP_BUILD_EXAMPLES=ON -DCMAKE_CXX_FLAGS=-mavx2 ..
and run on a machine with AVX2 support.
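For anyone replicating this, a quick sanity check (not part of the PR) that the AVX2 code paths were compiled in and that the host CPU actually supports them; this assumes GCC or Clang, which provide the __builtin_cpu_supports builtin:

```cpp
#include <cstdio>

int main() {
#if defined(__AVX2__)
    std::printf("compiled with AVX2 code paths (-mavx2)\n");
#else
    std::printf("AVX2 code paths NOT compiled in\n");
#endif
    // GCC/Clang builtin: runtime check that the CPU supports AVX2.
    std::printf("cpu supports avx2: %s\n",
                __builtin_cpu_supports("avx2") ? "yes" : "no");
    return 0;
}
```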