Add vector-times-matrix-transposed benchmark (V2) #40
Conversation
This adds benchmarks for `vmt`, with very similar supporting structure to the existing `mmt` benchmark, but with different strategies tuned for matvec. This adds three strategies: 1) Treat it like a reduction with one workgroup per row, relying on the cache for reuse of the vector. 2) Copy the vector to shared memory using all threads in the workgroup and then process N0 rows per workgroup, with WG_Y | N0 threadgroups. 3) Use a fixed number of workgroups, with each workgroup striding over the full problem space. This should limit the overhead of setting up the vector in shared memory, as well as reduce scheduling overhead. Currently, the best configurations for each of the three strategies are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec on an AMD 7900xtx).
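The third strategy above can be sketched in host-side pseudocode. This is a hypothetical illustration (not the benchmark's actual GLSL): a fixed number of workgroups each stride over all rows, so per-workgroup setup costs are amortized across many rows. The function name and `num_workgroups` default are illustrative only.

```python
import numpy as np

def vmt_strided(vec, mat, num_workgroups=8):
    """Compute out[i] = dot(vec, mat[i, :]) using a grid-stride partition:
    workgroup `wg` handles rows wg, wg + num_workgroups, wg + 2*num_workgroups, ...
    """
    n_rows = mat.shape[0]
    out = np.zeros(n_rows, dtype=mat.dtype)
    for wg in range(num_workgroups):                    # one "workgroup" at a time
        for row in range(wg, n_rows, num_workgroups):   # stride over the row space
            out[row] = np.dot(vec, mat[row])
    return out
```

On a GPU, each workgroup would stage the vector in shared memory once and then reuse it for every row it visits, which is the point of the striding.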
* Drop alternative implementations
* Pick a SIMD-contiguous access pattern
* Add more problem size configurations
* Measure performance in GB/s
* Add comments
I added code to prefetch the LHS and RHS in hopes of hiding latency. I'm seeing better numbers now:
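The prefetching idea is essentially software pipelining: issue the loads for the next tile before reducing the current one, so memory latency overlaps with arithmetic. A minimal sketch, with illustrative names (`tile`, `dot_prefetched`) that are not taken from the PR:

```python
import numpy as np

def dot_prefetched(vec, row, tile=4):
    """Dot product with a double-buffered tile rotation, mimicking how a
    shader would prefetch the next chunk of LHS/RHS while consuming the
    current one. Assumes len(vec) is a multiple of `tile`."""
    n = len(vec)
    acc = 0.0
    # Prefetch the first tile.
    a, b = vec[0:tile].copy(), row[0:tile].copy()
    for start in range(0, n, tile):
        nxt = start + tile
        # Issue the next "loads" before consuming the current tile.
        if nxt < n:
            a_next, b_next = vec[nxt:nxt + tile].copy(), row[nxt:nxt + tile].copy()
        acc += float(np.dot(a, b))        # consume the current tile
        if nxt < n:
            a, b = a_next, b_next         # rotate buffers
    return acc
```

In Python the buffers are sequential, of course; on the GPU the loads and the multiply-accumulate genuinely overlap in flight.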
Load 128 bits at a time
New numbers with the wider load type:
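As an analogy for the wider loads: a 128-bit load moves four f32 values (16 bytes) per memory transaction instead of one, as a `vec4` would in GLSL. A purely illustrative sketch (the function name is made up):

```python
import numpy as np

def dot_vec4(vec, row):
    """Accumulate the dot product four f32 lanes at a time, mirroring
    128-bit (vec4) loads. Assumes length is a multiple of 4."""
    assert len(vec) % 4 == 0
    acc = np.zeros(4, dtype=np.float32)
    for i in range(0, len(vec), 4):
        acc += vec[i:i+4] * row[i:i+4]   # one 4-wide multiply-accumulate per "load"
    return float(acc.sum())
```

Fewer, wider transactions better saturate the memory bus, which is why this shows up directly in the GB/s numbers.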
@antiagainst @qedawkins I'm pretty happy with this implementation. Should we merge?
Works for me; can I give it a pass tomorrow first?
LGTM! Thanks for the pathfinding!
Hi, ~/code/uVkCompute/build/benchmarks/vmt
@oscarbg noting that, as of today, you can see the GLSL compile target here: 3049af9#diff-62da6f62b4091626b341c9d8333d332aee35c053ff57cacebbb57792b987702aR30 This is more to communicate that it has been tuned and tested on rdna3; in the future we may add more target-specific options to the GLSL.
@oscarbg Also, this indicates that one of the assumptions made in the GLSL does not hold on this target.
Based on #38 by @qedawkins, and the earlier mmt benchmark by @kuhar.
Add benchmarks for `vmt`, with very similar supporting structure to the existing `mmt` benchmark. Changes compared to #38:
The performance depends heavily on the problem size. On 7900XTX, I'm seeing numbers up to 945 GB/s on 8k problem size.
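A matvec is memory-bound on the matrix, so the GB/s figure is essentially bytes read divided by elapsed time. A rough sanity-check helper; the 2-byte element size (f16) and the 142 µs timing below are assumptions for illustration, not numbers from the PR:

```python
def matvec_gbps(m, n, elem_bytes, seconds):
    """Achieved read bandwidth for an m x n matvec: the matrix is read
    once and the vector once (ignoring cache reuse of the vector)."""
    bytes_read = m * n * elem_bytes + n * elem_bytes
    return bytes_read / seconds / 1e9

# e.g. an 8192 x 8192 matrix of 2-byte elements read in ~142 microseconds
# lands in the ~945 GB/s range reported above:
print(round(matvec_gbps(8192, 8192, 2, 142e-6)))
```

This also explains why larger problem sizes score higher: fixed dispatch and setup overheads shrink relative to the bytes moved.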