Add vector-times-matrix-transposed benchmark (V2) #40

Merged Nov 2, 2023 (5 commits)

@kuhar kuhar commented Oct 25, 2023

Based on #38 by @qedawkins, and earlier mmt by @kuhar.

Add benchmarks for vmt, with very similar supporting structure to the existing mmt benchmark.

Changes compared to #38:

  • Drop alternative strategies
  • Pick a SIMD-contiguous access pattern
  • Add more problem size configurations
  • Measure performance in GB/s
  • Add comments

The performance depends heavily on the problem size. On a 7900 XTX, I'm seeing up to 945 GB/s at the 8k problem size.
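
For readers unfamiliar with the operation, here is a minimal CPU sketch of what a vmt (vector-times-matrix-transposed) with i8 inputs and i32 accumulation could compute. It is not taken from the benchmark source: the function name `ReferenceVmt` and the row-major N x K layout of the matrix are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference: out[row] = sum_i lhs[i] * rhs[row][i].
// Assumes `rhs` holds n rows of k elements each, stored row-major, so every
// output element is a dot product of the vector with one contiguous row.
std::vector<int32_t> ReferenceVmt(const std::vector<int8_t>& lhs,
                                  const std::vector<int8_t>& rhs,
                                  std::size_t n, std::size_t k) {
  std::vector<int32_t> out(n, 0);
  for (std::size_t row = 0; row < n; ++row) {
    int32_t acc = 0;
    for (std::size_t i = 0; i < k; ++i) {
      // Widen both operands to 32 bits before multiplying (i8 -> i32).
      acc += static_cast<int32_t>(lhs[i]) *
             static_cast<int32_t>(rhs[row * k + i]);
    }
    out[row] = acc;
  }
  return out;
}
```

Because each row of the matrix is read exactly once and there is almost no reuse, the operation is bound by memory traffic, which is presumably why throughput is reported in GB/s.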

qedawkins and others added 2 commits September 13, 2023 22:47
This adds benchmarks for `vmt`, with very similar supporting structure
to the existing `mmt` benchmark, but with different strategies tuned for
matvec. This adds three strategies:

1) Treat it like a reduction with one workgroup per row, relying on
   cache to get reuse of the vector.
2) Copy the vector to shared memory using all threads in the workgroup
   and then process N0 rows per workgroup, with WG_Y | N0 threadgroups.
3) Use a fixed number of workgroups and have each workgroup stride over
   the full problem space. This should limit the overhead of setting up
   the vector in shared memory, as well as reduce scheduling overhead.

Currently, the best configurations for each of the above three strategies
are in the same performance ballpark (~20us for a 4096 * 4096x4096 matvec
on an AMD 7900xtx).
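
The row assignment in strategy 3 can be sketched on the CPU as a strided loop. This is only an illustration of the indexing idea, not code from either commit; `RowsForWorkgroup` and its parameters are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: lists the matrix rows that workgroup `wg` would
// process when a fixed `num_workgroups` strides over `num_rows` rows.
std::vector<std::size_t> RowsForWorkgroup(std::size_t wg,
                                          std::size_t num_workgroups,
                                          std::size_t num_rows) {
  std::vector<std::size_t> rows;
  // Start at the workgroup's own index and jump by the workgroup count,
  // so every row is visited by exactly one workgroup.
  for (std::size_t row = wg; row < num_rows; row += num_workgroups) {
    rows.push_back(row);
  }
  return rows;
}
```

With 4 workgroups and 10 rows, workgroup 1 would handle rows 1, 5, and 9, so any per-workgroup setup (such as staging the vector in shared memory) is paid once per workgroup rather than once per row.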
* Drop alternative implementations
* Pick a SIMD-contiguous access pattern
* Add more problem size configurations
* Measure performance in GB/s
* Add comments

kuhar commented Oct 26, 2023

I added code to prefetch LHS and RHS in the hope of hiding latency. I'm seeing better numbers now:

-----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                     Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------------------------------------------------
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x8]/Workgroup[64x1x1]/manual_time          53.5 us         12.5 us        11920 Bytes=314.242G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time         58.0 us         15.7 us        12337 Bytes=289.814G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x8]/Workgroup[64x1x1]/manual_time          60.7 us         16.4 us        11211 Bytes=276.524G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time         58.1 us         12.8 us        10386 Bytes=289.079G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x8]/Workgroup[64x1x1]/manual_time          56.1 us         14.0 us        12489 Bytes=299.303G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time         53.4 us         11.8 us        11683 Bytes=314.342G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x8]/Workgroup[64x2x1]/manual_time          64.4 us         13.4 us         9679 Bytes=260.648G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time         61.3 us         12.4 us         9736 Bytes=274.182G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x8]/Workgroup[64x2x1]/manual_time          67.9 us         16.9 us        10330 Bytes=247.387G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time         51.1 us         11.5 us        10830 Bytes=328.514G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x8]/Workgroup[64x4x1]/manual_time          79.2 us         19.5 us         9463 Bytes=212.14G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time         59.2 us         11.0 us         9671 Bytes=283.908G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x8]/Workgroup[64x1x1]/manual_time          76.9 us         14.8 us         7301 Bytes=873.317G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time         85.7 us         18.8 us         7442 Bytes=783.545G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x8]/Workgroup[64x1x1]/manual_time          77.7 us         10.6 us         7309 Bytes=864.229G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time         76.4 us         10.1 us         7115 Bytes=879.334G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x8]/Workgroup[64x1x1]/manual_time          80.4 us         16.7 us         6536 Bytes=835.583G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time         83.8 us         18.4 us         7601 Bytes=801.437G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x8]/Workgroup[64x2x1]/manual_time           102 us         10.8 us         6059 Bytes=657.937G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time          110 us         15.9 us         6062 Bytes=609.361G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x8]/Workgroup[64x2x1]/manual_time           103 us         17.6 us         6073 Bytes=651.854G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time          104 us         11.5 us         6162 Bytes=647.435G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x8]/Workgroup[64x4x1]/manual_time           146 us         11.9 us         4441 Bytes=459.739G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time          149 us         13.8 us         4380 Bytes=451.568G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x8]/Workgroup[64x1x1]/manual_time         358 us         11.2 us         1935 Bytes=751.019G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time        355 us         15.1 us         1958 Bytes=756.893G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x8]/Workgroup[64x1x1]/manual_time         361 us         11.4 us         1920 Bytes=744.092G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time        357 us         12.9 us         1948 Bytes=752.493G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x8]/Workgroup[64x1x1]/manual_time         372 us         12.4 us         1860 Bytes=722.532G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time        369 us         11.0 us         1879 Bytes=727.972G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x8]/Workgroup[64x2x1]/manual_time         431 us         15.9 us         1338 Bytes=622.646G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time        425 us         11.1 us         1337 Bytes=631.244G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x8]/Workgroup[64x2x1]/manual_time         415 us         10.7 us         1356 Bytes=647.791G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time        415 us         11.6 us         1358 Bytes=646.999G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x8]/Workgroup[64x4x1]/manual_time         596 us         11.2 us          960 Bytes=450.851G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time        595 us         10.9 us          967 Bytes=451.092G/s
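
As a sanity check on the `Bytes` column, the figures are consistent with counting the i8 matrix, the i8 vector, and the i32 output once each and dividing by the measured time. The sketch below reproduces that arithmetic; the byte accounting is an assumption inferred from the reported numbers, not taken from the benchmark source, and both function names are hypothetical.

```cpp
#include <cstdint>

// Assumed byte count for one i8 -> i32 vmt over an n x k matrix:
// each input is read once and the output is written once.
uint64_t VmtBytes(uint64_t n, uint64_t k) {
  const uint64_t matrix_bytes = n * k * sizeof(int8_t);
  const uint64_t vector_bytes = k * sizeof(int8_t);
  const uint64_t output_bytes = n * sizeof(int32_t);
  return matrix_bytes + vector_bytes + output_bytes;
}

// Converts a byte count and a time in microseconds to GB/s (10^9 bytes/s).
double GigabytesPerSecond(uint64_t bytes, double microseconds) {
  return static_cast<double>(bytes) / (microseconds * 1e-6) / 1e9;
}
```

For the 4096x4096 case this gives 16,797,696 bytes; at the reported 53.5 us that is roughly 314 GB/s, in line with the first row of the table. The small difference from the printed value comes from the time being rounded in the log.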

Load 128 bits at a time

kuhar commented Oct 31, 2023

New numbers with the wider (128-bit) load type:

Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time          54.3 us         12.0 us        10363 Bytes=309.157G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x32]/Workgroup[64x1x1]/manual_time          52.0 us         10.9 us        12351 Bytes=322.95G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x64]/Workgroup[64x1x1]/manual_time          50.8 us         11.1 us        12038 Bytes=330.542G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[1x128]/Workgroup[64x1x1]/manual_time         60.9 us         11.0 us        10596 Bytes=275.744G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time          52.2 us         12.1 us        12118 Bytes=321.931G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x32]/Workgroup[64x1x1]/manual_time          54.5 us         12.2 us        12197 Bytes=308.301G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x64]/Workgroup[64x1x1]/manual_time          54.3 us         12.5 us        12236 Bytes=309.169G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x128]/Workgroup[64x1x1]/manual_time         59.4 us         11.5 us        11041 Bytes=282.758G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time          52.9 us         12.4 us        12068 Bytes=317.288G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x32]/Workgroup[64x1x1]/manual_time          53.0 us         11.6 us        12137 Bytes=317.187G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x64]/Workgroup[64x1x1]/manual_time          60.0 us         17.5 us        12092 Bytes=279.866G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x128]/Workgroup[64x1x1]/manual_time         62.9 us         15.3 us        11022 Bytes=267.132G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time          58.7 us         10.5 us        10954 Bytes=286.225G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x32]/Workgroup[64x2x1]/manual_time          52.8 us         10.4 us        10776 Bytes=318.315G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x64]/Workgroup[64x2x1]/manual_time          53.0 us         10.6 us        10903 Bytes=316.809G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[2x128]/Workgroup[64x2x1]/manual_time         61.2 us         10.6 us         9103 Bytes=274.38G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time          63.3 us         15.6 us        10332 Bytes=265.485G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x32]/Workgroup[64x2x1]/manual_time          49.8 us         11.0 us        10735 Bytes=337.464G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x64]/Workgroup[64x2x1]/manual_time          55.1 us         10.4 us        11037 Bytes=305.028G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x128]/Workgroup[64x2x1]/manual_time         60.7 us         11.7 us         9199 Bytes=276.889G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time          60.1 us         11.1 us         9309 Bytes=279.383G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x32]/Workgroup[64x4x1]/manual_time          63.9 us         12.7 us         9296 Bytes=262.777G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x64]/Workgroup[64x4x1]/manual_time          68.9 us         15.1 us         9552 Bytes=243.732G/s
Radeon RX 7900 XTX/vmt[4096x4096]/i8->i32/Tile[4x128]/Workgroup[64x4x1]/manual_time         81.5 us         10.2 us         7080 Bytes=206.221G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time          84.9 us         18.5 us         7350 Bytes=790.845G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x32]/Workgroup[64x1x1]/manual_time          72.9 us         13.2 us         7557 Bytes=921.461G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x64]/Workgroup[64x1x1]/manual_time          72.6 us         11.4 us         7456 Bytes=925.018G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[1x128]/Workgroup[64x1x1]/manual_time         74.4 us         14.2 us         7521 Bytes=902.598G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time          75.9 us         10.1 us         7211 Bytes=885.057G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x32]/Workgroup[64x1x1]/manual_time          76.5 us         11.2 us         7392 Bytes=877.929G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x64]/Workgroup[64x1x1]/manual_time          78.3 us         14.9 us         7486 Bytes=857.515G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x128]/Workgroup[64x1x1]/manual_time         74.5 us         10.5 us         7314 Bytes=901.18G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time          73.1 us         10.9 us         7144 Bytes=918.353G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x32]/Workgroup[64x1x1]/manual_time          70.2 us         10.6 us         7656 Bytes=956.002G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x64]/Workgroup[64x1x1]/manual_time          75.3 us         10.2 us         7480 Bytes=892.07G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x128]/Workgroup[64x1x1]/manual_time         76.0 us         10.8 us         6602 Bytes=883.558G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time           105 us         17.0 us         5917 Bytes=636.596G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x32]/Workgroup[64x2x1]/manual_time          91.3 us         10.5 us         6277 Bytes=735.314G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x64]/Workgroup[64x2x1]/manual_time          92.7 us         10.2 us         6228 Bytes=724.199G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[2x128]/Workgroup[64x2x1]/manual_time          103 us         15.4 us         5939 Bytes=650.338G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time          96.8 us         10.6 us         5986 Bytes=693.685G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x32]/Workgroup[64x2x1]/manual_time          97.4 us         10.0 us         6361 Bytes=689.412G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x64]/Workgroup[64x2x1]/manual_time          93.8 us         10.4 us         6072 Bytes=716.008G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x128]/Workgroup[64x2x1]/manual_time         94.3 us         10.5 us         5966 Bytes=712.163G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time           143 us         10.2 us         4475 Bytes=468.694G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x32]/Workgroup[64x4x1]/manual_time           143 us         10.4 us         4431 Bytes=469.864G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x64]/Workgroup[64x4x1]/manual_time           148 us         11.3 us         4247 Bytes=453.095G/s
Radeon RX 7900 XTX/vmt[8192x8192]/i8->i32/Tile[4x128]/Workgroup[64x4x1]/manual_time          147 us         10.3 us         4150 Bytes=457.114G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x16]/Workgroup[64x1x1]/manual_time         351 us         10.2 us         1968 Bytes=764.448G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x32]/Workgroup[64x1x1]/manual_time         342 us         11.0 us         2047 Bytes=785.825G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x64]/Workgroup[64x1x1]/manual_time         340 us         10.9 us         2022 Bytes=790.284G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[1x128]/Workgroup[64x1x1]/manual_time        342 us         14.7 us         2001 Bytes=785.858G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x16]/Workgroup[64x1x1]/manual_time         352 us         11.2 us         1952 Bytes=762.24G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x32]/Workgroup[64x1x1]/manual_time         344 us         10.9 us         2010 Bytes=781.62G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x64]/Workgroup[64x1x1]/manual_time         345 us         11.7 us         2020 Bytes=777.404G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x128]/Workgroup[64x1x1]/manual_time        343 us         11.7 us         2026 Bytes=782.997G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x1x1]/manual_time         367 us         11.8 us         1851 Bytes=732.426G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x32]/Workgroup[64x1x1]/manual_time         357 us         10.4 us         1926 Bytes=751.48G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x64]/Workgroup[64x1x1]/manual_time         348 us         11.9 us         1985 Bytes=772.219G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x128]/Workgroup[64x1x1]/manual_time        346 us         11.0 us         2012 Bytes=776.623G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x16]/Workgroup[64x2x1]/manual_time         423 us         10.9 us         1304 Bytes=634.515G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x32]/Workgroup[64x2x1]/manual_time         416 us         10.4 us         1337 Bytes=645.222G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x64]/Workgroup[64x2x1]/manual_time         419 us         10.5 us         1225 Bytes=640.945G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[2x128]/Workgroup[64x2x1]/manual_time        425 us         10.4 us         1262 Bytes=631.165G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x2x1]/manual_time         415 us         12.5 us         1276 Bytes=646.296G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x32]/Workgroup[64x2x1]/manual_time         407 us         10.4 us         1292 Bytes=659.612G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x64]/Workgroup[64x2x1]/manual_time         418 us         10.4 us         1285 Bytes=642.076G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x128]/Workgroup[64x2x1]/manual_time        427 us         10.9 us         1336 Bytes=629.435G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x16]/Workgroup[64x4x1]/manual_time         589 us         12.8 us          883 Bytes=456.136G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x32]/Workgroup[64x4x1]/manual_time         589 us         11.2 us          881 Bytes=455.663G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x64]/Workgroup[64x4x1]/manual_time         608 us         12.1 us          886 Bytes=441.749G/s
Radeon RX 7900 XTX/vmt[16384x16384]/i8->i32/Tile[4x128]/Workgroup[64x4x1]/manual_time        628 us         11.5 us          916 Bytes=427.491G/s
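
A 128-bit load holds sixteen i8 values, which may be why the smallest K tile in this run is 16. The CPU sketch below shows one way to form a dot product over two such 128-bit chunks, each modeled as four 32-bit words. It illustrates the data layout only and is not the shader's code; `Dot16Packed` is a hypothetical name, and the little-endian byte order within each word is an assumption.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: dot product of 16 signed 8-bit values packed into
// four 32-bit words (128 bits total) per operand.
int32_t Dot16Packed(const std::array<uint32_t, 4>& a,
                    const std::array<uint32_t, 4>& b) {
  int32_t acc = 0;
  for (int word = 0; word < 4; ++word) {
    for (int byte = 0; byte < 4; ++byte) {
      const int shift = 8 * byte;
      // Extract one byte and reinterpret it as signed. The conversion is
      // two's complement on mainstream compilers (guaranteed from C++20).
      const auto x = static_cast<int8_t>((a[word] >> shift) & 0xFFu);
      const auto y = static_cast<int8_t>((b[word] >> shift) & 0xFFu);
      acc += static_cast<int32_t>(x) * static_cast<int32_t>(y);
    }
  }
  return acc;
}
```

Wider loads mean fewer memory instructions per byte fetched, which is one plausible reason the 8k results improve in the table above.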


kuhar commented Nov 2, 2023

@antiagainst @qedawkins I'm pretty happy with this implementation. Should we merge?

@qedawkins

@antiagainst @qedawkins I'm pretty happy with this implementation. Should we merge?

Works for me, can I give it a pass tomorrow first?


@qedawkins qedawkins left a comment


LGTM! Thanks for the pathfinding!

@kuhar kuhar merged commit 3049af9 into google:main Nov 2, 2023
5 checks passed

oscarbg commented Nov 7, 2023

Hi,
sorry to ask here, but what's special about RDNA3 in this test? I can't run this sample on an Nvidia 4070:

~/code/uVkCompute/build/benchmarks/vmt
./vmt_rdna3
2023-11-07T17:08:45+01:00
Running ./vmt_rdna3
Run on (32 X 5881 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x16)
L1 Instruction 32 KiB (x16)
L2 Unified 1024 KiB (x16)
L3 Unified 32768 KiB (x2)
Load Average: 8.08, 5.68, 2.31
WARNING CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
WARNING Library was built as DEBUG. Timings may be affected.
code/uVkCompute/benchmarks/vmt/vmt_main.cc:123: check error: destination buffer element (0) has incorrect value: expected to be 1404 but found -1
^ In shader: Tile[1x16], i8->i32
Abortado (`core' generado)


kuhar commented Nov 7, 2023

@oscarbg nothing as of today; you can see the GLSL compile target here: 3049af9#diff-62da6f62b4091626b341c9d8333d332aee35c053ff57cacebbb57792b987702aR30

The target name is mostly there to communicate that the benchmark has been tuned and tested on RDNA3; in the future we may add more target-specific options to the GLSL.


kuhar commented Nov 7, 2023

code/uVkCompute/benchmarks/vmt/vmt_main.cc:123: check error: destination buffer element (0) has incorrect value: expected to be 1404 but found -1
^ In shader: Tile[1x16], i8->i32
Abortado (`core' generado)

@oscarbg Also, this error indicates that one of the assumptions made in the GLSL does not hold on this target.
