v3.3.5 - Assembly Local Summation Splitting

guacamoleo released this 01 Nov 19:22
· 4258 commits to master since this release
022743c

Assembly

  • LocalSplitU implemented. Improves performance when M*N is small by splitting a workgroup into subgroups that all work on the same tile. Each subgroup computes a partial summation, the partial sums are reduced in local memory, and the final sum is written to C. LocalSplitU can be combined with GlobalSplitU.
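
The idea behind LocalSplitU can be illustrated with a small CPU-side sketch. This is not Tensile's assembly implementation: the function names, the `local_split_u` parameter, and the interleaved assignment of summation indices to subgroups are all assumptions made for illustration, and the `partial` list only stands in for local memory.

```python
def gemm_tile_reference(a, b, m, n, k):
    """Reference tile: C[i][j] = sum over l of A[i][l] * B[l][j]."""
    return [[sum(a[i][l] * b[l][j] for l in range(k)) for j in range(n)]
            for i in range(m)]


def gemm_tile_local_split_u(a, b, m, n, k, local_split_u):
    """Emulate splitting the summation dimension across subgroups.

    Hypothetical sketch: subgroup s handles summation indices
    s, s + local_split_u, s + 2 * local_split_u, ... (an assumed layout).
    """
    # Stand-in for local memory: one partial tile per subgroup.
    partial = [[[0.0] * n for _ in range(m)] for _ in range(local_split_u)]

    # Each subgroup accumulates its share of the summation for the same tile.
    for s in range(local_split_u):
        for l in range(s, k, local_split_u):
            for i in range(m):
                for j in range(n):
                    partial[s][i][j] += a[i][l] * b[l][j]

    # Reduce the partial tiles, then "write" the result to C.
    c = [[0.0] * n for _ in range(m)]
    for s in range(local_split_u):
        for i in range(m):
            for j in range(n):
                c[i][j] += partial[s][i][j]
    return c
```

With a small tile (for example 2x2 with a summation length of 4), calling `gemm_tile_local_split_u` with `local_split_u` set to 1, 2, or 4 should give the same tile as `gemm_tile_reference`. Only the amount of summation work per subgroup changes, which is why the split helps when M*N alone offers too little parallelism.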

Other

  • Added deepbench.yaml so Tensile can be tuned for Baidu's DeepBench gemm sizes.
  • Alpha/beta values can be controlled for validation.
  • Fixed thread safety when looking up assembly kernels.
  • Fixed the association of an assembly kernel's module with a specific device.