v3.3.5 - Assembly Local Summation Splitting
Assembly
- LocalSplitU implemented. It improves performance when M*N is small by splitting a workgroup into subgroups that all work on the same tile, reducing their partial sums in local memory, and then writing the final sum to C. LocalSplitU can be combined with GlobalSplitU.
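The idea behind LocalSplitU can be sketched in plain Python. This is only a conceptual model, not the assembly kernel: the function names (`matmul_tile`, `local_split_u`), the `lsu` parameter, and the interleaved assignment of summation indices to subgroups are assumptions made for illustration, and a Python list stands in for local memory.

```python
def matmul_tile(a, b, k_indices):
    """Partial product of A (M x K) and B (K x N) over the given k indices."""
    m, n = len(a), len(b[0])
    tile = [[0.0] * n for _ in range(m)]
    for k in k_indices:
        for i in range(m):
            for j in range(n):
                tile[i][j] += a[i][k] * b[k][j]
    return tile


def local_split_u(a, b, lsu):
    """Emulate `lsu` subgroups that each sum a slice of K for the same tile."""
    k_total = len(b)
    # Assumption: subgroup s handles k = s, s + lsu, s + 2*lsu, ...
    partials = [matmul_tile(a, b, range(s, k_total, lsu)) for s in range(lsu)]
    # Stand-in for the local-memory reduction: add the partial tiles together.
    m, n = len(a), len(b[0])
    return [[sum(p[i][j] for p in partials) for j in range(n)] for i in range(m)]


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8]]
    b = [[1, 0], [0, 1], [1, 0], [0, 1]]
    # The split result matches the unsplit product over all of K.
    print(local_split_u(a, b, lsu=2) == matmul_tile(a, b, range(4)))
```

When M*N is small there are few output elements to distribute across threads, so dividing the summation dimension among subgroups is one way to keep a full workgroup busy on a single tile.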
Other
- Added deepbench.yaml so Tensile can be tuned for Baidu's DeepBench GEMM sizes.
- Alpha/beta values can now be controlled for validation.
- Fixed thread safety when looking up assembly kernels.
- Fixed the association of an assembly kernel's module with a specific device.