Releases: ROCm/Tensile
v3.3.7 Bug Fix for gfx803 ISA
Bug Fixes:
- Changed v_add_i32 to v_add_u32, as the former isn't part of the gfx803 ISA.
Removed Hard-Coded Paths and Fixed Library Logic Analyzer
Bug Fixes:
- Removed hard-coded paths (/opt/rocm/bin/) for hcc, rocm-smi, and rocm_agent_enumerator, so that Tensile works even if users have installed ROCm to an alternative path.
- In the library logic generator:
  - For benchmarks where M=N=K, there was a bug in generating the initial selection rules, since a problem size may not have any valid solution. In these cases the analyzer now moves on to the next problem size to find a winner.
  - Where different groups of solutions were benchmarked for different groups of problem sizes, the winner selector now correctly assigns a score of infinity to indicate that a solution cannot be used for a problem size it wasn't benchmarked for. This fixed the M<4 bug in rocBLAS::gemm.
  - The secondary winner was only being updated when the winner was, yet removing unimportant solutions is based on the speedup of winners vs. secondaries. This has been corrected, which is important for SolutionImportanceMin > 0.
v3.3.5 - Assembly Local Summation Splitting
Assembly
- LocalSplitU implemented. Improves performance when M*N is small by breaking a workgroup up into subgroups which work on the same tile, reducing the partial summation into local memory, then writing the resulting summation to C. LocalSplitU can be combined with GlobalSplitU.
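As a rough illustration, LocalSplitU is a solution parameter that a benchmark config can fork over. The sketch below assumes the parameter-fork schema of later Tensile configs, so exact keys may differ in this release:

```yaml
# Sketch only: fork over LocalSplitU in a benchmark step.
ForkParameters:
  - LocalSplitU: [1, 2, 4]   # 1 = off; 2/4 = subgroups share one tile's summation
  - GlobalSplitU: [1, 4]     # per these notes, LSU can be combined with GSU
BenchmarkFinalParameters:
  - ProblemSizes:
    - Exact: [64, 64, 1024]  # small M*N with a large summation, where LSU helps
```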
Other
- Added deepbench.yaml so Tensile can be tuned for Baidu's DeepBench gemm sizes.
- Alpha/beta values can be controlled for validation.
- Fixed thread safety when looking up assembly kernels.
- Fixed association of an assembly kernel's module with a specific device.
v3.2.3 - Assembly Global Summation Splitting
Assembly:
- GlobalSplitU implemented. Improves performance when M*N is too small to fill the GPU with work-groups (see the sketch after this list).
- Improved implementation of "division by invariant multiplication" so that ThreadTiles, WorkGroups, and DepthUs can all be non-powers of 2.
- ~100,000 assembly kernels are working for each of the 4 transpose cases.
- LocalSplitU is the only remaining kernel feature not implemented for sgemm.
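A hedged sketch of exercising these features from a benchmark config (GlobalSplitU, ThreadTile, and DepthU are named above; the fork schema is assumed from later Tensile configs):

```yaml
ForkParameters:
  - GlobalSplitU: [1, 2, 8]  # split the summation across work-groups
  - ThreadTile:
    - [6, 4]                 # thread tiles need not be powers of 2
  - DepthU: [24]             # unroll depth need not be a power of 2
```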
Other:
- DepthU < 0 now represents special key values, e.g., that each thread loads at least 1 vector from global memory.
- The SleepPercent global parameter causes benchmarking to sleep after each data point, giving the GPU time to cool off, preventing overheating, and making the benchmarking process more uniform. SleepPercent = 200 means the sleep time will be kernel_time * 200%.
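For example (SleepPercent is named above; its placement under GlobalParameters is assumed):

```yaml
GlobalParameters:
  SleepPercent: 200   # after each data point, sleep for kernel_time * 200%
```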
v3.2.0 - Assembly for gfx803 and gfx900 sgemm
Assembly Kernels:
- Can produce assembly kernels for AMD architectures gfx803 and gfx900. The assembly generator supports single precision gemm and batched gemm.
- GlobalSplitU and LocalSplitU are the only sgemm features not yet implemented.
- On both architectures, these kernels can achieve 94% efficiency for a few problem sizes and 90% efficiency for many problem sizes.
- See Tensile/Configs/sgemm_asm.yaml for examples.
- Source and assembly kernels can be benchmarked with the solution parameter KernelLanguage: ["Source", "Assembly"].
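A minimal sketch of such a side-by-side fork (the KernelLanguage values are quoted from the item above; the surrounding schema is assumed from later Tensile configs):

```yaml
ForkParameters:
  - KernelLanguage: ["Source", "Assembly"]  # benchmark both implementations
```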
Bug Fixes:
- Prior versions had bugs handling N-dimensional tensors; this release once again supports tensors of any dimension. See Tensile/Configs/tensor_contraction.yaml and convolution.yaml for examples.
Other:
- The benchmarking clients accept more command-line parameters for easier use.
- Clients also print the GPU's current clock speeds and temperature, to verify that all benchmarking is performed consistently.
v3.0.4 - Fixed NaN propagation
When Beta==0, kernels now write to the C tensor without reading from it. Because 0 * NaN = NaN in floating-point arithmetic, scaling uninitialized or NaN-containing C values by beta would otherwise propagate NaNs into the output.
v3.0.0 - GlobalSplitU and Improved Benchmarking / Library Logic
GlobalSplitU: On top of LocalSplitU, Tensile now supports splitting up the summation between work-groups. This option requires a beta-only kernel followed by a gemm kernel that uses atomic compare-and-swap to accumulate results in global memory. This feature increases the number of work-groups while maintaining tile size, with the drawback of slower accumulation through global memory.
Improved Benchmarking / Library Logic:
- Users can perform multiple benchmark runs for a single problem type; this allows tuning multiple groups of problem sizes.
- Users can specify multiple problem-size ranges as well as exact sizes for which to do training and logic generation (see the sketch after this list).
- Users can label a benchmark with a schedule name and a list of devices which the schedule supports; Tensile will choose the solution schedule based on the device.
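A sketch of mixing size ranges and exact sizes in one benchmark (the ProblemSizes syntax here is assumed from later Tensile configs):

```yaml
BenchmarkFinalParameters:
  - ProblemSizes:
    - Range: [ [64, 64, 1024], [64, 64, 1024], [256, 256, 4096] ]  # [min, step, max] per index
    - Exact: [5760, 5760, 5760]                                    # one exact M, N, K
```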
Semantic Versioning: Users can specify a minimum Tensile version in YAML files to guarantee support and compatibility.
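For instance (the MinimumRequiredVersion spelling is assumed from later Tensile configs):

```yaml
GlobalParameters:
  MinimumRequiredVersion: 3.0.0   # refuse to run this config on older Tensile
```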
Expanded Work-Group and Thread-Tile Sizes: Users can explicitly specify work-group sizes and thread-tile sizes that are not powers of 2 (they don't even need to be even).
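A hedged sketch (WorkGroup and ThreadTile spellings assumed from later Tensile configs, where WorkGroup's third entry is LocalSplitU):

```yaml
ForkParameters:
  - WorkGroup:
    - [16, 12, 1]   # 16x12 threads: neither dimension is a power of 2
  - ThreadTile:
    - [5, 3]        # an odd, non-power-of-2 thread tile
```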
Maximum Occupancy: For problem sizes or strides known to thrash the GPU caches, users can manually lower the occupancy of the work-groups to try to improve performance.
v2.4.5 - Prefetching and Half-Precision
Prefetch Global -> Local
Issues loads from global memory into LDS one full iteration in advance. This uses double the LDS but hides global memory latency better (both prefetch stages are forked in the sketch below).
Prefetch Local -> Registers
Issues loads from LDS into registers one unrolled iteration in advance. This uses several extra registers but hides LDS latency better.
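The two prefetch stages can be toggled independently when benchmarking; a sketch with boolean parameter names assumed from later Tensile configs:

```yaml
ForkParameters:
  - PrefetchGlobalRead: [False, True]  # global memory -> LDS, one full iteration ahead
  - PrefetchLocalRead: [False, True]   # LDS -> registers, one unrolled iteration ahead
```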
Half-Precision
"half"/__fp16 is now a supported data type.
TensileBenchmarkLibraryClient.py
This Python script takes a library client executable and a CSV file of problem sizes as inputs, and runs the executable on those sizes.
v2.3.0 - Short-Vectors and Pointer-Shifting
Short-Vectors:
The kernels can now operate on float2* or float4* pointers, which makes reads and writes to memory denser and requires fewer registers to store addresses. When reading/writing vectors while transposing a matrix, Tensile can handle reading vectors and writing components, or reading components and writing vectors.
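A sketch of benchmarking the vector width (the VectorWidth parameter name is assumed from later Tensile configs; this release describes the feature in terms of float2*/float4* pointers):

```yaml
ForkParameters:
  - VectorWidth: [1, 2, 4]   # 4 corresponds to float4* reads/writes where legal
```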
Pointer-Shifting:
Rather than using branches to guard against reading out of bounds, Tensile can now shift the read pointers back into bounds before the main summation loop, then reorganize the accumulation registers after the main loop before writing the results. This protects against out-of-bounds reads when tensor sizes are not exact multiples of the kernel tile sizes, without putting branch code in the main summation loop.
Others:
- Kernels have more flexibility as to which threads are assigned to load which elements from global memory.
- The benchmarking protocol can handle benchmarking a single kernel configuration.
- Library logic analysis can generate a library backend from a single data point, i.e., the library will consist of the single fastest kernel at that data point.
- Fixed a library logic analysis bug for the case where a single solution is fastest for all data points.
v2.2.3 - SplitU and WorkGroupMapping
SplitU
If you have a large summation but a small C tensor, you can create extra parallelism by splitting up the summation; this allows smaller C tensors to fill up larger GPUs.
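A sketch of forking SplitU (the parameter named by this release; the surrounding fork schema is assumed from later Tensile configs):

```yaml
ForkParameters:
  - SplitU: [1, 4, 16]   # split the summation 4 or 16 ways when C is small
```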
WorkGroupMapping
Changes which work-groups operate on which tiles of tensor C. This can improve performance through better caching.
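A sketch of tuning the mapping (WorkGroupMapping is named by this release; the fork schema is assumed from later Tensile configs):

```yaml
ForkParameters:
  - WorkGroupMapping: [1, 8]   # 8 = walk C tiles in bands of 8 work-groups for cache reuse
```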