Releases: ROCm/Tensile
v3.3.7 Bug Fix for gfx803 ISA
Bug Fixes:
- Changed v_add_i32 to v_add_u32, as the former isn't part of the gfx803 ISA.
Removed Hard-Coded Paths and Fixed Library Logic Analyzer
Bug Fixes:
- Removed hard-coded paths (/opt/rocm/bin/) for hcc, rocm-smi, and rocm_agent_enumerator, so that Tensile works even if users have installed ROCm to an alternative path.
- In the library logic generator:
  - For benchmarks where M=N=K, there was a bug in generating the initial selection rules, since a problem size may not have any valid solution. In these cases the analyzer now moves on to the next problem size to find a winner.
  - Where different groups of solutions were benchmarked for different groups of problem sizes, the winner selector now correctly assigns a score of infinity to indicate that a solution cannot be used for a problem size it wasn't benchmarked for. This fixed the M<4 bug in rocBLAS::gemm.
  - The secondary winner was only being updated when the winner was, yet removing unimportant solutions is based on the speedup of winners vs. secondaries. This has been corrected, which is important for SolutionImportanceMin > 0.
v3.3.5 - Assembly Local Summation Splitting
Assembly
- LocalSplitU implemented. Improves performance when M*N is small by breaking a workgroup up into subgroups which work on the same tile, reducing the partial summation into local memory, then writing the resulting summation to C. LocalSplitU can be combined with GlobalSplitU.
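As a rough illustration, LocalSplitU is a solution parameter that a benchmark config can fork over. The sketch below assumes the parameter-fork schema of later Tensile configs, so exact keys may differ in this release:

```yaml
# Sketch only: fork over LocalSplitU in a benchmark step.
ForkParameters:
  - LocalSplitU: [1, 2, 4]   # 1 = off; 2/4 = subgroups share one tile's summation
  - GlobalSplitU: [1, 4]     # per these notes, LSU can be combined with GSU
BenchmarkFinalParameters:
  - ProblemSizes:
    - Exact: [64, 64, 1024]  # small M*N with a large summation, where LSU helps
```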
Other
- Added deepbench.yaml so Tensile can be tuned for Baidu's DeepBench gemm sizes.
- Alpha/beta values can be controlled for validation.
- Fixed thread safety when looking up assembly kernels.
- Fixed association of an assembly kernel's module with a specific device.
v3.2.3 - Assembly Global Summation Splitting
Assembly:
- GlobalSplitU implemented. Improves performance when M*N is too small to fill the GPU with work-groups (see the sketch after this list).
- Improved implementation of "division by invariant multiplication" so that ThreadTiles, WorkGroups, and DepthUs can all be non-powers of 2.
- ~100,000 assembly kernels are working for each of the 4 transpose cases.
- LocalSplitU is the only remaining kernel feature not implemented for sgemm.
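A hedged sketch of exercising these features from a benchmark config (GlobalSplitU, ThreadTile, and DepthU are named above; the fork schema is assumed from later Tensile configs):

```yaml
ForkParameters:
  - GlobalSplitU: [1, 2, 8]  # split the summation across work-groups
  - ThreadTile:
    - [6, 4]                 # thread tiles need not be powers of 2
  - DepthU: [24]             # unroll depth need not be a power of 2
```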
Other:
- DepthU < 0 now represents special key values, e.g., that each thread loads at least 1 vector from global memory.
- The SleepPercent global parameter causes benchmarking to sleep after each data point, giving the GPU time to cool off, preventing overheating, and making the benchmarking process more uniform. SleepPercent = 200 means the sleep time will be kernel_time * 200%.
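For example (SleepPercent is named above; its placement under GlobalParameters is assumed):

```yaml
GlobalParameters:
  SleepPercent: 200   # after each data point, sleep for kernel_time * 200%
```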
v3.2.0 - Assembly for gfx803 and gfx900 sgemm
Assembly Kernels:
- Can produce assembly kernels for AMD architectures gfx803 and gfx900. The assembly generator supports single precision gemm and batched gemm.
- GlobalSplitU and LocalSplitU are the only sgemm features not yet implemented.
- On both architectures, these kernels can achieve 94% efficiency for a few problem sizes and 90% efficiency for many problem sizes.
- See Tensile/Configs/sgemm_asm.yaml for examples.
- Source and assembly kernels can be benchmarked with the solution parameter KernelLanguage: ["Source", "Assembly"].
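A minimal sketch of such a side-by-side fork (the KernelLanguage values are quoted from the item above; the surrounding schema is assumed from later Tensile configs):

```yaml
ForkParameters:
  - KernelLanguage: ["Source", "Assembly"]  # benchmark both implementations
```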
Bug Fixes:
- Prior versions had bugs handling N-dimensional tensors; this release once again supports tensors of any dimension. See Tensile/Configs/tensor_contraction.yaml and convolution.yaml for examples.
Other:
- The benchmarking clients accept more command-line parameters for easier use.
- Clients also print the GPU's current clock speeds and temperature, to verify that all benchmarking is performed consistently.
v3.0.4 - Fixed NaN propagation
When Beta==0, kernels now write to the C tensor without reading from it. Because 0 * NaN = NaN in floating-point arithmetic, scaling uninitialized or NaN-containing C values by beta would otherwise propagate NaNs into the output.
v3.0.0 - GlobalSplitU and Improved Benchmarking / Library Logic
GlobalSplitU: On top of LocalSplitU, Tensile now supports splitting up the summation between work-groups. This option requires a beta-only kernel followed by a gemm kernel that uses atomic compare-and-swap to accumulate results in global memory. This feature increases the number of work-groups while maintaining tile size, with the drawback of slower accumulation through global memory.
Improved Benchmarking / Library Logic:
- Users can perform multiple benchmark runs for a single problem type; this allows tuning multiple groups of problem sizes.
- Users can specify multiple problem-size ranges as well as exact sizes for which to do training and logic generation (see the sketch after this list).
- Users can label a benchmark with a schedule name and a list of devices which the schedule supports; Tensile will choose the solution schedule based on the device.
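A sketch of mixing size ranges and exact sizes in one benchmark (the ProblemSizes syntax here is assumed from later Tensile configs):

```yaml
BenchmarkFinalParameters:
  - ProblemSizes:
    - Range: [ [64, 64, 1024], [64, 64, 1024], [256, 256, 4096] ]  # [min, step, max] per index
    - Exact: [5760, 5760, 5760]                                    # one exact M, N, K
```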
Semantic Versioning: Users can specify a minimum Tensile version in YAML files to guarantee support and compatibility.
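For instance (the MinimumRequiredVersion spelling is assumed from later Tensile configs):

```yaml
GlobalParameters:
  MinimumRequiredVersion: 3.0.0   # refuse to run this config on older Tensile
```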
Expanded Work-Group and Thread-Tile Sizes: Users can explicitly specify work-group sizes and thread-tile sizes that are not powers of 2 (they don't even need to be even).
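A hedged sketch (WorkGroup and ThreadTile spellings assumed from later Tensile configs, where WorkGroup's third entry is LocalSplitU):

```yaml
ForkParameters:
  - WorkGroup:
    - [16, 12, 1]   # 16x12 threads: neither dimension is a power of 2
  - ThreadTile:
    - [5, 3]        # an odd, non-power-of-2 thread tile
```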
Maximum Occupancy: For problem sizes or strides known to thrash the GPU caches, users can manually lower the occupancy of the work-groups to try to improve performance.
v2.4.5 - Prefetching and Half-Precision
Prefetch Global -> Local
Issues loads from global memory into LDS one full iteration in advance. This uses double the LDS but hides global memory latency better (both prefetch stages are forked in the sketch below).
Prefetch Local -> Registers
Issues loads from LDS into registers one unrolled iteration in advance. This uses several extra registers but hides LDS latency better.
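The two prefetch stages can be toggled independently when benchmarking; a sketch with boolean parameter names assumed from later Tensile configs:

```yaml
ForkParameters:
  - PrefetchGlobalRead: [False, True]  # global memory -> LDS, one full iteration ahead
  - PrefetchLocalRead: [False, True]   # LDS -> registers, one unrolled iteration ahead
```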
Half-Precision
"half"/__fp16 is now a supported data type.
TensileBenchmarkLibraryClient.py
This Python script takes a library client executable and a CSV file of problem sizes as inputs, and runs the executable on those sizes.
v2.3.0 - Short-Vectors and Pointer-Shifting
Short-Vectors:
The kernels can now operate on float2* or float4* pointers, which makes reads and writes to memory denser and requires fewer registers to store addresses. When reading/writing vectors while transposing a matrix, Tensile can handle reading vectors and writing components, or reading components and writing vectors.
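A sketch of benchmarking the vector width (the VectorWidth parameter name is assumed from later Tensile configs; this release describes the feature in terms of float2*/float4* pointers):

```yaml
ForkParameters:
  - VectorWidth: [1, 2, 4]   # 4 corresponds to float4* reads/writes where legal
```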
Pointer-Shifting:
Rather than using branches to guard against reading out of bounds, Tensile can now shift the read pointers back into bounds before the main summation loop, then reorganize the accumulation registers after the main loop before writing the results. This protects against out-of-bounds reads when tensor sizes are not exact multiples of the kernel tile sizes, without putting branch code in the main summation loop.
Others:
- Kernels have more flexibility as to which threads are assigned to load which elements from global memory.
- The benchmarking protocol can handle benchmarking a single kernel configuration.
- Library logic analysis can generate a library backend from a single data point, i.e., the library will consist of the single fastest kernel at that data point.
- Fixed a library logic analysis bug for the case where a single solution is fastest for all data points.
v2.2.3 - SplitU and WorkGroupMapping
SplitU
If you have a large summation but a small C tensor, you can create extra parallelism by splitting up the summation; this allows smaller C tensors to fill up larger GPUs.
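A sketch of forking SplitU (the parameter named by this release; the surrounding fork schema is assumed from later Tensile configs):

```yaml
ForkParameters:
  - SplitU: [1, 4, 16]   # split the summation 4 or 16 ways when C is small
```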
WorkGroupMapping
Changes which work-groups operate on which tiles of tensor C. This can improve performance through better caching.
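A sketch of tuning the mapping (WorkGroupMapping is named by this release; the fork schema is assumed from later Tensile configs):

```yaml
ForkParameters:
  - WorkGroupMapping: [1, 8]   # 8 = walk C tiles in bands of 8 work-groups for cache reuse
```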