[QST] CUTLASS support for fp8 sparse matrix (for W) multiplication for A*W=Y with GPU (SM90a/89) sparse tensor core #2029
Comments
The sparse tensor cores in sm90a and sm89 for fp8 operate in the format A: row + sparse x B: col + dense = C: dense. The sparse GEMM kernel is limited by this.
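For reference, here is a minimal sketch of how that constraint shows up when instantiating the sparse FP8 mainloop with the SM90 collective builder. It loosely follows the 62_hopper_sparse_gemm example; the element types, tile/cluster shapes, and the Auto stage/schedule selections are illustrative placeholders, not values taken from this thread:

```cpp
// Minimal sketch of the sparse FP8 mainloop on SM90:
// the sparse operand A must be row-major, the dense operand B column-major.
#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cute/tensor.hpp"

using ElementA           = cutlass::float_e4m3_t;   // sparse (2:4) operand, row-major
using ElementB           = cutlass::float_e4m3_t;   // dense operand, column-major
using ElementAccumulator = float;

// Illustrative tile/cluster shapes; pick what your problem needs.
using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_128>;
using ClusterShape = cute::Shape<cute::_1, cute::_2, cute::_1>;

// 128-bit aligned accesses -> 16 fp8 elements.
constexpr int AlignmentA = 128 / cutlass::sizeof_bits<ElementA>::value;
constexpr int AlignmentB = 128 / cutlass::sizeof_bits<ElementB>::value;

using SparseMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassSparseTensorOp,
    ElementA, cutlass::layout::RowMajor,    AlignmentA,   // A: row + sparse
    ElementB, cutlass::layout::ColumnMajor, AlignmentB,   // B: col + dense
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,            // placeholders: example 62
    cutlass::gemm::collective::KernelScheduleAuto         // spells out concrete choices
  >::CollectiveOp;
```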
Thanks very much. @hwu36 @klevzoff
No, we are limited by the hardware.

2. How to apply sparse matrix multiplication to speed up inference in Llama FP8 models? I am not a model guy; I cannot answer that.

A little more background: for dense fp8 GEMM, we also only support A: row x B: col. sm70 does not have sparse tensor cores. For fp16/bf16, A and B can be any combination of row and col. When I said ...
Assuming B is col-major, you can do this with the swap+transpose trick, based on the identity (A*B)^T = B^T * A^T:

```cpp
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassSparseTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    ElementAccumulator, ElementAccumulator,
    ElementC, ColumnMajor, AlignmentC,   // Note: ColumnMajor instead of RowMajor
    ElementD, ColumnMajor, AlignmentD,   // Note: ColumnMajor instead of RowMajor
    EpilogueSchedule
  >::CollectiveOp;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassSparseTensorOp,
    ElementB, RowMajor, AlignmentB,      // Note: B+RowMajor instead of A+ColumnMajor
    ElementA, ColumnMajor, AlignmentA,   // Note: A+ColumnMajor instead of B+RowMajor
    ElementAccumulator,
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    KernelSchedule
  >::CollectiveOp;
```

Correspondingly, swap the A/B pointers and strides when you construct the mainloop arguments:

```cpp
typename CollectiveMainloop::Arguments {
  ptr_B, layout_B,   // Note: B instead of A
  ptr_A, stride_A,   // Note: A instead of B
  ptr_E, layout_E
};
```

and swap M/N in the problem shape.
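For context, here is a sketch of how the swapped arguments might be assembled; `GemmKernel`, the pointer/stride names, and `alpha`/`beta` follow the usual CUTLASS 3.x argument pattern and are placeholders rather than code from this thread:

```cpp
// The kernel now computes D^T = B^T * A^T for the original D (MxN) = A (MxK) * B (KxN),
// so M and N trade places in the problem shape.
typename GemmKernel::Arguments args{
  cutlass::gemm::GemmUniversalMode::kGemm,
  {N, M, K, L},                        // Note: {N, M, K, L} instead of {M, N, K, L}
  {ptr_B, layout_B,                    // mainloop: B takes the sparse "A" slot
   ptr_A, stride_A,                    //           A takes the dense "B" slot
   ptr_E, layout_E},                   //           2:4 sparsity metadata
  {{alpha, beta},
   ptr_C, stride_C,                    // C/D were built ColumnMajor above, so the
   ptr_D, stride_D}                    // column-major D^T matches a row-major D in memory
};
```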
Check out this blog post by NeuralMagic and their Sparse-Llama model, which builds on top of these sparse matmuls.
What is your question?
In the 62_hopper_sparse_gemm example, it seems that matrix A is not a weight, and the original weight is stored in row-major order. Does fp8 matrix multiplication support sparse weights (i.e., do sparse weights support column-major storage) on SM89/SM90 GPUs?