Launch error at 4090 for sageattn_qk_int8_pv_fp8_cuda #61

Open
Andy0422 opened this issue Dec 5, 2024 · 13 comments
Labels
bug Something isn't working

Comments

@Andy0422

Andy0422 commented Dec 5, 2024

Hi,

I think the 4090 supports fp8 2D, so why do I get the following error? Thanks.

Exception has occurred: RuntimeError
CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
File "/home/wei.zhao/SageAttention/example/sageattn_cogvideo.py", line 27, in <module>
video = pipe(
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
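
The error text recommends `CUDA_LAUNCH_BLOCKING=1`. A minimal sketch of applying it from inside the script follows; the helper name `enable_blocking_launches` is hypothetical, and the variable has to be set before CUDA is initialized (most safely before `import torch`) to have any effect.

```python
import os


def enable_blocking_launches(env=os.environ):
    """Request synchronous kernel launches so errors surface at the failing call.

    Hypothetical helper: it only sets the environment variable named in the
    error message. Call it before anything initializes CUDA.
    """
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


enable_blocking_launches()
# import torch  # import only after the variable has been set
```

With launches made synchronous, the traceback should point at the kernel call that actually failed rather than at a later, unrelated operation.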

@Andy0422
Author

Andy0422 commented Dec 5, 2024

The L40S hits the same error.

@Andy0422
Author

Andy0422 commented Dec 5, 2024

Another question: for the "sm90" architecture (e.g. H100), why do you dispatch to the sageattn_qk_int8_pv_fp16_cuda kernel? The H100 also has strong fp8 2D capability.

@jt-zhang
Member

jt-zhang commented Dec 6, 2024

Sorry, I cannot reproduce this error.
By the way, if you want to use FP8, please ensure that the CUDA version is >= 12.4.
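
The version requirement above can be checked programmatically. This is a minimal sketch, assuming the relevant number is the CUDA version PyTorch was built with (`torch.version.cuda`); the helper `cuda_version_at_least` is hypothetical. Note that the "CUDA Version" printed by nvidia-smi is the highest version the driver supports, which can differ from the toolkit actually used for the build.

```python
def cuda_version_at_least(version, minimum=(12, 4)):
    """Return True if a version string such as "12.6" is >= `minimum`."""
    if version is None:  # a CPU-only PyTorch build reports no CUDA version
        return False
    parts = tuple(int(p) for p in version.split(".")[:2])
    parts = parts + (0,) * (2 - len(parts))  # treat "12" as 12.0
    return parts >= minimum


print(cuda_version_at_least("12.6"))  # True
print(cuda_version_at_least("12.1"))  # False
```

On a real machine the argument would be `torch.version.cuda` rather than a literal string.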

@Andy0422
Author

Andy0422 commented Dec 6, 2024

> Sorry, I cannot reproduce this error. By the way, if you want to use FP8, please ensure that the CUDA version is >= 12.4.

Thank you for your kind reply. My CUDA version is 12.6.

My code is below; could you please take a look?

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "3"

import time

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video, export_to_gif
from sageattention import sageattn, sageattn_qk_int8_pv_fp16_triton, sageattn_qk_int8_pv_fp16_cuda, sageattn_qk_int8_pv_fp8_cuda
import torch.nn.functional as F

F.scaled_dot_product_attention = sageattn_qk_int8_pv_fp8_cuda

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "/home/dataset/SD/cogvideox-2b",
    torch_dtype=torch.float16
).to("cuda")

pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

start = time.time()
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
print("sage attn timing = ", time.time() - start)

export_to_video(video, "output/output.mp4", fps=8)
export_to_gif(video, "output/output.gif", fps=8)
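
The script above replaces `F.scaled_dot_product_attention` globally, so every caller goes through the new kernel regardless of its arguments. A more defensive variant is sketched below. It assumes the replacement kernel handles only `q`, `k`, `v` and `is_causal`, and routes any call that requests a mask or dropout back to the original function; the helper name `with_fallback` is hypothetical and not part of any library.

```python
import functools


def with_fallback(original, replacement):
    """Wrap `replacement` so that calls it may not support go to `original`.

    Assumption: the replacement handles only q, k, v and `is_causal`;
    a mask, dropout, or any extra keyword is sent to the original function.
    """

    @functools.wraps(original)
    def patched(query, key, value, attn_mask=None, dropout_p=0.0,
                is_causal=False, **kwargs):
        if attn_mask is not None or dropout_p != 0.0 or kwargs:
            return original(query, key, value, attn_mask=attn_mask,
                            dropout_p=dropout_p, is_causal=is_causal, **kwargs)
        return replacement(query, key, value, is_causal=is_causal)

    return patched
```

In the script this would be applied as `F.scaled_dot_product_attention = with_fallback(F.scaled_dot_product_attention, sageattn_qk_int8_pv_fp8_cuda)`. It does not address the launch failure itself; it only narrows which calls reach the custom kernel.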

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:3C:00.0 Off | 0 |
| N/A 29C P0 69W / 250W | 6917MiB / 40960MiB | 11% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A100-PCIE-40GB On | 00000000:40:00.0 Off | 0 |
| N/A 28C P0 62W / 250W | 21005MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA H100 PCIe On | 00000000:CC:00.0 Off | 0 |
| N/A 32C P0 83W / 350W | 11544MiB / 81559MiB | 0% Default |
| | | Disabled |

@jason-huang03
Member

I suspect this is the problem mentioned in #50. Please run the code on device 0 again. We will merge that PR as soon as possible.

@jason-huang03
Member

jason-huang03 commented Dec 6, 2024

@Andy0422
Also, your environment contains two different GPU types. Our compilation script might not support this configuration at present. Perhaps the FP8 kernel was not compiled because of the A100, so an error occurs when the code runs on the H100.
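
A quick way to confirm whether a machine mixes GPU architectures is to group the devices by compute capability. The sketch below uses a hypothetical helper, `group_by_arch`, and a hard-coded device list mirroring the first machine in this thread (A100 is sm80, H100 is sm90).

```python
from collections import defaultdict


def group_by_arch(devices):
    """Group device names by compute capability, e.g. (8, 0) -> "sm80"."""
    groups = defaultdict(list)
    for name, (major, minor) in devices:
        groups[f"sm{major}{minor}"].append(name)
    return dict(groups)


# Example input mirroring the first nvidia-smi listing in this thread.
devices = [
    ("NVIDIA A100-PCIE-40GB", (8, 0)),
    ("NVIDIA A100-PCIE-40GB", (8, 0)),
    ("NVIDIA H100 PCIe", (9, 0)),
]
groups = group_by_arch(devices)
if len(groups) > 1:
    print("Mixed GPU architectures:", sorted(groups))
# prints: Mixed GPU architectures: ['sm80', 'sm90']
```

On a real machine the list can be built with `torch.cuda.device_count()`, `torch.cuda.get_device_name(i)` and `torch.cuda.get_device_capability(i)`.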

@jason-huang03
Member

@Andy0422 On the H100, the mma instruction for FP8 performs poorly, so we use FP16, which is more accurate and runs at roughly the same speed. We are working on an H100 kernel that uses wgmma, which can offer a real speedup.
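
The architecture-based choice described here can be pictured as a lookup from compute capability to kernel name. The table below is only an illustration and not the library's actual dispatch code: the sm90 entry comes from the reply above, and the sm89 entry is inferred from the issue title. The function `pick_kernel` is hypothetical.

```python
# Illustrative table only: entries reflect what this thread reports,
# not the library's real dispatch logic.
KERNEL_BY_ARCH = {
    "sm89": "sageattn_qk_int8_pv_fp8_cuda",   # RTX 4090 class, inferred from the issue title
    "sm90": "sageattn_qk_int8_pv_fp16_cuda",  # H100, per the maintainers' reply
}


def pick_kernel(capability):
    """Return the kernel name listed for a (major, minor) compute capability."""
    major, minor = capability
    arch = f"sm{major}{minor}"
    try:
        return KERNEL_BY_ARCH[arch]
    except KeyError:
        raise ValueError(f"no kernel listed for {arch}") from None


print(pick_kernel((9, 0)))  # sageattn_qk_int8_pv_fp16_cuda
```

Raising on an unknown architecture, rather than silently picking a default, makes a missing or uncompiled kernel visible before any launch is attempted.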

@Andy0422
Author

Andy0422 commented Dec 6, 2024

> I suspect this is the problem mentioned in #50. Please run the code on device 0 again. We will merge that PR as soon as possible.

I see. It runs on the H100 now, but it is very slow.

@Andy0422
Author

Andy0422 commented Dec 6, 2024

@jason-huang03
Another problem when running on the 4090:
Traceback (most recent call last):
File "/home/wei.zhao/SageAttention/example/sageattn_cogvideo.py", line 27, in <module>
video = pipe(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 684, in __call__
noise_pred = self.transformer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 473, in forward
hidden_states, encoder_hidden_states = block(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 132, in forward
attn_hidden_states, attn_encoder_hidden_states = self.attn1(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 495, in forward
return self.processor(
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 1954, in __call__
hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Fri Dec 6 11:58:12 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L40S On | 00000000:67:00.0 Off | 0 |
| N/A 25C P8 32W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A10 On | 00000000:E5:00.0 Off | 0 |
| 0% 26C P8 15W / 150W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA A10 On | 00000000:E6:00.0 Off | 0 |
| 0% 26C P8 15W / 150W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 On | 00000000:E8:00.0 Off | 0 |
| 45% 26C P8 22W / 450W | 1MiB / 23028MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

@jason-huang03
Member

We will look into making our compilation script compatible with machines that have multiple GPU types.

jason-huang03 added the bug label on Dec 11, 2024
@Andy0422
Author

@jason-huang03
Hi, I switched to a machine whose GPUs are all the same type to test the FP8 kernel, but the bug still seems to occur. See the following:

CUDA QK Int8 PV FP8
batch: 4, head: 32, headdim: 128, pv_accum_dtype: fp32
is_causal: False
Traceback (most recent call last):
File "/home/wei.zhao/SageAttention/bench/bench_qk_int8_pv_fp8_cuda.py", line 56, in <module>
torch.cuda.synchronize()
File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 952, in synchronize
return torch._C._cuda_synchronize()
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

The machine has 8 L20 GPUs, and I selected one of them to run the code. Looking forward to your reply.

Wed Dec 18 03:39:18 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L20 On | 00000000:67:00.0 Off | 0 |
| N/A 27C P8 49W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L20 On | 00000000:68:00.0 Off | 0 |
| N/A 27C P8 36W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L20 On | 00000000:6C:00.0 Off | 0 |
| N/A 28C P0 50W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L20 On | 00000000:6D:00.0 Off | 0 |
| N/A 28C P8 35W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA L20 On | 00000000:E5:00.0 Off | 0 |
| N/A 28C P0 49W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA L20 On | 00000000:E6:00.0 Off | 0 |
| N/A 27C P0 36W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA L20 On | 00000000:E7:00.0 Off | 0 |
| N/A 26C P8 34W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA L20 On | 00000000:E8:00.0 Off | 0 |
| N/A 26C P0 33W / 350W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

@jason-huang03
Member

@Andy0422 Can you try setting CUDA_VISIBLE_DEVICES=0 before running your script?
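
When `CUDA_VISIBLE_DEVICES` is set, the process sees only the listed GPUs, renumbered from 0. In the earlier script, which sets the variable to "3", `cuda:0` therefore refers to physical GPU 3. The sketch below makes that mapping explicit; the helper `visible_devices` is hypothetical.

```python
def visible_devices(value):
    """Map logical CUDA indices to the physical IDs listed in CUDA_VISIBLE_DEVICES."""
    ids = [part.strip() for part in value.split(",") if part.strip()]
    return dict(enumerate(ids))


print(visible_devices("3"))    # {0: '3'}
print(visible_devices("0,2"))  # {0: '0', 1: '2'}
```

The variable takes effect only if it is set before CUDA is initialized, either on the command line (`CUDA_VISIBLE_DEVICES=0 python example/sageattn_cogvideo.py`) or through `os.environ` before `import torch`.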

@Andy0422
Author

Andy0422 commented Dec 18, 2024

@jason-huang03

Hi, is there any update on this issue? Is it a bug or a problem with my setup? Cheers!

> @Andy0422 Can you try setting CUDA_VISIBLE_DEVICES=0 before running your script?

@jason-huang03

That doesn't work. Here is my script:

import os

os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
from sageattention import sageattn, sageattn_qk_int8_pv_fp16_triton, sageattn_qk_int8_pv_fp8_cuda
import torch.nn.functional as F

F.scaled_dot_product_attention = sageattn

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "/home/pretrained_models/CogVideoX-5b",
    torch_dtype=torch.bfloat16
).to("cuda")

pipe.vae.enable_slicing()
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

###############################

/home/wei.zhao/SageAttention/sageattention/core.py:634: UserWarning: pv_accum_dtype is 'fp32+fp32', smooth_v will be ignored.
warnings.warn("pv_accum_dtype is 'fp32+fp32', smooth_v will be ignored.")
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/wei.zhao/SageAttention/example/sageattn_cogvideo.py", line 24, in <module>
video = pipe(
File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 710, in __call__
noise_pred = self.transformer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 503, in forward
hidden_states, encoder_hidden_states = block(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 132, in forward
attn_hidden_states, attn_encoder_hidden_states = self.attn1(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 588, in forward
return self.processor(
File "/usr/local/lib/python3.10/dist-packages/diffusers/models/attention_processor.py", line 2714, in __call__
hidden_states = hidden_states.transpose(1, 2).reshape(batch_size, -1, attn.heads * head_dim)
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.


3 participants