add Flux.1 4xH100 performance (#369)
feifeibear authored Nov 28, 2024
1 parent ca94011 commit 8275240
Showing 3 changed files with 52 additions and 4 deletions.
1 change: 1 addition & 0 deletions README.md
@@ -93,6 +93,7 @@ Furthermore, xDiT incorporates optimization techniques from [DiTFastAttn](https:

<h2 id="updates">📢 Updates</h2>

* 🎉**November 28, 2024**: xDiT achieves 1.6 sec end-to-end latency for 28-step [Flux.1-Dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) inference on 4xH100!
* 🎉**November 20, 2024**: xDiT supports [CogVideoX-1.5](https://huggingface.co/THUDM/CogVideoX1.5-5B) and achieves a 6.12x speedup compared to the implementation in diffusers!
* 🎉**November 11, 2024**: xDiT has been applied to [mochi-1](https://github.com/xdit-project/mochi-xdit) and achieves a 3.54x speedup compared to the official open-source implementation!
* 🎉**October 10, 2024**: xDiT applied DiTFastAttn to accelerate single GPU inference for Pixart Models!
28 changes: 26 additions & 2 deletions docs/performance/flux.md
@@ -16,6 +16,31 @@ Since Flux.1 does not utilize Classifier-Free Guidance (CFG), it is not compatib

We conducted performance benchmarking using FLUX.1 [dev] with 28 diffusion steps.

The table below shows the latency (in seconds) of different USP strategies on 4xH100. Thanks to the H100's excellent NVLink bandwidth, USP is a better fit here than PipeFusion.
torch.compile optimization is crucial on H100, achieving a 2.6x speedup on 4xH100.
On 2xH100, Ring achieves the lowest latency, while on 4xH100, Ulysses performs best. The hybrid-SP strategy Ulysses-2 x Ring-2 performs slightly worse than Ulysses-4 on 4xH100.
The speedup on 4xH100 compared to a single H100 is 2.63x.

<div align="center">

| Configuration | PyTorch (Sec) | torch.compile (Sec) |
|--------------|---------|---------|
| 1 GPU | 6.71 | 4.30 |
| Ulysses-2 | 4.38 | 2.68 |
| Ring-2 | 5.31 | 2.60 |
| Ulysses-2 x Ring-2 | 5.19 | 1.80 |
| Ulysses-4 | 4.24 | 1.63 |
| Ring-4 | 5.11 | 1.98 |

</div>
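The quoted speedups can be sanity-checked directly from the table; a minimal sketch, with the numbers copied from the table above:

```python
# Latency in seconds from the 4xH100 table above: (PyTorch, torch.compile).
latency = {
    "1 GPU": (6.71, 4.30),
    "Ulysses-2": (4.38, 2.68),
    "Ring-2": (5.31, 2.60),
    "Ulysses-2 x Ring-2": (5.19, 1.80),
    "Ulysses-4": (4.24, 1.63),
    "Ring-4": (5.11, 1.98),
}

# torch.compile speedup on the best 4-GPU config (Ulysses-4): 4.24 / 1.63 ≈ 2.6x.
compile_speedup = latency["Ulysses-4"][0] / latency["Ulysses-4"][1]

# 4xH100 vs. a single H100, both with torch.compile: 4.30 / 1.63 ≈ 2.6x.
scaling_speedup = latency["1 GPU"][1] / latency["Ulysses-4"][1]

print(f"torch.compile speedup: {compile_speedup:.2f}x")  # → 2.60x
print(f"4xH100 vs 1xH100 speedup: {scaling_speedup:.2f}x")  # → 2.64x
```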

The figure below shows the latency metrics of Flux.1-dev on 4xH100. xDiT successfully generates a 1024px image in 1.6 seconds!

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/flux/Flux-1K-H100.png"
alt="scalability-flux_h100">
</div>
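For reference, a 4xH100 run of the Ulysses-4 configuration might be launched as follows; the script path and flag names mirror the xDiT examples but should be treated as assumptions, not exact invocations:

```shell
# Hypothetical launch (flag names assumed): 4 GPUs, Ulysses degree 4,
# 28 diffusion steps, 1024px output, with torch.compile enabled.
torchrun --nproc_per_node=4 examples/flux_example.py \
    --model black-forest-labs/FLUX.1-dev \
    --ulysses_degree 4 \
    --num_inference_steps 28 \
    --height 1024 --width 1024 \
    --use_torch_compile
```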

The following figure shows the scalability of Flux.1 on two 8xL40 nodes, 16 L40 GPUs in total.
Although CFG parallelism is not available, we can still achieve enhanced scalability by using PipeFusion for parallelism between nodes.
For the 1024px task, hybrid parallelism on 16xL40 achieves 1.16x lower latency than on 8xL40, with the best configuration being ulysses=4 and pipefusion=4.
@@ -27,7 +52,6 @@ The performance improvement is not achieved for 2048px tasks with 16 GPUs.
alt="scalability-flux_l40">
</div>
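One constraint worth making explicit: the parallel degrees of a hybrid configuration must multiply to the total GPU count. A hypothetical sanity-check helper (not part of xDiT) illustrates this for the best 16xL40 configuration named above:

```python
# Hypothetical helper: the product of the parallel degrees must equal
# the total number of GPUs participating in the run.
def check_parallel_config(world_size, ulysses=1, ring=1, pipefusion=1, cfg=1):
    product = ulysses * ring * pipefusion * cfg
    if product != world_size:
        raise ValueError(
            f"parallel degrees multiply to {product}, expected {world_size}"
        )
    return product

# Best 1024px configuration on 16xL40 from the text: ulysses=4, pipefusion=4.
check_parallel_config(16, ulysses=4, pipefusion=4)  # → 16
```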


The following figure demonstrates the scalability of Flux.1 on 8xA100 GPUs.
For both the 1024px and the 2048px image generation tasks, SP-Ulysses exhibits the lowest latency among the single parallel methods. The optimal hybrid strategy is also SP-Ulysses in this case.

@@ -81,7 +105,7 @@ This is due to the increased memory requirements for activations, along with mem

By leveraging Parallel VAE, xDiT can generate images at higher resolutions, producing images with even greater detail and clarity. Apply `--use_parallel_vae` in the [running script](../../examples/run.sh).

prompt是"A hyperrealistic portrait of a weathered sailor in his 60s, with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin. He’s wearing a faded blue captain’s hat and a thick wool sweater. The background shows a misty harbor at dawn, with fishing boats barely visible in the distance."
prompt is "A hyperrealistic portrait of a weathered sailor in his 60s, with deep-set blue eyes, a salt-and-pepper beard, and sun-weathered skin. He’s wearing a faded blue captain’s hat and a thick wool sweater. The background shows a misty harbor at dawn, with fishing boats barely visible in the distance."

The quality of image generation at 2048px, 3072px, and 4096px resolutions is as follows. It is evident that the quality of the 4096px generated images is significantly lower.

27 changes: 25 additions & 2 deletions docs/performance/flux_zh.md
@@ -14,9 +14,32 @@ Real-time deployment of Flux.1 faces the following challenges:

### Scalability of Flux.1 Dev

We benchmarked the performance of FLUX.1 [dev] with 28 diffusion steps.
We benchmarked the performance of FLUX.1-dev with 28 diffusion steps.

The figure below shows the scalability of Flux.1 on two 8xL40 nodes (16 L40 GPUs in total).
The table below shows the latency (in seconds) of different USP strategies on 4xH100. Thanks to the H100's excellent NVLink bandwidth, USP is a better fit than PipeFusion. torch.compile optimization is crucial on H100, achieving a 2.6x speedup on 4xH100.

<div align="center">

| Configuration | PyTorch (Sec) | torch.compile (Sec) |
|--------------|---------|---------|
| 1 GPU | 6.71 | 4.30 |
| Ulysses-2 | 4.38 | 2.68 |
| Ring-2 | 5.31 | 2.60 |
| Ulysses-2 x Ring-2 | 5.19 | 1.80 |
| Ulysses-4 | 4.24 | 1.63 |
| Ring-4 | 5.11 | 1.98 |

</div>

The figure below shows the latency metrics of Flux.1-dev on 4xH100. xDiT generates a 1024px image in just 1.6 seconds!

<div align="center">
<img src="https://raw.githubusercontent.com/xdit-project/xdit_assets/main/performance/flux/Flux-1K-H100.png"
alt="scalability-flux_h100">
</div>


The figure below shows the scalability of Flux.1-dev on two 8xL40 nodes (16 L40 GPUs in total).
Although CFG parallelism cannot be used, we can still achieve enhanced scalability by using PipeFusion for parallelism between nodes.
For the 1024px task, hybrid parallelism on 16xL40 achieves 1.16x lower latency than on 8xL40, with the best configuration being ulysses=4 and pipefusion=4.
For the 4096px task, hybrid parallelism is still beneficial on 16 L40s, achieving 1.9x lower latency than on 8 GPUs, with the configuration ulysses=2, ring=2, and pipefusion=4.
