[doc] sequence parallel document #6073

Open: wants to merge 10 commits into base: main
seq parallel doc
wangbluo committed Sep 27, 2024
commit 2f56b5ae4a446014d5ded7d03d37698f9f920ac5
4 changes: 4 additions & 0 deletions docs/source/zh-Hans/features/sequence_parallelism.md
@@ -151,9 +151,13 @@ for step, batch in enumerate(tqdm(dataloader, desc="Step", disable=not dist.get_

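### Conclusion
Of the sequence parallelism methods above, ring attn and Ulysses each have strengths and weaknesses, so the right choice depends on the situation: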

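Communication: Ulysses has a lower communication volume than ring attn. Ulysses mainly involves three All2All communications, whereas the communication of ring attn grows quadratically with sequence length. On the other hand, All2All places higher demands on the underlying hardware.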

Memory usage: the two are similar.

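Model structure generalization: ring attn is better than Ulysses. Ulysses generalizes only moderately well because it constrains the number of attention heads: the head number must divide evenly, so that `head number // (tp group size * sp group size)` leaves no remainder. ring attn has no such restriction.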
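The head-number constraint can be checked before launching a job. The following is a minimal sketch, not part of the ColossalAI API: the function name `ulysses_head_check` and its parameters are hypothetical, and it assumes the constraint means that the head number must be an exact multiple of the product of the two group sizes.

```python
def ulysses_head_check(num_heads: int, tp_size: int, sp_size: int) -> bool:
    """Return True if num_heads splits evenly across tp_size * sp_size ranks.

    Hypothetical helper for illustration only; assumes the Ulysses
    requirement is num_heads % (tp_size * sp_size) == 0.
    """
    if num_heads <= 0 or tp_size <= 0 or sp_size <= 0:
        raise ValueError("num_heads, tp_size and sp_size must be positive")
    return num_heads % (tp_size * sp_size) == 0


if __name__ == "__main__":
    # 32 heads over tp=2, sp=8: 32 % 16 == 0, so the split is even.
    print(ulysses_head_check(32, tp_size=2, sp_size=8))
    # 40 heads over tp=2, sp=8: 40 % 16 == 8, so the split is uneven.
    print(ulysses_head_check(40, tp_size=2, sp_size=8))
```

A model that fails this check would need a different `tp`/`sp` combination, or ring attn, which has no such restriction.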

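Because it is simple to use and does not require intrusive changes to the Attention computation, Ulysses is currently the mainstream approach to sequence parallelism. All of these sequence parallelism methods are compatible with other high-performance attention implementations such as flash attention, and can be combined with other parallel training strategies such as ZeRO, TP, PP, and DP.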

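Overall, for beginners and small to medium-sized enterprise users, we recommend all_to_all. In our tests on two machines with 16 GPUs in total, the launch arguments `--tp 2 --sp 8 --sp_mode all_to_all` easily train sequences of length 128k, and its performance is also the best of all the sequence parallelism modes. If you are pursuing extreme performance optimization, or training long texts on a larger number of machines, consider the ring attention mode of sequence parallelism.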
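The arithmetic behind the `--tp 2 --sp 8` configuration can be sketched as follows. This is an illustrative calculation rather than ColossalAI code: the function `parallel_layout` is hypothetical, and it assumes no pipeline parallelism, that ranks not used by TP or SP form the data parallel group, that the sequence is split evenly across the SP group, and that "128k" means 128 * 1024 tokens.

```python
def parallel_layout(world_size: int, tp_size: int, sp_size: int, seq_len: int) -> dict:
    """Derive the data parallel size and the per-rank sequence shard length.

    Hypothetical helper for illustration only; assumes pp_size == 1 and an
    even split of the sequence across the sequence parallel group.
    """
    model_parallel = tp_size * sp_size
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp_size * sp_size")
    if seq_len % sp_size != 0:
        raise ValueError("seq_len must be divisible by sp_size")
    return {
        "dp_size": world_size // model_parallel,
        "tokens_per_sp_rank": seq_len // sp_size,
    }


if __name__ == "__main__":
    # Two machines, 16 GPUs in total, with --tp 2 --sp 8 and a 128k sequence.
    layout = parallel_layout(world_size=16, tp_size=2, sp_size=8, seq_len=128 * 1024)
    print(layout)  # {'dp_size': 1, 'tokens_per_sp_rank': 16384}
```

Under these assumptions the 16 GPUs are fully occupied by TP and SP, and each sequence parallel rank holds a 16384-token shard of the sequence.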