
feat(checkpoint): support universal checkpoint #394

Open · wants to merge 12 commits into `develop`

Conversation

@li126com (Collaborator) commented Dec 23, 2024

Special note: the technical approach of this feature module is based on the veScale checkpoint and ByteCheckpoint implementations.
veScale: https://github.com/volcengine/veScale/tree/main
ByteCheckpoint: https://arxiv.org/abs/2407.20143

Universal checkpoint system

The universal ckpt system is independent of the original ckpt system; the two are mutually incompatible.

Basic features

Dynamic loading of dense-model model ckpts and optimizer ckpts across different parallel configurations:

  • GPU world size
  • tensor parallel
  • pipeline parallel
  • isp + wp
  • zero1
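The core idea behind loading a checkpoint under a different parallel configuration can be sketched as follows. This is a minimal illustration only, not the PR's actual implementation, and all function names here are hypothetical: each new rank computes the global offset range of the slice it should own, then assembles it from the overlapping fragments of the shards saved under the old world size.

```python
# Minimal resharding sketch (hypothetical helpers, not this PR's code).
# A parameter of global length N is saved as `old_ws` contiguous shards;
# on load, each of the `new_ws` ranks rebuilds its slice from the
# overlapping pieces of the saved shards.

def shard_range(global_len, world_size, rank):
    """Contiguous [start, end) range owned by `rank` under an even split."""
    base, rem = divmod(global_len, world_size)
    start = rank * base + min(rank, rem)
    end = start + base + (1 if rank < rem else 0)
    return start, end

def reshard(saved_shards, old_ws, new_ws, rank):
    """Rebuild the slice for `rank` under `new_ws` from shards saved under `old_ws`."""
    global_len = sum(len(s) for s in saved_shards)
    new_start, new_end = shard_range(global_len, new_ws, rank)
    out = []
    for old_rank, shard in enumerate(saved_shards):
        old_start, old_end = shard_range(global_len, old_ws, old_rank)
        lo, hi = max(new_start, old_start), min(new_end, old_end)
        if lo < hi:  # this saved shard overlaps the slice we want
            out.extend(shard[lo - old_start:hi - old_start])
    return out
```

For example, a 10-element parameter saved by 2 ranks can be reloaded by 4 ranks, each picking out only its own overlap. The real system additionally has to track per-tensor global metadata (shape, dtype, placement), which is what the save plan / global meta below refer to.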

Optimizations

  • dp balance save: within each data-parallel group, the model is split evenly across the dp ranks, and each rank saves its own ckpt shard. Enabled by default.
  • saving cache: after the first checkpoint save of a training run, the save plan and global meta are cached (occupying only KB-level host memory); subsequent saves read the cache directly instead of recomputing the slicing information. Enabled by default.
  • async save: asynchronous checkpoint saving; the actual I/O does not block the program, and a synchronization is performed before the next checkpoint save. This optimization does not take effect on the first ckpt save. Optional.
  • broadcast load: when loading, within each group of DP ranks only dp rank 0 reads the files, then broadcasts the data to the other dp ranks. Optional.
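The async-save pattern can be sketched with plain threads. This is a hedged illustration of the idea, not the PR's implementation (the `AsyncSaver` name and its interface are invented here): the blocking write runs in a background thread so training continues, and the previous writer is joined before the next save starts, which matches the "synchronize before the next checkpoint save" behavior described above.

```python
# Sketch of asynchronous checkpoint saving (hypothetical helper class,
# not the actual code in this PR).
import threading

class AsyncSaver:
    def __init__(self, write_fn):
        self._write_fn = write_fn   # the real, blocking I/O routine
        self._worker = None

    def save(self, state, path):
        self.wait()                 # synchronize with the previous save first
        self._worker = threading.Thread(
            target=self._write_fn, args=(state, path)
        )
        self._worker.start()        # training resumes immediately

    def wait(self):
        """Block until the in-flight save (if any) has finished."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None
```

In a real distributed setting the background path would serialize sharded state to storage; broadcast load would similarly be built on a collective such as `torch.distributed.broadcast`, with only dp rank 0 touching the filesystem.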

Accuracy validation

Resumed training from the step-100 checkpoint of a dp4_zero2_tp2_pp2 configuration:
(screenshot attached in the original PR)

Performance comparison

7B model, 16 GPUs, dp4_zero2_tp2_pp2 configuration.
Checkpoint save time:
  original ckpt: 38.8 s
  universal ckpt: 18.5 s for the first save, 0.88 s for subsequent saves (save cache + async save)

Checkpoint load time:
Reloading under the same configuration takes roughly 22 s for both the universal ckpt and the original ckpt.
Under a changed configuration, load time depends on the specific new configuration; across the accuracy-validation runs above, universal ckpt load time ranged from 22 s to 70 s.

@li126com li126com marked this pull request as draft December 23, 2024 08:21
@li126com li126com changed the title Universal checkpoint feat(checkpoint): support universal checkpoint Dec 25, 2024
@li126com li126com marked this pull request as ready for review December 25, 2024 06:06