feat(checkpoint): support universal checkpoint #394
Special note: the technical approach of this feature module is based on veScale checkpoint and ByteCheckpoint.
veScale: https://github.com/volcengine/veScale/tree/main
ByteCheckpoint: https://arxiv.org/abs/2407.20143
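Universal checkpoint system
The universal ckpt system is independent of the original ckpt system, and the two are not compatible with each other.
Basic features
For Dense models, the model ckpt and the optimizer ckpt can be loaded dynamically under different parallel configurations: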
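To illustrate what loading under a different parallel configuration involves, here is a minimal sketch of resharding a single weight from a tp2 layout to a tp4 layout. It is not the code in this PR: the helpers `split_rows`, `merge_rows`, and `reshard` are hypothetical names, the weight is a plain list of rows, and the sketch merges the full tensor before re-splitting, which a production system would typically avoid by reading only the regions each rank needs.

```python
from typing import List

Matrix = List[List[float]]


def split_rows(full: Matrix, parts: int) -> List[Matrix]:
    """Split a 2-D weight (list of rows) into `parts` equal row-wise shards."""
    if len(full) % parts != 0:
        raise ValueError("row count must be divisible by the number of shards")
    size = len(full) // parts
    return [full[i * size:(i + 1) * size] for i in range(parts)]


def merge_rows(shards: List[Matrix]) -> Matrix:
    """Concatenate row-wise shards back into the full weight."""
    return [row for shard in shards for row in shard]


def reshard(shards: List[Matrix], new_parts: int) -> List[Matrix]:
    """Convert shards saved under one split into shards for another split."""
    return split_rows(merge_rows(shards), new_parts)


# Hypothetical 4x2 weight, saved with 2 shards and loaded with 4 shards.
weight = [[float(r * 10 + c) for c in range(2)] for r in range(4)]
saved = split_rows(weight, 2)
loaded = reshard(saved, 4)
```

Each of the four new shards holds one row of the original weight, and merging them again reproduces the weight exactly, which is the property a reshardable checkpoint has to preserve.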
Optimizations
Accuracy verification
Resume training from the step-100 ckpt saved under the dp4_zero2_tp2_pp2 configuration:
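Performance comparison
7B model, 16 devices, dp4_zero2_tp2_pp2 configuration
ckpt save time:
Original ckpt: 38.8s
Universal ckpt: 18.5s for the first save, 0.88s for subsequent saves (save cache + async save)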
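The gap between the first and later saves can be pictured with the sketch below. It assumes that "save cache" means reusing a save plan built on the first save, and that "async save" means snapshotting the state and writing it in a background thread; neither detail is spelled out in this description. The class `AsyncCheckpointer` and its methods are hypothetical, and `pickle` stands in for the real storage format.

```python
import copy
import os
import pickle
import tempfile
import threading
from typing import Any, Dict, List, Optional


class AsyncCheckpointer:
    """Hypothetical saver: caches the save plan and writes in the background."""

    def __init__(self, directory: str) -> None:
        self.directory = directory
        self._plan: Optional[List[str]] = None  # cached after the first save
        self.plan_builds = 0                    # counts how often a plan was built
        self._worker: Optional[threading.Thread] = None

    def _build_plan(self, state: Dict[str, Any]) -> List[str]:
        # Stand-in for the expensive planning step (assumed to be what is cached).
        self.plan_builds += 1
        return sorted(state)

    def save(self, state: Dict[str, Any], step: int) -> str:
        self.wait()  # allow at most one in-flight write
        if self._plan is None:
            self._plan = self._build_plan(state)
        # Snapshot first so training can keep mutating `state` during the write.
        snapshot = {key: copy.deepcopy(state[key]) for key in self._plan}
        path = os.path.join(self.directory, f"step_{step}.pkl")
        self._worker = threading.Thread(target=self._write, args=(snapshot, path))
        self._worker.start()
        return path

    @staticmethod
    def _write(snapshot: Dict[str, Any], path: str) -> None:
        tmp = path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, path)  # publish the file only once it is complete

    def wait(self) -> None:
        """Block until the pending background write, if any, has finished."""
        if self._worker is not None:
            self._worker.join()
            self._worker = None


state = {"weight": [1.0, 2.0], "bias": [0.0]}
saver = AsyncCheckpointer(tempfile.mkdtemp())
first = saver.save(state, step=100)
state["weight"][0] = 5.0  # training continues while the first write runs
second = saver.save(state, step=200)
saver.wait()
```

In this sketch the training loop only pays for the snapshot copy; the plan is built once and the disk write overlaps with subsequent work, which is one plausible reading of why later saves are so much faster than the first.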
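ckpt load time:
When reloading under the same configuration, the universal ckpt and the original ckpt take about the same time, roughly 22s each.
When the configuration changes, the load time depends on the specific new configuration; across the experiments in the accuracy verification above, the universal ckpt load time ranged from 22s to 70s.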