v0.4.0
New Features (highlights)
- Streaming multipack for continued pre-training
- Mistral & Mixtral support
- Simplified Multipack for Mistral, Falcon, Qwen2, and Phi
- DPO/IPO/KTO-pairs RL training support via trl (see the DPO config sketch after the What's Changed list)
- Improved BatchSampler for multipack, enabling resume from checkpoint and data shuffling each epoch
- `bf16: auto` support (see the example config sketch below)
- MLflow support
- Save YAML configs to WandB
- Save predictions during evals to WandB
- More tests, including additional smoke tests for small-model training
- NEFTune support
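Several of these highlights are config-driven. The following is a minimal, untested sketch of how they might be combined in a single YAML; key names such as `neftune_noise_alpha`, `mlflow_experiment_name`, `evals_per_epoch`, and `saves_per_epoch` are assumed to match this release, so verify them against the docs for your installed version.

```yaml
# Hypothetical excerpt of an axolotl config exercising v0.4.0 highlights.
# Key names are assumed, not verified; check your version's docs.
base_model: mistralai/Mistral-7B-v0.1

sequence_len: 4096
sample_packing: true          # simplified multipack (Mistral, Falcon, Qwen2, Phi)
eval_sample_packing: false    # sample packing for evals is now configurable

bf16: auto                    # new bf16: auto support
flash_attention: true

neftune_noise_alpha: 5        # NEFTune (HF Transformers implementation)

evals_per_epoch: 4            # replaces hand-tuned eval_steps
saves_per_epoch: 1            # replaces hand-tuned save_steps
warmup_ratio: 0.1

wandb_project: my-project         # YAML config and eval predictions are logged to WandB
mlflow_experiment_name: my-exp    # new MLflow tracking support
```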
What's Changed
- document that packaging needs to be installed before flash-attn by @winglian in #559
- Fix pretraining with iterable/streaming Dataset by @jphme in #556
- Add training callback to send predictions to WandB table by @Glavin001 in #521
- fix wandb so mypy doesn't complain by @winglian in #562
- check for the existence of the default accelerate config that can create headaches by @winglian in #561
- add optimization for group-by-len by @winglian in #563
- gracefully handle length feature used for group by by @winglian in #565
- improve how we setup eval/save strategies and steps by @winglian in #547
- let hf trainer handle torch compile by @winglian in #516
- Model parallel by @winglian in #538
- fix save_steps so it doesn't get duplicated by @winglian in #567
- set auto for other params that hf trainer sets for ds. include zero1 json by @winglian in #570
- remove columns after tokenizing for pretraining by @winglian in #571
- mypy wandb ignore by @winglian in #572
- Phi examples by @winglian in #569
- e2e testing by @winglian in #574
- E2e device cuda by @winglian in #575
- E2e passing tests by @winglian in #576
- refactor scripts/finetune.py into new cli modules by @winglian in #550
- update support matrix with btlm and phi by @winglian in #579
- prevent cli functions from getting fired on import by @winglian in #581
- Fix Codellama examples by @Kimiko-AI in #582
- support custom field for completion from yml by @winglian in #580
- Feat(doc): Add features to doc by @NanoCode012 in #583
- Support Sample packing for phi arch by @winglian in #586
- don't resize embeddings if it's already large enough by @winglian in #577
- Enable full (non-sharded) model saving with SHARDED_STATE_DICT by @jphme in #584
- make phi training work with Loras by @winglian in #588
- optionally configure sample packing for evals by @winglian in #589
- don't add position_ids for evals when not using eval sample packing by @winglian in #591
- gather/broadcast the max value of the packing efficiency automatically by @winglian in #463
- Feat(data): Allow loading local csv and text by @NanoCode012 in #594
- add bf16 check by @winglian in #587
- btlm and falcon monkey patches for flash attn by @winglian in #566
- minor tweaks to simplify by @winglian in #597
- Fix for check with cfg and merge_lora by @winglian in #600
- improve handling for empty text on the tokenization step by @winglian in #502
- more sane defaults for openllama 3b used for quickstarts by @winglian in #602
- update dockerfile to not build evoformer since it fails the build by @winglian in #607
- Delete duplicate lines in models.py by @bofenghuang in #606
- support to disable exllama for gptq by @winglian in #604
- Update requirements.txt - Duplicated package by @Psancs05 in #610
- Only run tests when a change to python files is made by @maximegmd in #614
- Create multi-node.md by @maximegmd in #613
- fix distributed devices by @maximegmd in #612
- ignore wandb to resolve isort headaches by @winglian in #619
- skip the gpu memory checks if the device is set to 'auto' by @winglian in #609
- let MAX_JOBS use the default since we're not resource constrained on our self-hosted runners by @winglian in #427
- run eval on the first step to get a baseline by @winglian in #617
- split completion text to sequence_len by @winglian in #616
- misc fixes to add gptq tests by @winglian in #621
- chore(callback): Remove old peft saving code by @NanoCode012 in #510
- update README w deepspeed info by @winglian in #605
- create a model card with axolotl badge by @winglian in #624
- better handling and logging of empty sharegpt turns by @winglian in #603
- tweak: improve base builder for smaller layers by @maximegmd in #500
- Feat(doc): Add eval_sample_packing to doc by @NanoCode012 in #625
- Fix: Fail bf16 check when running on cpu during merge by @NanoCode012 in #631
- default model changed by @mhenrichsen in #629
- Added quotes to the `pip install -e` command in the documentation to fix an incompatibility … by @Nan-Do in #632
- Feat: Add support for upstream FA2 by @NanoCode012 in #626
- eval_table isn't quite stable enough to be in default llama configs by @winglian in #637
- attention_mask not needed for training by @winglian in #642
- update for recent transformers updates by @winglian in #636
- use fastchat conversations template by @winglian in #578
- skip some flash attn patches unless explicitly enabled by @winglian in #643
- Correct typos in datasets.py by @felixonmars in #639
- Fix bug in dataset loading by @ethanhs in #284
- Warn users to login to HuggingFace by @Napuh in #645
- Mistral flash attn packing by @winglian in #646
- Fix(cfg): Add validation for save_strategy and eval_strategy by @NanoCode012 in #633
- Feat: Add example for Mistral by @NanoCode012 in #644
- Add mistral/README.md by @adarshxs in #647
- fix for flash attn w mistral w/o sample packing by @winglian in #648
- don't strip the prompt for check since we don't strip to tokenize anymore by @winglian in #650
- add support for defined train split by @winglian in #654
- Fix bug when using pretokenized datasets by @ein-ich in #652
- Make dataset_processes configurable by @corbt in #651
- add mistral e2e tests by @winglian in #649
- removed duplicate on requirements.txt by @Napuh in #661
- make sure we also run CI tests when requirements.txt changes by @winglian in #663
- prepared dataset caching, other misc fixes by @winglian in #665
- remove patch fix for phi by @winglian in #664
- refactor to set eval_batch_size earlier if unset, so we can warn if mismatched by @winglian in #662
- Feat: Add config yaml to section for reprod in bug-report.yaml by @NanoCode012 in #667
- Feat: Allow usage of native Mistral FA when no sample_packing by @NanoCode012 in #669
- chore: Clean up repetitive model kwargs by @NanoCode012 in #670
- Fix(version): Update FA to work with Mistral SWA by @NanoCode012 in #673
- Fix(tokenizer): Set rstrip,lstrip,norm to False by @NanoCode012 in #678
- Fix: Future deprecation warning with use_auth_token by @NanoCode012 in #680
- Feat: Set WORKDIR to /workspace/axolotl by @NanoCode012 in #679
- Fix: ValueError when FA + Mistral when padding_side=right by @NanoCode012 in #681
- flash_attention + sample packing for stablelm 3b by @winglian in #671
- Adding qlora config for Mistral by @TokenBender in #675
- Fix: Higher vram usage for mistral and sample_packing by @NanoCode012 in #691
- fix multiline for docker by @winglian in #694
- update mistral lr, sample pack by @mhenrichsen in #693
- apex not needed as amp is part of pytorch by @winglian in #696
- add docker images for pytorch 2.1.0 by @winglian in #697
- fix unneeded space by @mhenrichsen in #699
- Update README with some explanations by @seungduk-yanolja in #700
- Get qlora mistral-7b fine tuning working on a single 4090 by @lukemarsden in #708
- fix(doc): Add note on inference w sample packing by @NanoCode012 in #712
- Fix: lowercase `True` values in config by @atgctg in #713
- fix(doc): update default doc according to arg by @NanoCode012 in #714
- Save Axolotl config as WandB artifact by @jphme in #716
- improve handling of the prepared ds path and other cfg defaults by @winglian in #701
- fix pytorch 2.1.0 build, add multipack docs by @winglian in #722
- add noisy embedding by @maximegmd in #721
- pin xformers >= 0.0.22 by @winglian in #724
- misc sharegpt fixes by @winglian in #723
- workaround for installing xformers w torch 2.1.0 by @winglian in #725
- tweak for xformers install w pytorch 2.1.0 by @winglian in #727
- fixes for alpaca w chatml, and don't include attention_mask w mistral for flash attention by @winglian in #728
- Clarify custom format example by @casper-hansen in #729
- Mistral: Sliding Window Attention with Flash Attention and Sample Packing by @casper-hansen in #732
- badge by @mhenrichsen in #739
- catch ConnectionError when checking dataset from HuggingFace by @Napuh in #743
- Fix(model): Linear detected and added to target module with rope linear by @NanoCode012 in #738
- improve: Enhance code readability of prompt_tokenizers.py by @seungduk-yanolja in #707
- add a latest tag for regular axolotl image, cleanup extraneous print statement by @winglian in #746
- Fix DeepSpeed Zero 3 Saving by @tokestermw in #709
- chore: bump transformers to v4.34.1 to fix tokenizer issue by @NanoCode012 in #745
- add to docs by @winglian in #703
- Implement fused modules by @casper-hansen in #747
- remove lora fused packing test by @winglian in #758
- Fix: eval table conflict with eval_sample_packing by @NanoCode012 in #769
- Fix: Cannot tokenize with bf16 and on cpu by @NanoCode012 in #766
- Hotfix for fused QKV not saving the trained weights of o_proj by @casper-hansen in #762
- convert exponential notation lr to floats by @winglian in #771
- Fix: Warn when fullfinetune without adapter by @NanoCode012 in #770
- simplify by removing duplicate base_model_config by @winglian in #772
- disable eval table w sample packing in examples by @winglian in #778
- refactor setup trainer so we can add more hooks by @winglian in #773
- chore: refactor truthy check and fix mypy by @NanoCode012 in #780
- chore(readme): Improve documentation on conversation field by @NanoCode012 in #782
- Threaded MultipackDistributedDataloader with prefetched samples by @casper-hansen in #759
- Create preprocess CLI by @casper-hansen in #785
- Add docker advanced instruction to README by @gordicaleksa in #792
- Fix Deepspeed Zero3 Config by @teknium1 in #791
- Update to adapt to sharegpt datasets with "assistant" rather than "gp… by @MilesQLi in #774
- fix eval_steps to be a sane default by @winglian in #797
- refactor neft patch to be more re-usable similar to trl's impl by @winglian in #796
- fix(config): Set eos/bos to tokenizer if different by @NanoCode012 in #801
- feat(doc): add dummyoptim faq fix by @NanoCode012 in #802
- fix(tokenizer): update log order after update by @NanoCode012 in #806
- fix model parallel by @winglian in #816
- fix: pin autogptq by @NanoCode012 in #818
- update table for rwkv4 support, fix process count for dataset by @winglian in #822
- Feat: Added Gradio support by @Stillerman in #812
- Dockerfile: add deepspeed-kernels dependency for deepspeed>=0.12.0 by @fpreiss in #827
- cleanup verbosity a bit by @winglian in #799
- make sure to cleanup tmp output_dir for e2e tests by @winglian in #831
- multipack w batch sampler by @winglian in #795
- don't compile deepspeed or bitsandbytes from source by @winglian in #837
- Pin optimum package by @brthor in #838
- cleanup the old multipack dataloader by @winglian in #841
- include the suffix modified string in ascii art by @fpreiss in #852
- feat(doc): add more info on train_on_split by @NanoCode012 in #855
- chore(doc): Separate section on runpod by @NanoCode012 in #860
- various bugfixes by @winglian in #856
- adds llama and mistral dropout support by @winglian in #858
- multipack len should use max, not min by @winglian in #863
- Docs: add instructions to 1-click launching on public clouds by @concretevitamin in #862
- Update data.py for signature generation by @MilesQLi in #851
- lint fix that didn't get caught by linter by @winglian in #866
- make docker command more robust by @winglian in #861
- add e2e tests for checking functionality of resume from checkpoint by @winglian in #865
- allow overriding of model_config parameters from the YML by @winglian in #853
- Feat: Add dataset loading from S3, GCS by @NanoCode012 in #765
- try #2: pin hf transformers and accelerate to latest release, don't reinstall pytorch by @winglian in #867
- don't train if eval split is too small by @winglian in #873
- Phi update 202311 by @winglian in #876
- Install from git url by @msaroufim in #874
- fix: revert local dir dataset load by @NanoCode012 in #878
- chore(doc): Add info on changing role in sharegpt by @NanoCode012 in #886
- Feat: Add warmup_ratio by @NanoCode012 in #893
- fix: warning should not show if eval_batch_size not provided by @NanoCode012 in #896
- Feat: Add Qwen by @NanoCode012 in #894
- update datasets version to cut down the warnings due to pyarrow arg change by @winglian in #897
- fix: remove FA for qwen examples by @NanoCode012 in #900
- Determine FSDP/deepspeed settings on device select. by @kallewoof in #883
- ensure merged model matches the training dtype by @winglian in #902
- fix for qwen w lora by @winglian in #906
- Remove lr scheduler in DeepSpeed config to avoid conflict by @Haoxiang-Wang in #909
- feature: loss watchdog for terminating training runs that are failing by @kallewoof in #899
- Feat(wandb): Refactor to be more flexible by @NanoCode012 in #767
- Support device_map=sequential & max_memory config parameters by @brthor in #903
- feat: add check for quantized model by @NanoCode012 in #913
- Pin flash-attn to 2.3.3 by @casper-hansen in #919
- fix(tokenizer): handle fast tokenizer properly for bos/eos by @NanoCode012 in #914
- support for mamba by @winglian in #915
- fixing prompt template of chatml by removal of linebreak by @timothylimyl in #922
- Mixtral multipack by @winglian in #928
- update to latest transformers for mixtral support by @winglian in #929
- Mixtral: More correct MoE, lower loss by @casper-hansen in #932
- Update requirements.txt (fschat==0.2.34) by @tokestermw in #940
- Mixtral official by @winglian in #942
- Respect sequence_len in config for `type: llama2_chat` by @hamelsmu in #926
- new evals_per_epoch and saves_per_epoch to make things cleaner by @winglian in #944
- More hints on what to do with CUDA Out of memory errors by @jooray in #925
- fix: remove excessive newlines in system prompt(s) for alpaca by @kallewoof in #936
- Flash attn hotfix by @winglian in #951
- Fix Deepspeed loading by @winglian in #950
- fix: switch to using the HuggingFace Transformers NEFT implementation by @kallewoof in #941
- Add docs by @hamelsmu in #947
- Fix prompt assembly for llama by @hamelsmu in #952
- update transformers to fix checkpoint saving by @dumpmemory in #963
- update to latest nccl in docker image by @winglian in #965
- fix for build for nccl in dockerfile by @winglian in #970
- fix: add lr scheduler kwargs to Trainer by @NanoCode012 in #972
- Update README.md by @eltociear in #966
- Dockerfile torch fix by @winglian in #987
- fix mistral prompt assembly by @hamelsmu in #982
- Feat: Warns to add to modules_to_save when adding tokens or switching special_tokens by @NanoCode012 in #787
- Add tests to Docker by @hamelsmu in #993
- change val size by @mhenrichsen in #992
- chore: Update transformers to latest by @NanoCode012 in #986
- support for cuda 12.1 by @winglian in #989
- set output_router_logits for mixtral config: by @winglian in #995
- Add an example config for finetuning a 34B model on a 24GB GPU by @evangriffiths in #1000
- FEAT: add tagging support to axolotl by @younesbelkada in #1004
- Set eval_sample_packing to false in mistral config.yaml by @kmsydney in #1003
- add config to model card by @hamelsmu in #1005
- remove landmark attn and xpos rope implementations by @winglian in #1010
- [Docs] Nit: clarify what inference is by @hamelsmu in #1012
- [Docs] Nit: Remind people to auth to wandb if they are going to use it by @hamelsmu in #1013
- feat: remove need to add load_in* during merge by @NanoCode012 in #1017
- feat: expose bnb kwargs by @NanoCode012 in #1018
- add ultrachat prompt strategies by @winglian in #996
- [WandB] Push axolotl config to top level wandb files by @hamelsmu in #1014
- Adds chat templates by @mhenrichsen in #1022
- Fix: bf16 support for inference by @taziksh in #981
- use recommended setting for use_reentrant w gradient checkpointing by @winglian in #1021
- added tiny llama examples for lora and qlora by @tdolan21 in #1027
- chore(readme): update instruction to set config to load from cache by @NanoCode012 in #1030
- [Docs] delete unused cfg value `lora_out_dir` by @hamelsmu in #1029
- fix: lint by @NanoCode012 in #1037
- chore(config): clean up old log for Qwen by @NanoCode012 in #1034
- bump transformers and update attention class map name by @winglian in #1023
- Added chatglm3 conversation type for training models like TinyLLama by @xaviviro in #1036
- fix HF model card upload for PEFT models by @hamelsmu in #1043
- Clean Up LorA Merge by @hamelsmu in #1044
- feature: better device mapping for large models by @kallewoof in #918
- feat: always push checkpoint to hub if set by @NanoCode012 in #1049
- Update tests-docker.yml by @hamelsmu in #1052
- streaming multipack for pretraining dataset by @jinwonkim93 in #959
- Simplify Docker Unit Test CI by @hamelsmu in #1055
- Phi2 rewrite by @winglian in #1058
- Efficiently get the length of the tokenized docs by @RicardoDominguez in #1063
- Sponsors by @winglian in #1065
- Update FUNDING.yml for Kofi link by @winglian in #1067
- fix: torch_dtype mistral default to fp32 by @NanoCode012 in #1050
- Cosine learning rate schedule - minimum learning rate by @RicardoDominguez in #1062
- fix double eos token for chatml by @winglian in #1054
- Add: mlflow for experiment tracking by @JohanWork in #1059
- update peft to 0.7.0 by @mtenenholtz in #1073
- paired kto support by @winglian in #1069
- Separate AutoGPTQ dep to `pip install -e .[auto-gptq]` by @casper-hansen in #1077
- attempt to also run e2e tests that need gpus by @winglian in #1070
- Update FUNDING.yml with bitcoin by @winglian in #1079
- swap the data collator for evals if not using sample packing by @winglian in #1076
- be more robust about checking embedding modules for lora finetunes by @winglian in #1074
- fix: `train_on_inputs: true` ignored for sharegpt by @NanoCode012 in #1045
- update sharegpt conversations when chatml chat template is set by @winglian in #1075
- additional logging to get maximum token length of a sequence in the dataset by @winglian in #1066
- pin accelerate for deepspeed fix by @winglian in #1080
- fix: warn user to install mamba_ssm package by @NanoCode012 in #1019
- use tags again for test image, only run docker e2e after pre-commit checks by @winglian in #1081
- optimize calculation of cu_seqlens from position_ids by @winglian in #1084
- add python 3.11 to the matrix for unit tests by @winglian in #1085
- Remove fused-dense-lib from requirements.txt by @casper-hansen in #1087
- misc fixes from #943 by @winglian in #1086
- add gptneox embeddings, fix phi2 inputs, also fix the casting by @winglian in #1083
- Add Debugging Guide by @hamelsmu in #1089
- Fix debugging.md by @hamelsmu in #1091
- feat: enable trl's autounwrap by @NanoCode012 in #1060
- Fix broken pypi.yml by @msaroufim in #1099
- Update README.md by @hamelsmu in #1103
- Add section for debugging with Docker by @hamelsmu in #1104
- Add link on README to Docker Debugging by @hamelsmu in #1107
- keep gate in fp32 for loras by @winglian in #1105
- Fix debugging video by @hamelsmu in #1111
- Disable caching on `--disable_caching` in CLI by @casper-hansen in #1110
- Reverse caching PR by @casper-hansen in #1115
- Enable or disable bf16 support based on availability by @simhallq in #1116
- update PR template so we can capture twitter or discord handles by @winglian in #1121
- pin model_revision for phi2 by @winglian in #1123
- fix(readme): clarify custom user prompt [no-ci] by @NanoCode012 in #1124
- Add `layers_to_transform` for `lora_config` by @xzuyn in #1118
- Agnostic cloud gpu docker image and Jupyter lab by @winglian in #1097
- Preprocess dataset size fix by @winglian in #1131
- fix(preprocess): Make sure dataset not loaded from cache when using preprocess cli by @NanoCode012 in #1136
- fix bf16 check when preprocessing data by @winglian in #1140
- Add shifted sparse attention by @joecummings in #973
- Multipack simplify for Mixtral by @winglian in #1142
- Fix link for Minotaur model by @joecummings in #1146
- Dockerfile cloud ports by @winglian in #1148
- fix check for env var by @winglian in #1151
- feat(dataset): add config to keep processed dataset in memory by @NanoCode012 in #1152
- Deprecate max packed sequence len by @winglian in #1141
- make sure the model config loader respects the model_revision too by @winglian in #1160
- Qwen2 by @winglian in #1166
- jupyter lab fixes by @winglian in #1139
- set fp16 to false if bf16, update bf16: auto in example YAMLs by @winglian in #1122
- Add mlflow callback for pushing config to mlflow artifacts by @JohanWork in #1125
- improve vram use w gradient checkpointing by @winglian in #1167
- Vram fix attempt by @winglian in #1164
- add commit message option to skip docker image builds in ci by @winglian in #1168
- Falcon embeddings by @winglian in #1149
- support for explicit test_dataset definition for evals by @winglian in #786
- Add desc to map/filter by @casper-hansen in #1162
- Feat(test): Add tests for alpaca chatml prompt tokenizer by @JohanWork in #1088
- DPO cleanup by @winglian in #1126
- Update README.md by @singhay in #1169
- Fine-Tuning Mistral-7b for Real-World Chatbot Applications Using Axolotl (Lora used) by @Tilemachoc in #1155
- don't fail if can't cast weights due to offload when merging by @winglian in #1172
- update docs by @winglian in #1176
- Phi2 multipack by @winglian in #1173
- DPO fixes v2 by @winglian in #1174
- Docs: RLHF Update after cleanup by @AlekseyKorshuk in #1178
- Add support for offline mode with HF_HUB_OFFLINE envvar by @JamesHWade in #1182
- Fix do_merge_lora raises an Exception in transformers v4.37.0 by @tisorlawan in #1184
- report min length of tokenized data by @winglian in #1186
- more dpo fixes for dataset loading and docs by @winglian in #1185
- upgrade deepspeed to 0.13.1 for mixtral fixes by @winglian in #1189
- Standardize system prompt format for AlpacaPrompter (instruct case) by @sadaisystems in #1190
- Mixtral fixes 20240124 by @winglian in #1192
- prepare for release v0.4.0 by @winglian in #1175
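For the trl-backed DPO/IPO/KTO-pairs support highlighted above, a stanza along the following lines should select the DPO path; the dataset path and `type` value are illustrative assumptions rather than a tested recipe, so adapt them to a preference-pair format your axolotl version supports.

```yaml
# Hypothetical DPO stanza for the new trl-backed RL training support.
# Dataset path and type are placeholders, not a verified recipe.
rl: dpo
datasets:
  - path: Intel/orca_dpo_pairs   # example preference-pair dataset
    split: train
    type: chatml.intel           # assumed prompt-format key
```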
New Contributors
- @Kimiko-AI made their first contribution in #582
- @bofenghuang made their first contribution in #606
- @Psancs05 made their first contribution in #610
- @Nan-Do made their first contribution in #632
- @felixonmars made their first contribution in #639
- @Napuh made their first contribution in #645
- @adarshxs made their first contribution in #647
- @ein-ich made their first contribution in #652
- @corbt made their first contribution in #651
- @TokenBender made their first contribution in #675
- @seungduk-yanolja made their first contribution in #700
- @lukemarsden made their first contribution in #708
- @atgctg made their first contribution in #713
- @casper-hansen made their first contribution in #729
- @tokestermw made their first contribution in #709
- @gordicaleksa made their first contribution in #792
- @MilesQLi made their first contribution in #774
- @Stillerman made their first contribution in #812
- @fpreiss made their first contribution in #827
- @brthor made their first contribution in #838
- @concretevitamin made their first contribution in #862
- @msaroufim made their first contribution in #874
- @kallewoof made their first contribution in #883
- @Haoxiang-Wang made their first contribution in #909
- @timothylimyl made their first contribution in #922
- @hamelsmu made their first contribution in #926
- @jooray made their first contribution in #925
- @dumpmemory made their first contribution in #963
- @eltociear made their first contribution in #966
- @evangriffiths made their first contribution in #1000
- @younesbelkada made their first contribution in #1004
- @kmsydney made their first contribution in #1003
- @taziksh made their first contribution in #981
- @tdolan21 made their first contribution in #1027
- @xaviviro made their first contribution in #1036
- @jinwonkim93 made their first contribution in #959
- @RicardoDominguez made their first contribution in #1063
- @JohanWork made their first contribution in #1059
- @mtenenholtz made their first contribution in #1073
- @simhallq made their first contribution in #1116
- @xzuyn made their first contribution in #1118
- @joecummings made their first contribution in #973
- @singhay made their first contribution in #1169
- @Tilemachoc made their first contribution in #1155
- @AlekseyKorshuk made their first contribution in #1178
- @JamesHWade made their first contribution in #1182
- @tisorlawan made their first contribution in #1184
- @sadaisystems made their first contribution in #1190
Full Changelog: v0.3.0...v0.4.0