Merge branch 'sd3' into new_cache
kohya-ss committed Dec 4, 2024
2 parents 744cf03 + 8b36d90 commit b72b9ea
Showing 25 changed files with 1,603 additions and 127 deletions.
42 changes: 42 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,42 @@

name: Python package

on: [push]

jobs:
  build:

    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest]
        python-version: ["3.10"]

    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: python -m pip install --upgrade pip setuptools wheel

      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'
          cache: 'pip' # caching pip dependencies

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install dadaptation==3.2 torch==2.4.0 torchvision==0.19.0 accelerate==0.33.0
          pip install -r requirements.txt
      - name: Test with pytest
        run: |
          pip install pytest
          pytest
8 changes: 5 additions & 3 deletions .github/workflows/typos.yml
@@ -1,9 +1,11 @@
 ---
-# yamllint disable rule:line-length
 name: Typos
 
-on: # yamllint disable-line rule:truthy
+on:
   push:
+    branches:
+      - main
+      - dev
   pull_request:
     types:
       - opened
@@ -18,4 +20,4 @@ jobs:
       - uses: actions/checkout@v4
 
       - name: typos-action
-        uses: crate-ci/typos@v1.24.3
+        uses: crate-ci/typos@v1.28.1
45 changes: 45 additions & 0 deletions README.md
@@ -14,6 +14,26 @@ The command to install PyTorch is as follows:

### Recent Updates


Dec 3, 2024:

- `--blocks_to_swap` now works in FLUX.1 ControlNet training. Sample commands for 24GB VRAM and 16GB VRAM are added [here](#flux1-controlnet-training).

Dec 2, 2024:

- FLUX.1 ControlNet training is supported. PR [#1813](https://github.com/kohya-ss/sd-scripts/pull/1813). Thanks to minux302! See the PR and [here](#flux1-controlnet-training) for details.
  - Not fully tested. Feedback is welcome.
  - 80GB VRAM is required for 1024x1024 resolution, and 48GB VRAM is required for 512x512 resolution.
  - Currently, it only works in a Linux environment (or Windows WSL2) because DeepSpeed is required.
  - Multi-GPU training is not tested.

Dec 1, 2024:

- Pseudo Huber loss is now available for FLUX.1 and SD3.5 training. See PR [#1808](https://github.com/kohya-ss/sd-scripts/pull/1808) for details. Thanks to recris!
  - Specify `--loss_type huber` or `--loss_type smooth_l1` to use it. `--huber_c` and `--huber_scale` are also available. A rough illustrative sketch of the loss is shown after this list.

- [Prodigy + ScheduleFree](https://github.com/LoganBooker/prodigy-plus-schedule-free) is supported. See PR [#1811](https://github.com/kohya-ss/sd-scripts/pull/1811) for details. Thanks to rockerBOO!
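
As a quick reference for the pseudo Huber option above, here is a minimal, generic sketch of what a pseudo Huber loss with threshold `huber_c` computes. It is an illustration only, not necessarily the exact formula used in `train_util`:

```python
import torch

# Generic pseudo-Huber loss: roughly quadratic (L2-like) for residuals much
# smaller than huber_c and roughly linear (L1-like) for larger residuals,
# which makes training less sensitive to outliers.
# Illustration only -- not necessarily the repository's exact formula.
def pseudo_huber_loss(pred: torch.Tensor, target: torch.Tensor, huber_c: float = 0.1) -> torch.Tensor:
    diff = pred - target
    return huber_c**2 * (torch.sqrt(1 + (diff / huber_c) ** 2) - 1)

pred, target = torch.randn(4, 16), torch.randn(4, 16)
print(pseudo_huber_loss(pred, target).mean())
```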

Nov 14, 2024:

- Improved the implementation of block swap and made it available for both FLUX.1 and SD3 LoRA training. See [FLUX.1 LoRA training](#flux1-lora-training) and the related sections for how to use the new options. Training is possible with about 8-10GB of VRAM. A simplified sketch of the block-swap idea is shown below.
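
As a heavily simplified sketch of the block-swap idea (the actual implementation in this repository also handles the backward pass and overlaps transfers with computation, so it differs in detail), assuming the non-swapped blocks and the activations already live on the GPU:

```python
import torch
import torch.nn as nn

# Heavily simplified sketch of the idea behind --blocks_to_swap (not the
# repository's actual implementation): only part of the transformer blocks
# stay in VRAM; swapped blocks are moved to the GPU just before they run and
# back to CPU afterwards, trading step time for memory.
def forward_with_block_swap(blocks: nn.ModuleList, x: torch.Tensor,
                            blocks_to_swap: int, device: str = "cuda") -> torch.Tensor:
    for i, block in enumerate(blocks):
        swap_this_block = i < blocks_to_swap
        if swap_this_block:
            block.to(device)   # load into VRAM just in time
        x = block(x)
        if swap_this_block:
            block.to("cpu")    # free VRAM before the next block runs
    return x
```
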
@@ -28,6 +48,7 @@ Nov 14, 2024:
- [Key Features for FLUX.1 LoRA training](#key-features-for-flux1-lora-training)
- [Specify rank for each layer in FLUX.1](#specify-rank-for-each-layer-in-flux1)
- [Specify blocks to train in FLUX.1 LoRA training](#specify-blocks-to-train-in-flux1-lora-training)
- [FLUX.1 ControlNet training](#flux1-controlnet-training)
- [FLUX.1 OFT training](#flux1-oft-training)
- [Inference for FLUX.1 with LoRA model](#inference-for-flux1-with-lora-model)
- [FLUX.1 fine-tuning](#flux1-fine-tuning)
@@ -245,6 +266,30 @@ example:

If you specify one of `train_double_block_indices` or `train_single_block_indices`, the other will be trained as usual.

### FLUX.1 ControlNet training

We have added a new training script for ControlNet training: `flux_train_control_net.py`. See `--help` for the available options.

A sample command is shown below. It works with 80GB VRAM GPUs.
```
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_control_net.py
--pretrained_model_name_or_path flux1-dev.safetensors --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors
--ae ae.safetensors --save_model_as safetensors --sdpa --persistent_data_loader_workers
--max_data_loader_n_workers 1 --seed 42 --gradient_checkpointing --mixed_precision bf16
--optimizer_type adamw8bit --learning_rate 2e-5
--highvram --max_train_epochs 1 --save_every_n_steps 1000 --dataset_config dataset.toml
--output_dir /path/to/output/dir --output_name flux-cn
--timestep_sampling shift --discrete_flow_shift 3.1582 --model_prediction_type raw --guidance_scale 1.0 --deepspeed
```

For 24GB VRAM GPUs, you can train with 16 blocks swapped, latents and text encoder outputs cached to disk, and a batch size of 1. Remove `--deepspeed` and add the following options to the command above (not fully tested):
```
--blocks_to_swap 16 --cache_latents_to_disk --cache_text_encoder_outputs_to_disk
```

Training is also possible on 16GB VRAM GPUs with around 30 blocks swapped.

`--gradient_accumulation_steps` is also available. The default is 1 (no accumulation); the original PR reportedly used 8. A minimal sketch of what gradient accumulation does is shown below.
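
For readers unfamiliar with the option, the following is a minimal, self-contained sketch of what gradient accumulation does conceptually. The training scripts themselves delegate this to accelerate via `--gradient_accumulation_steps` rather than using a hand-rolled loop like this:

```python
import torch
import torch.nn as nn

# Sketch of gradient accumulation: losses from several micro-batches are
# accumulated before one optimizer step, emulating a larger effective batch
# size without extra VRAM for activations.
model = nn.Linear(4, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
accumulation_steps = 8  # the value reportedly used in the original PR

optimizer.zero_grad()
for step in range(32):
    x = torch.randn(1, 4)                               # micro-batch of size 1
    loss = model(x).pow(2).mean() / accumulation_steps  # scale so the sum is an average
    loss.backward()                                     # gradients add up in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                # one update per 8 micro-batches
        optimizer.zero_grad()
```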

### FLUX.1 OFT training

You can train OFT with almost the same options as LoRA, such as `--timestep_sampling`. The following points are different.
13 changes: 4 additions & 9 deletions fine_tune.py
@@ -380,9 +380,7 @@ def fn_recursive_set_mem_eff(module: torch.nn.Module):
 
                 # Sample noise, sample a random timestep for each image, and add noise to the latents,
                 # with noise offset and/or multires noise if specified
-                noise, noisy_latents, timesteps, huber_c = train_util.get_noise_noisy_latents_and_timesteps(
-                    args, noise_scheduler, latents
-                )
+                noise, noisy_latents, timesteps = train_util.get_noise_noisy_latents_and_timesteps(args, noise_scheduler, latents)
 
                 # Predict the noise residual
                 with accelerator.autocast():
@@ -394,11 +392,10 @@ def fn_recursive_set_mem_eff(module: torch.nn.Module):
                 else:
                     target = noise
 
+                huber_c = train_util.get_huber_threshold_if_needed(args, timesteps, noise_scheduler)
                 if args.min_snr_gamma or args.scale_v_pred_loss_like_noise_pred or args.debiased_estimation_loss:
                     # do not mean over batch dimension for snr weight or scale v-pred loss
-                    loss = train_util.conditional_loss(
-                        noise_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=huber_c
-                    )
+                    loss = train_util.conditional_loss(noise_pred.float(), target.float(), args.loss_type, "none", huber_c)
                     loss = loss.mean([1, 2, 3])
 
                     if args.min_snr_gamma:
@@ -410,9 +407,7 @@ def fn_recursive_set_mem_eff(module: torch.nn.Module):
 
                     loss = loss.mean()  # mean over batch dimension
                 else:
-                    loss = train_util.conditional_loss(
-                        noise_pred.float(), target.float(), reduction="mean", loss_type=args.loss_type, huber_c=huber_c
-                    )
+                    loss = train_util.conditional_loss(noise_pred.float(), target.float(), args.loss_type, "mean", huber_c)
 
                 accelerator.backward(loss)
                 if accelerator.sync_gradients and args.max_grad_norm != 0.0:
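
To make the new call pattern concrete, here is a minimal, hypothetical dispatcher with the same positional signature, `(pred, target, loss_type, reduction, huber_c)`. It is a sketch under that assumption, not the actual `train_util.conditional_loss` implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of a loss dispatcher with the positional signature seen
# above -- (pred, target, loss_type, reduction, huber_c) -- not the actual
# train_util.conditional_loss implementation.
def conditional_loss_sketch(pred: torch.Tensor, target: torch.Tensor,
                            loss_type: str = "l2", reduction: str = "none",
                            huber_c: float | None = None) -> torch.Tensor:
    if loss_type == "l2":
        return F.mse_loss(pred, target, reduction=reduction)
    if loss_type in ("huber", "smooth_l1"):
        # pseudo-Huber style: quadratic near zero, linear for large residuals,
        # with huber_c controlling where the transition happens
        loss = 2 * huber_c * (torch.sqrt((pred - target) ** 2 + huber_c**2) - huber_c)
        return loss.mean() if reduction == "mean" else loss
    raise ValueError(f"unknown loss_type: {loss_type}")

pred, target = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
print(conditional_loss_sketch(pred, target, "huber", "mean", huber_c=0.1))
```
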
5 changes: 2 additions & 3 deletions flux_train.py
@@ -676,9 +676,8 @@ def grad_hook(parameter: torch.Tensor):
                 target = noise - latents
 
                 # calculate loss
-                loss = train_util.conditional_loss(
-                    model_pred.float(), target.float(), reduction="none", loss_type=args.loss_type, huber_c=None
-                )
+                huber_c = train_util.get_huber_threshold_if_needed(args, timesteps, noise_scheduler)
+                loss = train_util.conditional_loss(model_pred.float(), target.float(), args.loss_type, "none", huber_c)
                 if weighting is not None:
                     loss = loss * weighting
                 if args.masked_loss or ("alpha_masks" in batch and batch["alpha_masks"] is not None):
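
For context, `get_huber_threshold_if_needed` supplies the `huber_c` value used above, possibly depending on the sampled timesteps. A hypothetical schedule of that kind might look like the following; the real helper's behaviour depends on the script arguments and may differ:

```python
import torch

# Hypothetical illustration of a timestep-dependent Huber threshold, the kind
# of schedule such a helper might apply; not the repository's actual code.
def huber_threshold_sketch(timesteps: torch.Tensor, huber_c: float = 0.1,
                           num_train_timesteps: int = 1000) -> torch.Tensor:
    # Exponential decay from 1.0 at t=0 down to huber_c at the final timestep,
    # so noisier samples tolerate larger residuals before the loss turns linear.
    alpha = -torch.log(torch.tensor(huber_c)) / (num_train_timesteps - 1)
    return torch.exp(-alpha * timesteps.float())

print(huber_threshold_sketch(torch.tensor([0, 500, 999])))
```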
