max_grad_norm doesn't appear to be clipping gradients #2214

Open
6 of 8 tasks
DevonPeroutky opened this issue Dec 22, 2024 · 1 comment
DevonPeroutky commented Dec 22, 2024

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I'm fine-tuning on a single GPU (no DeepSpeed configuration). I set max_grad_norm: 1, so I would expect the gradient norm to be clipped at 1 and never exceed it.

Current behaviour

I'm seeing large gradient-norm spikes, well above 1, in Weights & Biases.

[Image: Screenshot 2024-12-22 at 6:47:39 PM]

Steps to reproduce

Train a QLoRA adapter with axolotl using max_grad_norm: 1 and observe that the reported gradient norms are greater than 1.

Config yaml

# -----------------------------------
# ---- Base Model Configuration -----
# -----------------------------------
base_model: meta-llama/Meta-Llama-3.1-8B
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: true
strict: false
chat_template: llama3

# -----------------------------------
# ------------ Dataset -------------
# -----------------------------------
datasets:
  # - path: databricks/databricks-dolly-15k
  - path: /home/ubuntu/kindo-base/notebooks/truncated_dolly_15K
    ds_type: json
    type:
      system_prompt: ""
      field_system: system
      field_instruction: instruction
      field_input: context
      field_output: response
      format: "[INST] {instruction} {input} [/INST]"
      no_input_format: "[INST] {instruction} [/INST]"
    train_split: train
dataset_prepared_path: last_run_prepared

# Fraction of the data to hold out for validation, across all datasets
val_set_size: .05
output_dir: ./outputs/qlora-out

# -----------------------------------
# ----------- Lora Config -----------
# -----------------------------------
adapter: qlora
lora_model_dir:

lora_r: 128
lora_alpha: 32 # alpha = r/4 is in the qlora paper
lora_dropout: 0.05
lora_target_modules:
lora_target_linear: true
lora_fan_in_fan_out:

lora_modules_to_save:
  - embed_tokens
  - lm_head

# -----------------------------------
# ------- Training parameters -------
# -----------------------------------
sequence_len: 4096
sample_packing: false
pad_to_sequence_len: true
gradient_accumulation_steps: 8
micro_batch_size: 8 
num_epochs: 2

optimizer: paged_adamw_8bit
max_grad_norm: 1.0

# Learning Rate
lr_scheduler: cosine
learning_rate: 0.0004
warmup_ratio: 0.05

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

evals_per_epoch: 3
eval_batch_size: 2
eval_table_size:
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|end_of_text|>"
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens: # these are delimiters
  - "<|im_start|>"
  - "<|im_end|>"

# -----------------------------------
# ------- Liger Integration ---------
# -----------------------------------
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
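
For reference, here is a minimal sketch of how the `format` and `no_input_format` strings in the dataset section above would render. It assumes the placeholders are filled like Python `str.format` templates, which is an assumption about axolotl's behaviour rather than something confirmed in this thread. The helper `render_prompt` and both sample records are hypothetical.

```python
# Format strings copied from the dataset section of the config.
FORMAT = "[INST] {instruction} {input} [/INST]"
NO_INPUT_FORMAT = "[INST] {instruction} [/INST]"


def render_prompt(record):
    """Hypothetical helper: choose a template based on whether `context` is empty."""
    instruction = record["instruction"]
    context = record.get("context", "")
    if context:
        # field_input is mapped to the `context` column in the config.
        return FORMAT.format(instruction=instruction, input=context)
    return NO_INPUT_FORMAT.format(instruction=instruction)


# Made-up records using the field names from the config (instruction/context/response).
with_context = {
    "instruction": "Summarize the text.",
    "context": "Axolotl is a fine-tuning tool.",
    "response": "...",
}
without_context = {"instruction": "Name a prime number.", "context": "", "response": "..."}

print(render_prompt(with_context))
print(render_prompt(without_context))
```

Note that `[INST] ... [/INST]` is the Llama 2 style instruction wrapper, while the config also sets `chat_template: llama3`. Whether that mismatch contributes to the large gradients is not established here.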

Possible solution

  1. Are the gradient norms reported to Weights & Biases measured before clipping?
  2. The gradient norms are extremely high. Is it possible that the way I'm mapping the dataset to the format Llama expects is wrong?
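
On point 1, a minimal pure-Python sketch of clipping by global norm may help. It mirrors the documented behaviour of `torch.nn.utils.clip_grad_norm_`, which scales the gradients in place but returns the total norm computed before scaling. The function name `clip_grad_norm` and the toy gradient values are illustrative only. If the trainer logs that return value as `grad_norm` (an assumption worth checking against the installed Transformers version), values above 1 in the chart would not by themselves mean clipping is being skipped.

```python
import math


def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Scale `grads` in place so their global L2 norm is at most `max_norm`.

    Returns the norm measured BEFORE scaling, mirroring the documented
    return value of torch.nn.utils.clip_grad_norm_.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    # Only ever shrink gradients; never scale them up.
    scale = min(1.0, max_norm / (total_norm + eps))
    for i, g in enumerate(grads):
        grads[i] = g * scale
    return total_norm


grads = [30.0, 40.0]  # toy gradient vector with L2 norm 50
reported = clip_grad_norm(grads, max_norm=1.0)
post_clip = math.sqrt(sum(g * g for g in grads))

print(f"returned (pre-clip) norm: {reported:.2f}")   # 50.00
print(f"norm after clipping:      {post_clip:.2f}")  # 1.00
```

To confirm what the optimizer actually sees, the post-clip norm would need to be computed separately, after the clipping call and before the optimizer step.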

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10.13

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.
DevonPeroutky added the "bug" label on Dec 22, 2024.
DevonPeroutky changed the title from "max_grad_norm doesn't appear to be respected." to "max_grad_norm doesn't appear to be clipping gradients" on Dec 22, 2024.

NanoCode012 (Collaborator) commented

For this config option, we just pass the value through to the Transformers Trainer, which handles the clipping: https://huggingface.co/docs/transformers/v4.47.1/en/main_classes/trainer#transformers.TrainingArguments.max_grad_norm

Should this be reported to the upstream repo?
