
Added Support for Apple Silicon #1289

Open · wants to merge 5 commits into main
Conversation

@shashikanth-a commented Nov 14, 2024

  • Unoptimized
  • No GGUF support yet.
  • Build Triton and bitsandbytes from source (a rough setup sketch follows this list).
  • `cmake -DCOMPUTE_BACKEND=mps -S .` for the bitsandbytes build.
  • `pip install unsloth-zoo==2024.11.4`
  • `pip install xformers==0.0.25`
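
For reference, a rough end-to-end setup sketch assembled from the steps above and the later comments in this thread; the repo URLs, the `python/` subdirectory for Triton, and the `cmake --build` step are assumptions, not spelled out in this PR:

# Triton from source (editable install; the Python package is assumed to live under python/)
git clone https://github.com/triton-lang/triton.git
pip install -e triton/python

# bitsandbytes from source with the MPS backend
git clone https://github.com/bitsandbytes-foundation/bitsandbytes.git
cd bitsandbytes
cmake -DCOMPUTE_BACKEND=mps -S .
cmake --build .        # assumed build step after configuring
pip install -e .
cd ..

# pinned dependencies from the description above
pip install unsloth-zoo==2024.11.4
pip install xformers==0.0.25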

- No gguf support yet.
- Build Triton and bitsandbytes from source
- `cmake -DCOMPUTE_BACKEND=hip -S .` for bitsandbytes building
@yukiarimo

Is this working?

@shimmyshimmer (Collaborator)

Hi there, thank you for this! We will need a bit more time to review :)

@mkemka commented Nov 21, 2024

Hi @shashikanth-a - thank you for this. Could you please provide information about the environment and package versions you used for development?

@yukiarimo

Hey, does this work with the newly released vision support?

@mkemka commented Nov 23, 2024

Currently I can run this if:

  • The decorators "@torch.compile(fullgraph = False, dynamic = True, options = torch_compile_options)" are removed from the Llama and Gemma files (see the grep sketch after this list).
  • I fine-tune llama-3-8b (llama-3.2 1B and 3B throw an error due to RoPE for some reason).
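
A quick way to locate the decorators in question before removing them by hand (the `unsloth/models/` path is an assumption about the repo layout):

# list every torch.compile decorator in the model sources (llama, gemma, ...)
grep -rn "@torch.compile" unsloth/models/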

- lazy loading of model
- minor refactoring
- optimizers and lr schedulers
- gc
- should improve memory consumption
@mkemka commented Nov 26, 2024

With the changes I can run this out of the box with the steps outlined above:

  • Build Triton from source and `pip install -e .`
  • Build bnb with `cmake -DCOMPUTE_BACKEND=mps -S .` and `pip install -e .`

On an M4 Pro I'm getting around 100 t/s for llama3-8b. I can confirm it will also now work with llama-3.2-3b.

@shimmyshimmer (Collaborator)

Thanks a lot - would anyone be so kind as to benchmark this against MLX itself and share the results?

Time taken, amount of VRAM, context length, whether the losses match - of course that's a lot, so just the time and a check that the losses match would be more than helpful. Thank you so much! :)

@mkemka commented Jan 3, 2025

Sorry for the delay.
The test fine-tunes with the above PR and compares it against an out-of-the-box MLX LoRA fine-tune (mlx_lm.lora) using the same model and the same dataset, on an M4 Pro Mac with 48 GB.
The dataset is mlx-community/wikisql, which I converted from the MLX format back to the normal HF format for Unsloth.

Unsloth run

python unsloth-cli.py --model_name "unsloth/llama-3-8b" --max_seq_length 8192 --dtype None --load_in_4bit --r 4 --lora_alpha 4 --lora_dropout 0.1 --bias "none" --use_gradient_checkpointing "unsloth" --random_state 3407 --use_rslora --per_device_train_batch_size 1 --gradient_accumulation_steps 8 --warmup_steps 5 --max_steps 100 --learning_rate 2e-6 --logging_steps 1 --optim "adamw_8bit" --weight_decay 0.005 --lr_scheduler_type "linear" --seed 3407 --output_dir "outputs" --report_to "tensorboard" --save_model --save_path "model" --dataset data/

Data is formatted and ready!
Trainable parameters: 0.021% (1.704M/8030.261M)
Starting training..., iters: 100
Iter 1: Val loss 1.889, Val took 24.562s
Iter 10: Train loss 1.848, Learning Rate 1.200e-06, It/sec 0.474, Tokens/sec 131.368, Trained Tokens 2769, Peak mem 17.353 GB
Iter 20: Train loss 1.827, Learning Rate 2.000e-06, It/sec 0.472, Tokens/sec 128.186, Trained Tokens 5483, Peak mem 17.353 GB
Iter 30: Train loss 1.875, Learning Rate 2.000e-06, It/sec 0.492, Tokens/sec 134.175, Trained Tokens 8212, Peak mem 17.353 GB
Iter 40: Train loss 1.841, Learning Rate 2.000e-06, It/sec 0.494, Tokens/sec 132.973, Trained Tokens 10903, Peak mem 17.353 GB
Iter 50: Train loss 1.810, Learning Rate 2.000e-06, It/sec 0.478, Tokens/sec 131.516, Trained Tokens 13654, Peak mem 17.353 GB
Iter 60: Train loss 1.804, Learning Rate 2.000e-06, It/sec 0.437, Tokens/sec 119.466, Trained Tokens 16387, Peak mem 17.353 GB
Iter 70: Train loss 1.835, Learning Rate 2.000e-06, It/sec 0.480, Tokens/sec 126.941, Trained Tokens 19030, Peak mem 17.353 GB
Iter 80: Train loss 1.723, Learning Rate 2.000e-06, It/sec 0.435, Tokens/sec 115.940, Trained Tokens 21693, Peak mem 17.353 GB
Iter 90: Train loss 1.743, Learning Rate 2.000e-06, It/sec 0.427, Tokens/sec 115.289, Trained Tokens 24393, Peak mem 17.353 GB
Iter 100: Val loss 1.600, Val took 26.121s
Iter 100: Train loss 1.724, Learning Rate 2.000e-06, It/sec 2.737, Tokens/sec 709.761, Trained Tokens 26986, Peak mem 17.353 GB

MLX Run

mlx_lm.lora \
  --model "unsloth/llama-3-8b" \
  --train \
  --data "mlx-community/wikisql" \
  --iters 100 \
  --batch-size 1 \
  --learning-rate 2e-6 \
  --weight-decay 0.005 \
  --seed 3407 \
  --adapter-path "outputs" \
  --grad-checkpoint \
  --max-seq-length 8192
Loading datasets
Loading Hugging Face dataset mlx-community/wikisql.
Training
Trainable parameters: 0.042% (3.408M/8030.261M)
Starting training..., iters: 100
Iter 1: Val loss 2.931, Val took 9.261s
Iter 10: Train loss 3.096, Learning Rate 2.000e-06, It/sec 1.238, Tokens/sec 92.346, Trained Tokens 746, Peak mem 15.299 GB
Iter 20: Train loss 3.045, Learning Rate 2.000e-06, It/sec 1.341, Tokens/sec 99.536, Trained Tokens 1488, Peak mem 15.326 GB
Iter 30: Train loss 2.504, Learning Rate 2.000e-06, It/sec 1.217, Tokens/sec 97.619, Trained Tokens 2290, Peak mem 15.330 GB
Iter 40: Train loss 2.347, Learning Rate 2.000e-06, It/sec 1.330, Tokens/sec 105.073, Trained Tokens 3080, Peak mem 15.330 GB
Iter 50: Train loss 2.430, Learning Rate 2.000e-06, It/sec 1.282, Tokens/sec 99.348, Trained Tokens 3855, Peak mem 15.330 GB
Iter 60: Train loss 2.148, Learning Rate 2.000e-06, It/sec 1.185, Tokens/sec 103.256, Trained Tokens 4726, Peak mem 15.330 GB
Iter 70: Train loss 1.879, Learning Rate 2.000e-06, It/sec 1.173, Tokens/sec 104.301, Trained Tokens 5615, Peak mem 15.571 GB
Iter 80: Train loss 1.972, Learning Rate 2.000e-06, It/sec 1.229, Tokens/sec 94.750, Trained Tokens 6386, Peak mem 15.571 GB
Iter 90: Train loss 1.845, Learning Rate 2.000e-06, It/sec 1.234, Tokens/sec 103.314, Trained Tokens 7223, Peak mem 15.571 GB
Iter 100: Val loss 1.641, Val took 7.520s
Iter 100: Train loss 1.715, Learning Rate 2.000e-06, It/sec 17.545, Tokens/sec 1336.898, Trained Tokens 7985, Peak mem 15.571 GB
Iter 100: Saved adapter weights to outputs/adapters.safetensors and outputs/0000100_adapters.safetensors.
Saved final weights to outputs/adapters.safetensors.

I can already see that the parameters need to be reviewed, since the trainable-parameter percentage differs between the two runs (0.021% for Unsloth vs 0.042% for MLX).
If this direction is useful I can keep looking at it.

@noaebbot commented Jan 8, 2025

Was able to make this work - thanks! However, unsloth-zoo==2024.11.4 did not work for me (some functions were missing); I was able to make it run with version 2024.11.6.
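
In other words, the pin that worked in this report:

pip install unsloth-zoo==2024.11.6   # version reported above to include the missing functions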
