[ INSTALL COMMAND ]
npx claude-code-templates@latest --skill fine-tuning-unsloth/references
[ PROMPT_NODE_22435 ]

Llms Full

[ SKILL_DOCUMENTATION ]
" "
" "
" "{{ '>>> User: ' + message['content'] + 'n' }}" "
" "{{ '>>> Assistant: ' + message['content'] + eos_token + 'n' }}" "
" "
" "
" "{{ '>>> Assistant: ' }}" "
" unsloth_eos_token = "eos_token" tokenizer = get_chat_template( tokenizer, chat_template = (unsloth_template, unsloth_eos_token,), # You must provide a template and EOS token mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style map_eos_token = True, # Maps to instead ) ``` # Quantization-Aware Training (QAT) Quantize models to 4-bit with Unsloth and PyTorch to recover accuracy. In collaboration with PyTorch, we're introducing QAT (Quantization-Aware Training) in Unsloth to enable **trainable quantization** that recovers as much accuracy as possible. This results in significantly better model quality compared to standard 4-bit naive quantization. QAT can recover up to **70% of the lost accuracy** and achieve a **1–3%** model performance improvement on benchmarks such as GPQA and MMLU Pro. > **Try QAT with our free** [**Qwen3 (4B) notebook**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)_Instruct-QAT.ipynb) ### :books:Quantization {% columns %} {% column width="50%" %} Naively quantizing a model is called **post-training quantization** (PTQ). For example, assume we want to quantize to 8bit integers: 1. Find `max(abs(W))` 2. Find `a = 127/max(abs(W))` where a is int8's maximum range which is 127 3. Quantize via `qW = int8(round(W * a))` {% endcolumn %} {% column width="50%" %}
{% endcolumn %}
{% endcolumns %}

Dequantizing back to 16 bits simply reverses the operation: `float16(qW) / a`. Post-training quantization (PTQ) can greatly reduce storage and inference costs, but it often degrades accuracy when high-precision values are represented with fewer bits, especially at 4-bit or lower. One way to address this is to use our [**dynamic GGUF quants**](https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs), which use a calibration dataset to adjust the quantization procedure so that important weights are prioritized. The other way is to make **quantization smarter, by making it trainable or learnable**!

### :fire:Smarter Quantization

To enable smarter quantization, we collaborated with the [TorchAO](https://github.com/pytorch/ao) team to add **Quantization-Aware Training (QAT)** directly inside Unsloth, so you can now fine-tune models in Unsloth and then export them directly to 4-bit QAT format with accuracy improvements!

In fact, **QAT recovers 66.9%** of the lost accuracy for Gemma3-4B on GPQA and increases the raw accuracy by +1.0%. Gemma3-12B on BBH recovers 45.5% and **increases the raw accuracy by +2.1%**. QAT adds no extra overhead during inference and uses the same disk space and memory as naive quantization! So you get all the benefits of low-bit quantization, but with much higher accuracy!

### :mag:Quantization-Aware Training

QAT simulates the true quantization procedure by "**fake quantizing**" weights and optionally activations during training. This typically means rounding high-precision values to quantized ones (while staying in a high-precision dtype, e.g. bfloat16) and then immediately dequantizing them.

TorchAO enables QAT by (1) inserting fake quantize operations into linear layers, and (2) transforming the fake quantize operations into actual quantize and dequantize operations after training to make the model inference ready. Step 1 enables us to train a more accurate quantization representation.
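
To make the "fake quantize" idea concrete, here is a minimal pure-Python sketch of the round-then-dequantize step, built on the absmax int8 recipe from the Quantization section. It is an illustration only, not TorchAO's implementation: the helper names (`absmax_scale`, `quantize`, `dequantize`, `fake_quantize`) and the symmetric 4-bit range of ±7 are assumptions made for this example.

```python
def absmax_scale(weights, qmax=127):
    """Scale a = qmax / max(abs(W)); qmax = 127 matches the int8 steps above."""
    largest = max(abs(w) for w in weights)
    return qmax / largest if largest > 0 else 1.0


def quantize(weights, scale, qmax=127):
    """Round each scaled weight to an integer and clamp it to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(w * scale))) for w in weights]


def dequantize(qweights, scale):
    """Map integer codes back to floating point by dividing by the scale."""
    return [q / scale for q in qweights]


def fake_quantize(weights, qmax=127):
    """Quantize and immediately dequantize, so values stay in floating point
    but carry the rounding error that real quantization would introduce."""
    scale = absmax_scale(weights, qmax)
    return dequantize(quantize(weights, scale, qmax), scale)


weights = [0.5, -1.0, 0.003, 0.9]

int8_view = fake_quantize(weights, qmax=127)  # 8-bit style rounding
int4_view = fake_quantize(weights, qmax=7)    # assumed symmetric 4-bit range

err8 = max(abs(w - v) for w, v in zip(weights, int8_view))
err4 = max(abs(w - v) for w, v in zip(weights, int4_view))

print("int8 fake-quantized:", int8_view, "max error:", err8)
print("int4 fake-quantized:", int4_view, "max error:", err4)
```

During QAT the forward pass sees values like `int8_view` or `int4_view` instead of the original weights, which is what lets training adapt to the rounding error before the model is actually converted.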

### :sparkles:QAT + LoRA finetuning

QAT in Unsloth can additionally be combined with LoRA fine-tuning to get the best of both worlds: significantly reduced storage and compute requirements during training, while mitigating quantization degradation! We support multiple methods via `qat_scheme`, including `fp8-int4`, `fp8-fp8`, `int8-int4` and `int4`. We also plan to add custom definitions for QAT in a follow-up release!

{% code overflow="wrap" %}
```python
from unsloth import FastLanguageModel

# Load the base model in 16-bit precision
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length = 2048,
    load_in_16bit = True,
)

# Attach LoRA adapters and enable QAT
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,

    # We support fp8-int4, fp8-fp8, int8-int4, int4
    qat_scheme = "int4",
)
```
{% endcode %}

### :teapot:Exporting QAT models

After fine-tuning in Unsloth, you can call `model.save_pretrained_torchao` to save your trained model using TorchAO’s PTQ format. You can also upload these to the Hugging Face Hub! We support any config, and we plan to add text-based methods as well, to make the process simpler for everyone!

But first, we have to prepare the QAT model for the final conversion step via:

{% code overflow="wrap" %}
```python
from torchao.quantization import quantize_
from torchao.quantization.qat import QATConfig

# Swap the fake quantize ops for real quantize/dequantize ops
quantize_(model, QATConfig(step = "convert"))
```
{% endcode %}

Now you can select which QAT style you want:

{% code overflow="wrap" %}
```python
# Use the exact same config as QAT (convenient function)
model.save_pretrained_torchao(
    model, "tokenizer",
    torchao_config = model._torchao_config.base_config,
)

# Int4 QAT
from torchao.quantization import Int4WeightOnlyConfig
model.save_pretrained_torchao(
    model, "tokenizer",
    torchao_config = Int4WeightOnlyConfig(),
)

# Int8 QAT
from torchao.quantization import Int8DynamicActivationInt8WeightConfig
model.save_pretrained_torchao(
    model, "tokenizer",
    torchao_config = Int8DynamicActivationInt8WeightConfig(),
)
```
{% endcode %}

You can then run the merged, lower-precision QAT model in vLLM, Unsloth and other systems for inference! All of these steps are also in our [Qwen3-4B QAT Colab notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)_Instruct-QAT.ipynb)!

### :teapot:Quantizing models without training

You can also call `model.save_pretrained_torchao` directly without doing any QAT! This is simply PTQ, i.e. naive quantization.

For example, saving to dynamic float8 format looks like this:

{% code overflow="wrap" %}
```python
# Float8
from torchao.quantization import PerRow
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig

torchao_config = Float8DynamicActivationFloat8WeightConfig(granularity = PerRow())
model.save_pretrained_torchao(torchao_config = torchao_config)
```
{% endcode %}

### :mobile_phone:ExecuTorch - QAT for mobile deployment

{% columns %}
{% column %}
With Unsloth and TorchAO’s QAT support, you can also fine-tune a model in Unsloth and seamlessly export it to [ExecuTorch](https://github.com/pytorch/executorch) (PyTorch’s solution for on-device inference) and deploy it directly on mobile. See an example in action [here](https://huggingface.co/metascroy/Qwen3-4B-int8-int4-unsloth) with more detailed workflows on the way!

**Announcement coming soon!**
{% endcolumn %}

{% column %}
{% endcolumn %}
{% endcolumns %}

### :sunflower:How to enable QAT

Update Unsloth to the latest version, and also install the latest TorchAO!

Then **try QAT with our free** [**Qwen3 (4B) notebook**](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_(4B)_Instruct-QAT.ipynb)

{% code overflow="wrap" %}
```bash
pip install --upgrade --no-cache-dir --force-reinstall unsloth unsloth_zoo
pip install torchao==0.14.0 fbgemm-gpu-genai==1.3.0
```
{% endcode %}

### :person_tipping_hand:Acknowledgements

Huge thanks to the entire PyTorch and TorchAO team for their help and collaboration! Extreme thanks to Andrew Or, Jerry Zhang, Supriya Rao, Scott Roy and Mergen Nachin for their help in many discussions on QAT, and for helping to integrate it into Unsloth! Thanks to the ExecuTorch team as well!

# Unsloth Environment Flags

Advanced flags which might be useful if you see breaking finetunes, or you want to turn stuff off.

| Environment variable | Purpose |
| -------------------- | ------- |
| `os.environ["UNSLOTH_RETURN_LOGITS"] = "1"` | Forcibly returns logits - useful for evaluation if logits are needed. |
| `os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"` | Disables auto compiler. Could be useful to debug incorrect finetune results. |
| `os.environ["UNSLOTH_DISABLE_FAST_GENERATION"] = "1"` | Disables fast generation for generic models. |
| `os.environ["UNSLOTH_ENABLE_LOGGING"] = "1"` | Enables auto compiler logging - useful to see which functions are compiled or not. |
| `os.environ["UNSLOTH_FORCE_FLOAT32"] = "1"` | On float16 machines, use float32 and not float16 mixed precision. Useful for Gemma 3. |
| `os.environ["UNSLOTH_STUDIO_DISABLED"] = "1"` | Disables extra features. |
| `os.environ["UNSLOTH_COMPILE_DEBUG"] = "1"` | Turns on extremely verbose `torch.compile` logs. |
| `os.environ["UNSLOTH_COMPILE_MAXIMUM"] = "0"` | Enables maximum `torch.compile` optimizations - not recommended. |
| `os.environ["UNSLOTH_COMPILE_IGNORE_ERRORS"] = "1"` | Can turn this off to enable fullgraph parsing. |
| `os.environ["UNSLOTH_FULLGRAPH"] = "0"` | Enables `torch.compile` fullgraph mode. |
| `os.environ["UNSLOTH_DISABLE_AUTO_UPDATES"] = "1"` | Forces no updates to `unsloth-zoo`. |
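
As a minimal sketch of how these flags might be used: they are ordinary environment variables, so the example below sets two of them at the very top of a script. Setting them before `unsloth` is imported is an assumption made here to be safe, not something the table above states, and the import itself is left commented out so the snippet runs without Unsloth installed.

```python
import os

# Set the flags first; doing so before importing unsloth is an assumption
# chosen so that the values are already in place whenever they are read.
os.environ["UNSLOTH_RETURN_LOGITS"] = "1"    # return logits, e.g. for evaluation
os.environ["UNSLOTH_COMPILE_DISABLE"] = "1"  # disable the auto compiler while debugging

# from unsloth import FastLanguageModel  # import only after the flags are set

# Quick check that the values are what we expect
active_flags = {k: v for k, v in os.environ.items() if k.startswith("UNSLOTH_")}
print(active_flags)
```

The values are strings (`"1"` or `"0"`), matching the table above.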

Another possibility is that the model files we uploaded are corrupted, although this is unlikely. Try the following:

```python
from unsloth import FastVisionModel

# Load the original upstream checkpoint by its exact name
model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    use_exact_model_name = True,
)
```

# Continued Pretraining

Also known as continued finetuning. Unsloth allows you to continually pretrain so a model can learn a new language.

* The [text completion notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_(7B)-Text_Completion.ipynb) is for continued pretraining/raw text.
* The [continued pretraining notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Mistral_v0.3_(7B)-CPT.ipynb) is for learning another language.

You can read more about continued pretraining and our release in our [blog post](https://unsloth.ai/blog/contpretraining).

## What is Continued Pretraining?

Continued or continual pretraining (CPT) is necessary to “steer” the language model to understand new domains of knowledge, or out-of-distribution domains. Base models like Llama-3 8b or Mistral 7b are first pretrained on gigantic datasets of trillions of tokens (15 trillion for Llama-3, for example).

But sometimes these models have not been well trained on other languages, or on specific text domains like law or medicine. In those cases, continued pretraining (CPT) is necessary to make the language model learn new tokens or datasets.

## Advanced Features:

### Loading LoRA adapters for continued finetuning

If you saved a LoRA adapter through Unsloth, you can also continue training using your LoRA weights. Note that the optimizer state will be reset. To also load the optimizer state when continuing finetuning, see the next section.

```python
from unsloth import FastLanguageModel

# max_seq_length, dtype and load_in_4bit must be defined before this call
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "LORA_MODEL_NAME",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
trainer = Trainer(...)
trainer.train()
```

### Continued Pretraining & Finetuning the `lm_head` and `embed_tokens` matrices

Add `lm_head` and `embed_tokens` to the target modules. On Colab, you may run out of memory with Llama-3 8b. If so, just add `lm_head`.

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",
                      "lm_head", "embed_tokens",],
    lora_alpha = 16,
)
```

Then use 2 different learning rates - a 2-10x smaller one for the `lm_head` or `embed_tokens`, like so:

```python
from unsloth import UnslothTrainer, UnslothTrainingArguments

trainer = UnslothTrainer(
    ....
    args = UnslothTrainingArguments(
        ....
        learning_rate = 5e-5,
        embedding_learning_rate = 5e-6, # 2-10x smaller than learning_rate
    ),
)
```

# Unsloth Benchmarks

Unsloth recorded benchmarks on NVIDIA GPUs.

* For more detailed benchmarks, read our [Llama 3.3 Blog](https://unsloth.ai/blog/llama3-3).
* Benchmarking of Unsloth was also conducted by [🤗Hugging Face](https://huggingface.co/blog/unsloth-trl).

Tested on H100 and [Blackwell](https://docs.unsloth.ai/basics/fine-tuning-llms-with-blackwell-rtx-50-series-and-unsloth) GPUs. We tested using the Alpaca Dataset, a batch size of 2, gradient accumulation steps of 4, rank = 32, and applied QLoRA on all linear layers (q, k, v, o, gate, up, down):
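
| Model | VRAM | 🦥Unsloth speed | 🦥VRAM reduction | 🦥Longer context | 😊Hugging Face + FA2 |
| ----- | ---- | --------------- | ---------------- | ---------------- | -------------------- |
| Llama 3.3 (70B) | 80GB | 2x | >75% | 13x longer | 1x |
| Llama 3.1 (8B) | 80GB | 2x | >70% | 12x longer | 1x |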

## Context length benchmarks

{% hint style="info" %}
The more data you have, the less VRAM Unsloth uses due to our [gradient checkpointing](https://unsloth.ai/blog/long-context) algorithm + Apple's CCE algorithm!
{% endhint %}

### **Llama 3.1 (8B) max. context length**

We tested Llama 3.1 (8B) Instruct and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads.

| GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
| -------- | ------------------------ | ------------------ |
| 8 GB     | 2,972                    | OOM                |
| 12 GB    | 21,848                   | 932                |
| 16 GB    | 40,724                   | 2,551              |
| 24 GB    | 78,475                   | 5,789              |
| 40 GB    | 153,977                  | 12,264             |
| 48 GB    | 191,728                  | 15,502             |
| 80 GB    | 342,733                  | 28,454             |

### **Llama 3.3 (70B) max. context length**

We tested Llama 3.3 (70B) Instruct on an 80GB A100 and did 4bit QLoRA on all linear layers (Q, K, V, O, gate, up and down) with rank = 32 with a batch size of 1. We padded all sequences to a certain maximum sequence length to mimic long context finetuning workloads.

| GPU VRAM | 🦥Unsloth context length | Hugging Face + FA2 |
| -------- | ------------------------ | ------------------ |
| 48 GB    | 12,106                   | OOM                |
| 80 GB    | 89,389                   | 6,916              |
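
The "longer context" multipliers can be read straight off these tables. The short sketch below divides the Unsloth context length by the Hugging Face + FA2 figure for each VRAM size, using only the numbers above; the dictionary layout and the `context_ratio` helper are illustrative choices, and OOM entries are represented as `None`.

```python
# (Unsloth context length, Hugging Face + FA2 context length) per GPU VRAM in GB.
# None marks an out-of-memory (OOM) entry in the tables above.
llama_3_1_8b = {
    8: (2_972, None),
    12: (21_848, 932),
    16: (40_724, 2_551),
    24: (78_475, 5_789),
    40: (153_977, 12_264),
    48: (191_728, 15_502),
    80: (342_733, 28_454),
}
llama_3_3_70b = {
    48: (12_106, None),
    80: (89_389, 6_916),
}


def context_ratio(unsloth_tokens, baseline_tokens):
    """Return how many times longer the Unsloth context is, or None on OOM."""
    if baseline_tokens is None:
        return None
    return unsloth_tokens / baseline_tokens


ratios_8b = {gb: context_ratio(*pair) for gb, pair in llama_3_1_8b.items()}
ratios_70b = {gb: context_ratio(*pair) for gb, pair in llama_3_3_70b.items()}

for name, ratios in (("Llama 3.1 (8B)", ratios_8b), ("Llama 3.3 (70B)", ratios_70b)):
    for gb, ratio in ratios.items():
        label = "baseline OOM" if ratio is None else f"{ratio:.1f}x longer"
        print(f"{name} @ {gb} GB: {label}")
```

At 80 GB the ratios come out at roughly 12x for Llama 3.1 (8B) and roughly 13x for Llama 3.3 (70B), which lines up with the "Longer context" column in the summary table.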
Source: claude-code-templates (MIT). See About Us for full credits.