[ PROMPT_NODE_22801 ]
Memory Optimization
[ SKILL_DOCUMENTATION ]
# Memory Optimization
Complete guide to CPU offloading, gradient checkpointing, memory profiling, and advanced memory-saving strategies with bitsandbytes.
## Overview
Memory optimization techniques for fitting large models:
- **Quantization**: 50-75% reduction (covered in other docs)
- **CPU offloading**: Move weights to CPU/disk
- **Gradient checkpointing**: Trade compute for memory
- **Optimizer strategies**: 8-bit, paged optimizers
- **Mixed precision**: FP16/BF16 training
## CPU Offloading
### Basic CPU Offloading
Move parts of the model to CPU RAM when not in use.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=config,
device_map="auto", # Automatic device placement
max_memory={0: "40GB", "cpu": "100GB"} # 40GB GPU, 100GB CPU
)
```
**How it works**:
- Weights stored on CPU
- Moved to GPU only when needed for computation
- Automatically managed by `accelerate`
**Trade-off**: ~5-10× slower but enables larger models
### Multi-GPU Offloading
Distribute across multiple GPUs + CPU:
```python
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-405b-hf",
quantization_config=config,
device_map="auto",
max_memory={
0: "70GB", # GPU 0
1: "70GB", # GPU 1
2: "70GB", # GPU 2
3: "70GB", # GPU 3
"cpu": "200GB" # CPU RAM
}
)
```
**Result**: 405B model (4-bit = ~200GB) fits on 4×80GB GPUs + CPU
### Disk Offloading
For models too large even for CPU RAM:
```python
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-405b-hf",
quantization_config=config,
device_map="auto",
offload_folder="./offload", # Disk offload directory
offload_state_dict=True,
max_memory={0: "40GB", "cpu": "50GB"}
)
```
**Trade-off**: Extremely slow (~100× slower) but works
### Manual Device Mapping
For precise control:
```python
device_map = {
"model.embed_tokens": 0, # GPU 0
"model.layers.0": 0,
"model.layers.1": 0,
# ...
"model.layers.40": 1, # GPU 1
"model.layers.41": 1,
# ...
"model.layers.79": "cpu", # CPU
"model.norm": "cpu",
"lm_head": "cpu"
}
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=config,
device_map=device_map
)
```
## Gradient Checkpointing
Recompute activations during backward pass instead of storing them.
### Enable for HuggingFace Models
```python
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-13b-hf",
quantization_config=config
)
# Enable gradient checkpointing
model.gradient_checkpointing_enable()
```
**Memory savings**: ~30-50% activation memory
**Cost**: ~20% slower training
### With QLoRA
```python
from peft import prepare_model_for_kbit_training
# Enable gradient checkpointing before preparing for training
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(
model,
use_gradient_checkpointing=True
)
```
### Configure Checkpointing Frequency
```python
# Checkpoint every layer (maximum memory savings)
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})
```
### Memory Breakdown
Example: Llama 2 13B forward pass
| Component | Without Checkpointing | With Checkpointing |
|-----------|----------------------|-------------------|
| Model weights | 26 GB | 26 GB |
| Activations | 12 GB | **3 GB** |
| Gradients | 26 GB | 26 GB |
| Optimizer | 52 GB | 52 GB |
| **Total** | 116 GB | **107 GB** |
**Savings**: ~9GB for 13B model
## 8-Bit Optimizers
Use 8-bit optimizer states instead of 32-bit.
### Standard AdamW Memory
```
Optimizer memory = 2 × model_params × 4 bytes (FP32)
= 8 × model_params
Example (Llama 2 70B):
= 8 × 70B = 560 GB
```
### 8-Bit AdamW Memory
```
Optimizer memory = 2 × model_params × 1 byte (INT8)
= 2 × model_params
Example (Llama 2 70B):
= 2 × 70B = 140 GB
Savings: 420 GB (75% reduction!)
```
### Enable in Transformers
```python
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./output",
per_device_train_batch_size=4,
optim="paged_adamw_8bit", # 8-bit optimizer
learning_rate=2e-4
)
```
### Available 8-Bit Optimizers
| Optimizer | Name | Use Case |
|-----------|------|----------|
| AdamW 8-bit | `adamw_8bit` | General training |
| Paged AdamW 8-bit | `paged_adamw_8bit` | **Recommended** (prevents OOM) |
| Paged AdamW 32-bit | `paged_adamw_32bit` | High accuracy needed |
**Recommendation**: Always use `paged_adamw_8bit`
### Manual Usage
```python
import bitsandbytes as bnb
optimizer = bnb.optim.PagedAdamW8bit(
model.parameters(),
lr=1e-4,
betas=(0.9, 0.999),
eps=1e-8
)
```
## Paged Optimizers
Paged optimizers use unified memory (GPU + CPU) to prevent OOM.
### How It Works
- Optimizer states stored in paged memory
- Pages swap between GPU and CPU as needed
- Prevents hard OOM crashes
### Configuration
```python
from transformers import TrainingArguments
training_args = TrainingArguments(
optim="paged_adamw_8bit", # Enables paging
# Paging happens automatically
)
```
### Benefits
✅ No hard OOM (graceful degradation)
✅ Enables larger batch sizes
✅ Combines with 8-bit for maximum savings
### Performance
**Speed**: ~5-10% slower than standard optimizer
**Memory**: Effectively unlimited (uses CPU + swap)
## Mixed Precision Training
Use lower precision for faster training and less memory.
### BF16 Training (Recommended)
```python
training_args = TrainingArguments(
bf16=True, # BFloat16 training
bf16_full_eval=True
)
```
**Requirements**: Ampere+ GPUs (A100, H100, RTX 3090+)
**Benefits**:
- 2× faster training
- 50% less activation memory
- Better stability than FP16
### FP16 Training
```python
training_args = TrainingArguments(
fp16=True, # Float16 training
fp16_full_eval=True
)
```
**Requirements**: Volta+ GPUs (V100, A100, RTX 2080+)
**Benefits**:
- 2× faster training
- 50% less activation memory
- Slightly less stable than BF16
### Precision Comparison
| Precision | Speed | Memory | Stability | Use Case |
|-----------|-------|--------|-----------|----------|
| FP32 | 1× | 100% | Best | Debugging |
| BF16 | 2× | 50% | Good | **Recommended** |
| FP16 | 2× | 50% | Fair | V100 only |
## Complete Memory Optimization Stack
### Maximum Optimization (Llama 2 70B on Single A100 80GB)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch
# Step 1: 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-70b-hf",
quantization_config=bnb_config,
device_map="auto",
max_memory={0: "70GB", "cpu": "100GB"} # CPU offload if needed
)
# Step 2: Gradient checkpointing
model.gradient_checkpointing_enable()
# Step 3: Prepare for training
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)
# Step 4: LoRA adapters
lora_config = LoraConfig(
r=16, # Lower rank for memory
lora_alpha=32,
target_modules="all-linear",
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
# Step 5: Training arguments
training_args = TrainingArguments(
output_dir="./output",
per_device_train_batch_size=1, # Small batch
gradient_accumulation_steps=16, # Effective batch = 16
bf16=True, # Mixed precision
optim="paged_adamw_8bit", # 8-bit optimizer
max_grad_norm=0.3,
learning_rate=2e-4
)
# Memory usage: ~75GB (fits on A100 80GB!)
```
### Memory Breakdown
| Component | Memory |
|-----------|--------|
| Model (4-bit) | 35 GB |
| LoRA adapters | 0.5 GB |
| Activations (with checkpointing) | 8 GB |
| Gradients | 0.5 GB |
| Optimizer (8-bit paged) | 1 GB |
| Batch buffer | 10 GB |
| CUDA overhead | 5 GB |
| **Total** | **~75 GB** |
## Memory Profiling
### PyTorch Memory Profiler
```python
import torch
# Start profiling
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
# Your code here
model = AutoModelForCausalLM.from_pretrained(...)
model.generate(...)
# Check memory
print(f"Allocated: {torch.cuda.memory_allocated()/1e9:.2f} GB")
print(f"Peak: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
print(f"Cached: {torch.cuda.memory_reserved()/1e9:.2f} GB")
```
### Detailed Memory Summary
```python
print(torch.cuda.memory_summary())
```
Output:
```
|===========================================================================|
| PyTorch CUDA memory summary |
|---------------------------------------------------------------------------|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 45.2 GB | 52.3 GB | 156.8 GB | 111.6 GB |
| Active memory | 45.2 GB | 52.3 GB | 156.8 GB | 111.6 GB |
| GPU reserved | 46.0 GB | 54.0 GB | 54.0 GB | 8.0 GB |
|===========================================================================|
```
### Track Memory During Training
```python
from transformers import TrainerCallback
class MemoryCallback(TrainerCallback):
def on_step_end(self, args, state, control, **kwargs):
if state.global_step % 10 == 0:
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"Step {state.global_step}: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")
trainer = Trainer(
model=model,
args=training_args,
callbacks=[MemoryCallback()]
)
```
## Troubleshooting OOM
### Diagnostic Steps
1. **Check current memory**:
```python
print(torch.cuda.memory_summary())
```
2. **Try smaller batch**:
```python
per_device_train_batch_size=1
```
3. **Enable gradient checkpointing**:
```python
model.gradient_checkpointing_enable()
```
4. **Use 8-bit optimizer**:
```python
optim="paged_adamw_8bit"
```
5. **Add CPU offloading**:
```python
max_memory={0: "70GB", "cpu": "100GB"}
```
6. **Reduce LoRA rank**:
```python
r=8 # Instead of 16
```
### Emergency: Last Resort
```python
# Absolute minimum memory config
model = AutoModelForCausalLM.from_pretrained(
"model-name",
quantization_config=BitsAndBytesConfig(load_in_4bit=True),
device_map="auto",
max_memory={0: "20GB", "cpu": "200GB"},
offload_folder="./offload"
)
model.gradient_checkpointing_enable()
training_args = TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=64,
bf16=True,
optim="paged_adamw_8bit"
)
```
**Result**: Extremely slow but will probably work
## Best Practices
1. **Start with quantization**: 4-bit gives 75% savings
2. **Add gradient checkpointing**: 30-50% activation savings
3. **Use 8-bit optimizer**: 75% optimizer savings
4. **Enable mixed precision**: 50% activation savings
5. **CPU offload only if needed**: Slow but enables larger models
6. **Profile regularly**: Identify memory bottlenecks
7. **Test with small batches**: Prevent OOM during development
## Memory Estimation Formula
```
Total Memory = Model + Activations + Gradients + Optimizer + Buffer
Model = Parameters × Bytes per param
Activations = Batch × Seq × Hidden × Layers × Bytes per activation
Gradients = Parameters × Bytes per gradient
Optimizer = Parameters × Optimizer factor × Bytes
Buffer = 2-5 GB (CUDA overhead)
```
**With all optimizations**:
```
Model = Parameters × 0.5 (4-bit)
Activations = Activations × 0.3 (checkpointing + BF16)
Gradients = Parameters × 0.5 (LoRA only)
Optimizer = Parameters × 2 (8-bit)
```
## References
- PyTorch memory management: https://pytorch.org/docs/stable/notes/cuda.html
- Accelerate device_map: https://huggingface.co/docs/accelerate/usage_guides/big_modeling
- Gradient checkpointing: https://pytorch.org/docs/stable/checkpoint.html
- bitsandbytes optimizers: https://github.com/bitsandbytes-foundation/bitsandbytes#optimizer
Source: claude-code-templates (MIT). See About Us for full credits.