# GPTQ Calibration Guide
A complete guide to calibration data selection and the GPTQ quantization process.
## Calibration Data Selection
### Why calibration matters
Calibration data is used to:
1. **Compute weight importance** (Hessian matrix)
2. **Minimize quantization error** for important weights
3. **Preserve model accuracy** after quantization
**Impact**:
- Good calibration: <1.5% perplexity increase
- Poor calibration: 5-10% perplexity increase
- No calibration: Model may output gibberish
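To make point 1 concrete, here is a rough sketch (illustrative only, not the auto_gptq internals) of how calibration activations become the per-layer Hessian that GPTQ uses to weight quantization error:
```python
import torch

def accumulate_hessian(layer_inputs):
    """Approximate Hessian H ~ 2 * X X^T, averaged over calibration tokens.

    layer_inputs: tensors of shape (seq_len, in_features) captured from a
    linear layer's forward hook while calibration samples pass through it.
    """
    in_features = layer_inputs[0].shape[-1]
    H = torch.zeros(in_features, in_features)
    n_tokens = 0
    for x in layer_inputs:
        x = x.reshape(-1, in_features).float()
        H += 2.0 * (x.t() @ x)        # accumulate input outer products
        n_tokens += x.shape[0]
    return H / max(n_tokens, 1)       # average over all calibration tokens
```
Roughly, inputs that are large or frequent in the calibration set make the corresponding weight columns more "important", which is why unrepresentative calibration text skews which weights get protected.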
### Dataset size
**Recommended**:
- **128-256 samples** of 512 tokens each
- Total: 65K-131K tokens
**More is not always better**:
- 512 samples: Diminishing returns, slower quantization
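The per-domain examples below all follow the same pattern, so a small helper like this (the function name is illustrative, not part of auto_gptq) keeps the sample count and sequence length in one place:
```python
def build_calibration_set(dataset, tokenizer, text_field="text",
                          n_samples=128, max_length=512):
    """Tokenize the first n_samples examples, truncated to max_length tokens.

    Expects a streaming (iterable) dataset so that .take() works lazily.
    """
    return [
        tokenizer(example[text_field], truncation=True, max_length=max_length)
        for example in dataset.take(n_samples)
    ]
```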
### Dataset selection by domain
**General purpose models (GPT, Llama)**:
```python
from datasets import load_dataset
# C4 dataset (recommended for general models)
dataset = load_dataset("c4", split="train", streaming=True)
calibration_data = [
tokenizer(example["text"])["input_ids"][:512]
for example in dataset.take(128)
]
```
**Code models (CodeLlama, StarCoder)**:
```python
# The Stack dataset
dataset = load_dataset("bigcode/the-stack", split="train", streaming=True)
calibration_data = [
tokenizer(example["content"])["input_ids"][:512]
for example in dataset.take(128)
if example["lang"] == "Python" # Or your target language
]
```
**Chat models**:
```python
# ShareGPT or Alpaca format
dataset = load_dataset("anon8231489123/ShareGPT_Vicuna_unfiltered", split="train")
calibration_data = []
for example in dataset.select(range(128)):
# Format as conversation
conversation = tokenizer.apply_chat_template(
example["conversations"],
tokenize=True,
max_length=512
)
calibration_data.append(conversation)
```
**Domain-specific (medical, legal)**:
```python
# Use domain-specific text
dataset = load_dataset("medical_dataset", split="train")
calibration_data = [
tokenizer(example["text"])["input_ids"][:512]
for example in dataset.take(256) # More samples for niche domains
]
```
## Quantization Process
### Basic quantization
```python
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer
from datasets import load_dataset
# 1. Load model
model_name = "meta-llama/Llama-2-7b-hf"
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=BaseQuantizeConfig(
        bits=4,
        group_size=128,
        desc_act=False
    )
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 2. Prepare calibration data
dataset = load_dataset("allenai/c4", "en", split="train", streaming=True)
calibration_data = [
    tokenizer(example["text"], truncation=True, max_length=512)
    for example in dataset.take(128)
]
# 3. Quantize
model.quantize(calibration_data)
# 4. Save
model.save_quantized("llama-2-7b-gptq")
```
**Time**: ~10-30 minutes for 7B model on A100
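Once saved, the checkpoint can be loaded back with `from_quantized` for a quick smoke test (the directory name matches the save step above; device and kernel options vary by auto_gptq version):
```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# save_quantized does not write the tokenizer, so load it from the base model
model = AutoGPTQForCausalLM.from_quantized("llama-2-7b-gptq", device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```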
### Advanced configuration
```python
config = BaseQuantizeConfig(
    bits=4,                 # 3, 4, or 8 bits
    group_size=128,         # 32, 64, 128, or -1 (per-column)
    desc_act=False,         # Activation order (True = better accuracy, slower)
    damp_percent=0.01,      # Dampening (0.001-0.1, default 0.01)
    static_groups=False,    # Static quantization
    sym=True,               # Symmetric quantization
    true_sequential=True,   # Sequential quantization (more accurate)
    model_seqlen=2048       # Model sequence length
)
```
**Parameter tuning**:
- `damp_percent`: Lower = more accurate, slower. Try 0.005-0.02.
- `desc_act=True`: 0.5-1% better accuracy, 20-30% slower inference
- `group_size=32`: Better accuracy, slightly larger model
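Because these trade-offs are model- and task-dependent, it is often worth quantizing two or three variants and comparing them with the perplexity test below (the directory names here are illustrative):
```python
for group_size in (32, 128):
    config = BaseQuantizeConfig(bits=4, group_size=group_size, desc_act=False)
    model = AutoGPTQForCausalLM.from_pretrained(model_name, quantize_config=config)
    model.quantize(calibration_data)
    model.save_quantized(f"llama-2-7b-gptq-g{group_size}")
    # evaluate each variant (see Quality Evaluation) and keep the best one
```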
### Multi-GPU quantization
```python
# Quantize on multiple GPUs (faster)
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=config,
    device_map="auto",  # Distribute across GPUs
    max_memory={0: "40GB", 1: "40GB"}
)
model.quantize(calibration_data)
```
## Quality Evaluation
### Perplexity testing
```python
from datasets import load_dataset
import torch
# Load test dataset
test_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
test_text = "nn".join(test_dataset["text"])
# Tokenize
encodings = tokenizer(test_text, return_tensors="pt")
max_length = 2048  # or the model's maximum context length
# Calculate perplexity
nlls = []
for i in range(0, encodings.input_ids.size(1), max_length):
    begin_loc = i
    end_loc = min(i + max_length, encodings.input_ids.size(1))
    input_ids = encodings.input_ids[:, begin_loc:end_loc].to("cuda")
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        nll = outputs.loss
    nlls.append(nll)
ppl = torch.exp(torch.stack(nlls).mean())
print(f"Perplexity: {ppl.item():.2f}")
```
**Quality targets**:
- <1.5% perplexity increase: good, matches the calibration target above
- 1.5-5% increase: acceptable, consider tuning calibration
- >5% increase: poor, redo calibration
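To check a result against these targets, compute the relative increase against an FP16 baseline measured with the same script (the baseline value below is a placeholder; substitute your own measurement):
```python
ppl_fp16 = 5.5          # placeholder: perplexity of the unquantized model
ppl_gptq = ppl.item()   # perplexity of the quantized model from the loop above

increase_pct = (ppl_gptq - ppl_fp16) / ppl_fp16 * 100
print(f"Perplexity increase: {increase_pct:.2f}%")
```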
### Benchmark evaluation
```python
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM

# lm-eval 0.4+ expects an LM wrapper, not a raw HF model; for an AutoGPTQ
# wrapper, pass its underlying transformers model (model.model)
lm = HFLM(pretrained=model.model, tokenizer=tokenizer)

# Evaluate on standard benchmarks
results = evaluator.simple_evaluate(
    model=lm,
    tasks=["hellaswag", "mmlu", "arc_challenge"],
    num_fewshot=5
)
print(results["results"])
# Compare to baseline FP16 scores
```
## Optimization Tips
### Improving accuracy
**1. Use more calibration samples**:
```python
# Try 256 or 512 samples
calibration_data = [... for example in dataset.take(256)]
```
**2. Use domain-specific data**:
```python
# Match your use case
if code_model:
    dataset = load_dataset("bigcode/the-stack", split="train", streaming=True)
elif chat_model:
    dataset = load_dataset("anon8231489123/ShareGPT_Vicuna_unfiltered", split="train")
```
**3. Enable activation reordering**:
```python
config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True  # Better accuracy, slower inference
)
```
**4. Use smaller group size**:
```python
config = BaseQuantizeConfig(
    bits=4,
    group_size=32,  # vs 128
    desc_act=False
)
```
### Reducing quantization time
**1. Use fewer samples**:
```python
# 64-128 samples usually sufficient
calibration_data = [... for example in dataset.take(64)]
```
**2. Disable activation ordering**:
```python
config = BaseQuantizeConfig(
    desc_act=False  # Faster quantization
)
```
**3. Use multi-GPU**:
```python
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=config,
    device_map="auto"  # Parallelize across GPUs
)
```
## Troubleshooting
### Poor quality after quantization
**Symptom**: >5% perplexity increase or gibberish output
**Solutions**:
1. **Check calibration data**:
```python
# Verify data is representative
for sample in calibration_data[:5]:
    print(tokenizer.decode(sample["input_ids"]))
```
2. **Try more samples**:
```python
calibration_data = [... for example in dataset.take(256)]
```
3. **Use domain-specific data**:
```python
# Match your model's use case
dataset = load_dataset("domain_specific_dataset")
```
4. **Adjust dampening**:
```python
config = BaseQuantizeConfig(damp_percent=0.005) # Lower dampening
```
### Quantization OOM
**Solutions**:
1. **Reduce batch size**:
```python
model.quantize(calibration_data, batch_size=1)  # process calibration samples one at a time
```
2. **Use CPU offloading**:
```python
model = AutoGPTQForCausalLM.from_pretrained(
    model_name,
    quantize_config=config,
    device_map="auto",
    max_memory={0: "20GB", "cpu": "100GB"}  # layers that don't fit on GPU 0 spill to CPU RAM
)
```
3. **Quantize on larger GPU** or use multi-GPU
### Slow quantization
**Typical times** (7B model):
- Single A100: 10-15 minutes
- Single RTX 4090: 20-30 minutes
- CPU: 2-4 hours (not recommended)
**Speedup**:
- Use fewer samples (64 vs 256)
- Disable `desc_act`
- Use multi-GPU
## Best Practices
1. **Use C4 dataset for general models** - well-balanced, diverse
2. **Match domain** - code models need code data, chat needs conversations
3. **Start with 128 samples** - good balance of speed and quality
4. **Test perplexity** - always verify quality before deployment
5. **Compare kernels** - try ExLlama, Marlin, Triton for speed
6. **Save multiple versions** - try group_size 32, 128, 256
7. **Document settings** - save quantize_config.json for reproducibility
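For point 7, `save_quantized` writes a `quantize_config.json` next to the weights, so recording or diffing the settings of each saved variant is straightforward (the glob pattern below assumes the directory names used earlier):
```python
import json
from pathlib import Path

for variant in sorted(Path(".").glob("llama-2-7b-gptq*")):
    cfg = json.loads((variant / "quantize_config.json").read_text())
    print(variant.name, cfg)  # bits, group_size, desc_act, damp_percent, ...
```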
Source: claude-code-templates (MIT). See About Us for full credits.