# AWQ Troubleshooting Guide
## Installation Issues
### CUDA Version Mismatch
**Error**: `RuntimeError: CUDA error: no kernel image is available for execution`
**Fix**: Install the AutoAWQ build that matches your CUDA toolkit. Run only the `pip` line for your version:
```bash
# Check your CUDA version
nvcc --version
# Then run ONE of the following, depending on the version reported above
pip install autoawq --extra-index-url https://download.pytorch.org/whl/cu118  # For CUDA 11.8
pip install autoawq --extra-index-url https://download.pytorch.org/whl/cu121  # For CUDA 12.1
```
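The lookup from `nvcc --version` output to an index URL can be scripted. The following is a minimal sketch, not part of AutoAWQ: the helper names `parse_nvcc_release` and `wheel_index_url` are hypothetical, and the table only covers the two CUDA builds listed above.

```python
import re
from typing import Optional

# Index URLs copied from the install commands above; other CUDA
# releases are deliberately not guessed.
WHEEL_INDEXES = {
    "11.8": "https://download.pytorch.org/whl/cu118",
    "12.1": "https://download.pytorch.org/whl/cu121",
}


def parse_nvcc_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'major.minor' release from `nvcc --version` output."""
    match = re.search(r"release\s+(\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return f"{match.group(1)}.{match.group(2)}"


def wheel_index_url(cuda_release: str) -> str:
    """Return the extra index URL for a CUDA release, or raise if unknown."""
    try:
        return WHEEL_INDEXES[cuda_release]
    except KeyError:
        raise ValueError(f"No known wheel index for CUDA {cuda_release}") from None


if __name__ == "__main__":
    # Sample line in the format nvcc prints; replace with real output.
    sample = "Cuda compilation tools, release 11.8, V11.8.89"
    release = parse_nvcc_release(sample)
    print(release, wheel_index_url(release))
```

Raising on an unknown release, rather than falling back to a default, avoids installing a build that reproduces the same kernel-image error.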
### Compute Capability Too Low
**Error**: `AssertionError: Compute capability must be >= 7.5`
**Fix**: AWQ requires NVIDIA GPUs with compute capability 7.5+ (Turing or newer):
- RTX 20xx series: 7.5 (supported)
- RTX 30xx series: 8.6 (supported)
- RTX 40xx series: 8.9 (supported)
- A100/H100: 8.0/9.0 (supported)
Older GPUs such as GTX 10xx (6.1) and V100 (7.0) fall below this threshold and are not supported.
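To check a machine before installing anything, the 7.5 threshold can be compared against what PyTorch reports. This is a small sketch under the assumption that PyTorch is already installed. The function names are hypothetical, and `current_gpu_capability` returns `None` when PyTorch or a CUDA device is missing.

```python
from typing import Optional, Tuple

MIN_AWQ_CAPABILITY = (7, 5)  # Turing, per the requirement above


def meets_awq_requirement(capability: Tuple[int, int]) -> bool:
    """True if a (major, minor) compute capability is at least 7.5."""
    return tuple(capability) >= MIN_AWQ_CAPABILITY


def current_gpu_capability() -> Optional[Tuple[int, int]]:
    """Return the first GPU's capability, or None without PyTorch/CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = current_gpu_capability()
    if cap is None:
        print("No CUDA device visible to PyTorch")
    else:
        status = "supported" if meets_awq_requirement(cap) else "not supported"
        print(f"Compute capability {cap[0]}.{cap[1]}: {status}")
```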
### Transformers Version Conflict
**Error**: `ImportError: cannot import name 'AwqConfig'`
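**Fix**: AutoAWQ may downgrade `transformers` during installation. Reinstall a version that provides `AwqConfig`:
```bash
pip install autoawq
# Quote the requirement so the shell does not treat ">" as a redirect
pip install --upgrade "transformers>=4.45.0"
```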
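To confirm whether the installed `transformers` was downgraded, the version can be read from package metadata. This is a rough sketch using only the standard library. The helpers are hypothetical and the parser is deliberately simple (it ignores pre-release suffixes such as `rc1` or `.dev0`), so treat it as a quick check rather than a full version comparison.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn '4.45.0.dev0' into (4, 45, 0); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def is_at_least(version: str, minimum: str) -> bool:
    """Compare two dotted versions numerically, padding the shorter with zeros."""
    have, need = version_tuple(version), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("transformers")
    if found is None:
        print("transformers is not installed")
    elif is_at_least(found, "4.45.0"):
        print(f"transformers {found} is new enough")
    else:
        print(f"transformers {found} is older than 4.45.0; upgrade it")
```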
### Triton Not Found (Linux)
**Error**: `ModuleNotFoundError: No module named 'triton'`
**Fix**:
```bash
pip install triton
# Or install with kernels (quoted so shells like zsh do not expand the brackets)
pip install "autoawq[kernels]"
```
## Quantization Issues
### CUDA Out of Memory During Quantization
**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`
**Solutions**:
1. **Reduce calibration samples**:
```python
model.quantize(
    tokenizer,
    quant_config=quant_config,
    max_calib_samples=64,  # Reduce from 128
)
```
2. **Use CPU offloading**:
```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    low_cpu_mem_usage=True,
)
```
3. **Multi-GPU quantization**:
```python
model = AutoAWQForCausalLM.from_pretrained(
    model_path,
    device_map="auto",
)
```
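The first option, reducing calibration samples, can be automated by retrying with fewer samples after each out-of-memory error. The helper below is a hypothetical sketch and not an AutoAWQ feature. It takes any callable, so the exception type and the quantization call are supplied by the caller.

```python
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def run_with_sample_backoff(
    run: Callable[[int], T],
    start: int = 128,
    minimum: int = 16,
    oom_error: Type[BaseException] = MemoryError,
) -> Tuple[T, int]:
    """Call run(n_samples), halving n_samples after each out-of-memory error.

    Returns (result, n_samples_used). Re-raises the last error once
    halving again would drop below `minimum`.
    """
    n_samples = start
    while True:
        try:
            return run(n_samples), n_samples
        except oom_error:
            if n_samples // 2 < minimum:
                raise
            n_samples //= 2


if __name__ == "__main__":
    # Stand-in for a quantization call that only fits with <= 32 samples.
    def fake_quantize(n_samples: int) -> str:
        if n_samples > 32:
            raise MemoryError("simulated out of memory")
        return "quantized"

    print(run_with_sample_backoff(fake_quantize))
```

With AutoAWQ, `run` could wrap `model.quantize(tokenizer, quant_config=quant_config, max_calib_samples=n)` and `oom_error` could be `torch.cuda.OutOfMemoryError`. A failed attempt may leave the model partially modified, so reloading the model inside `run` is the safer assumption.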
### NaN in Weights After Quantization
**Error**: `AssertionError: NaN detected in weights`
**Cause**: Calibration data issues or numerical instability.
**Fix**:
```python
# Use more calibration samples
model.quantize(
    tokenizer,
    quant_config=quant_config,
    max_calib_samples=256,
    max_calib_seq_len=1024,
)
```
### Empty Calibration Samples
**Error**: `ValueError: Calibration samples are empty`
**Fix**: Ensure tokenizer produces valid output:
```python
# Check tokenizer
test = tokenizer("test", return_tensors="pt")
print(f"Token count: {test.input_ids.shape[1]}")
# Use explicit calibration data
calib_data = ["Your sample text here..."] * 128
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
```
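Blank or very short strings in a custom `calib_data` list are one way to end up with empty samples. As a hedged sketch, the hypothetical helper below filters a list of texts before it is passed to `model.quantize`. The token counter is injected so the function does not depend on a particular tokenizer.

```python
from typing import Callable, List, Sequence


def usable_calib_texts(
    texts: Sequence[str],
    count_tokens: Callable[[str], int],
    min_tokens: int = 1,
) -> List[str]:
    """Keep only texts that are non-blank and produce at least min_tokens tokens."""
    kept = []
    for text in texts:
        if not text or not text.strip():
            continue  # Skip empty and whitespace-only entries
        if count_tokens(text) >= min_tokens:
            kept.append(text)
    return kept


if __name__ == "__main__":
    raw = ["", "   ", "hi", "A longer calibration sentence with several words."]
    # Whitespace splitting stands in for a real tokenizer here.
    cleaned = usable_calib_texts(raw, lambda t: len(t.split()), min_tokens=2)
    print(f"Kept {len(cleaned)} of {len(raw)} samples")
```

With a Hugging Face tokenizer, `count_tokens` could be `lambda t: len(tokenizer(t).input_ids)`.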
### Unsupported Model Architecture
**Error**: `TypeError: 'model_type' is not supported`
**Cause**: Model architecture not in AWQ registry.
**Check supported models**:
```python
from awq.models.auto import AWQ_CAUSAL_LM_MODEL_MAP
print(list(AWQ_CAUSAL_LM_MODEL_MAP.keys()))
```
**Supported**: llama, mistral, qwen2, falcon, mpt, phi, gemma, etc.
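For a local checkpoint, the architecture name can be read from `config.json` before starting a long quantization run. The sketch below uses only the standard library, with hypothetical helper names. Its `KNOWN_MODEL_TYPES` set is just the partial list from the line above, not the full registry.

```python
import json
from pathlib import Path

# Partial list taken from the "Supported" line above; the authoritative
# list is whatever the AWQ registry contains in your installation.
KNOWN_MODEL_TYPES = {"llama", "mistral", "qwen2", "falcon", "mpt", "phi", "gemma"}


def read_model_type(model_dir: str) -> str:
    """Return the `model_type` field from a local checkpoint's config.json."""
    config_path = Path(model_dir, "config.json")
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config["model_type"]


def is_known_model_type(model_type: str, known=KNOWN_MODEL_TYPES) -> bool:
    """True if the architecture name appears in the given set of known types."""
    return model_type in known
```

Passing the keys of the installed registry as `known` gives a more reliable answer than the hard-coded set.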
## Inference Issues
### Slow Inference Speed
**Problem**: Inference slower than expected.
**Solutions**:
1. **Enable layer fusion**:
```python
model = AutoAWQForCausalLM.from_quantized(
    model_name,
    fuse_layers=True,
)
```
2. **Use correct kernel for batch size**:
```python
# For batch_size=1
quant_config = {"version": "GEMV"}
# For batch_size>1
quant_config = {"version": "GEMM"}
```
3. **Use Marlin on Ampere+ GPUs**:
```python
from transformers import AwqConfig
config = AwqConfig(bits=4, version="marlin")
```
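The kernel guidance above can be condensed into one small function. This is an illustrative sketch only. `suggest_awq_version` is a hypothetical helper, the `(8, 0)` threshold is the assumed meaning of "Ampere+", and whether Marlin should take priority over GEMV/GEMM is left to the caller through `prefer_marlin`.

```python
from typing import Tuple

MARLIN_MIN_CAPABILITY = (8, 0)  # Assumed meaning of "Ampere+"


def suggest_awq_version(
    batch_size: int,
    capability: Tuple[int, int],
    prefer_marlin: bool = False,
) -> str:
    """Suggest a kernel version string from batch size and GPU capability."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    if prefer_marlin and tuple(capability) >= MARLIN_MIN_CAPABILITY:
        return "marlin"
    # GEMV for single-sequence decoding, GEMM for batched inference
    return "GEMV" if batch_size == 1 else "GEMM"


if __name__ == "__main__":
    print(suggest_awq_version(1, (7, 5)))
    print(suggest_awq_version(8, (8, 6)))
    print(suggest_awq_version(8, (8, 6), prefer_marlin=True))
```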
### Wrong Output / Garbage Text
**Problem**: Model produces nonsensical output after quantization.
**Causes and fixes**:
1. **Poor calibration data**: Use domain-relevant data
```python
calib_data = [
    "Relevant examples from your use case...",
]
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
```
2. **Tokenizer mismatch**: Load the tokenizer from the same checkpoint as the quantized model
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
```
3. **Check generation config**:
```python
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,
)
```
### FlashAttention2 Incompatibility
**Error**: `ValueError: Cannot use FlashAttention2 with fused modules`
**Fix**: Disable one or the other:
```python
# Option 1: Use fused modules (recommended for AWQ)
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)
# Option 2: Use FlashAttention2 without fusion
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```
### AMD GPU Issues
**Error**: `RuntimeError: ROCm/HIP not found`
**Fix**: Use ExLlama backend for AMD:
```python
from transformers import AutoModelForCausalLM, AwqConfig

config = AwqConfig(bits=4, version="exllama")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=config,
)
```
## Loading Issues
### Model Not Found
**Error**: `OSError: model_name is not a valid model identifier`
**Fix**: Check that the model exists on the Hugging Face Hub:
- Search AWQ models: https://huggingface.co/models?library=awq
- Common AWQ model providers: TheBloke, teknium, Qwen, NousResearch
### Safetensors Error
**Error**: `safetensors_rust.SafetensorError: Error while deserializing`
**Fix**: Try loading without safetensors:
```python
model = AutoAWQForCausalLM.from_quantized(
    model_name,
    safetensors=False,
)
```
### Device Map Conflicts
**Error**: `ValueError: You cannot use device_map with max_memory`
**Fix**: Use one or the other:
```python
# Auto device map
model = AutoAWQForCausalLM.from_quantized(model_name, device_map="auto")
# OR manual memory limits
model = AutoAWQForCausalLM.from_quantized(
    model_name,
    max_memory={0: "20GB", 1: "20GB"},
)
```
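When loading options come from a config file or CLI flags, it helps to reject the conflicting combination before the model load starts. The sketch below uses a hypothetical helper, `placement_kwargs`, that returns at most one placement option, which can then be unpacked into `from_quantized`.

```python
from typing import Any, Dict, Optional, Union


def placement_kwargs(
    device_map: Optional[Union[str, Dict[str, Any]]] = None,
    max_memory: Optional[Dict[Any, str]] = None,
) -> Dict[str, Any]:
    """Return keyword arguments containing at most one placement option."""
    if device_map is not None and max_memory is not None:
        raise ValueError("Pass either device_map or max_memory, not both")
    if device_map is not None:
        return {"device_map": device_map}
    if max_memory is not None:
        return {"max_memory": max_memory}
    return {}


if __name__ == "__main__":
    print(placement_kwargs(device_map="auto"))
    print(placement_kwargs(max_memory={0: "20GB", 1: "20GB"}))
```

A call such as `AutoAWQForCausalLM.from_quantized(model_name, **placement_kwargs(device_map="auto"))` then fails early, with a clear message, if both options are set by mistake.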
## vLLM Integration Issues
### Quantization Not Detected
**Problem**: vLLM loads the model in FP16 instead of using the AWQ-quantized weights.
**Fix**: Explicitly specify quantization:
```python
from vllm import LLM
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",  # Explicitly set
    dtype="half",
)
```
### Marlin Kernel Error in vLLM
**Error**: `RuntimeError: Marlin kernel not supported`
**Fix**: Check GPU compatibility:
```python
import torch
from vllm import LLM

print(torch.cuda.get_device_capability())  # Must be >= (8, 0)
# If not supported, use GEMM
llm = LLM(model="...", quantization="awq")  # Uses GEMM by default
```
## Performance Debugging
### Memory Usage Check
```python
import torch


def print_gpu_memory():
    """Print allocated and reserved memory for every visible GPU."""
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1e9
        reserved = torch.cuda.memory_reserved(i) / 1e9
        print(f"GPU {i}: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")


print_gpu_memory()
```
### Profiling Inference
```python
import time

import torch


def benchmark_model(model, tokenizer, prompt, n_runs=5):
    """Print average generation throughput over n_runs timed generations."""
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Warmup so one-time setup cost is not timed
    model.generate(**inputs, max_new_tokens=10)
    torch.cuda.synchronize()
    # Benchmark
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        outputs = model.generate(**inputs, max_new_tokens=100)
        torch.cuda.synchronize()  # Wait for GPU work before stopping the clock
        times.append(time.perf_counter() - start)
    # New tokens from the last run (prompt tokens excluded)
    tokens = outputs.shape[1] - inputs.input_ids.shape[1]
    avg_time = sum(times) / len(times)
    print(f"Average: {tokens/avg_time:.2f} tokens/sec")
```
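An average alone can hide a slow outlier run. As a small, hedged extension, the hypothetical helper below summarizes a list of per-run durations such as the `times` list collected in `benchmark_model`, which would need to return `times` and `tokens` instead of only printing.

```python
import statistics
from typing import Dict, Sequence


def summarize_throughput(times: Sequence[float], new_tokens: int) -> Dict[str, float]:
    """Return mean, median, best, and worst tokens/sec for a set of timed runs."""
    if not times:
        raise ValueError("times must not be empty")
    if any(t <= 0 for t in times):
        raise ValueError("durations must be positive")
    return {
        "mean_tok_s": new_tokens / statistics.mean(times),
        "median_tok_s": new_tokens / statistics.median(times),
        "best_tok_s": new_tokens / min(times),
        "worst_tok_s": new_tokens / max(times),
    }


if __name__ == "__main__":
    # Made-up durations in seconds, for illustration only.
    for name, value in summarize_throughput([2.1, 2.0, 3.4, 2.0, 2.1], 100).items():
        print(f"{name}: {value:.2f}")
```

A large gap between the median and worst figures usually points to interference on individual runs rather than a consistently slow kernel.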
## Getting Help
1. **Check deprecation notice**: AutoAWQ is deprecated; use llm-compressor for new projects
2. **GitHub Issues**: https://github.com/casper-hansen/AutoAWQ/issues
3. **HuggingFace Forums**: https://discuss.huggingface.co/
4. **vLLM Discord**: For vLLM integration issues
Source: claude-code-templates (MIT).