


# AWQ Troubleshooting Guide

## Installation Issues

### CUDA Version Mismatch

**Error**: `RuntimeError: CUDA error: no kernel image is available for execution`

**Fix**: Install a build that matches your CUDA version:

```bash
# Check your CUDA version
nvcc --version

# Install autoawq against the matching PyTorch wheel index
# For CUDA 11.8
pip install autoawq --extra-index-url https://download.pytorch.org/whl/cu118
# For CUDA 12.1
pip install autoawq --extra-index-url https://download.pytorch.org/whl/cu121
```

### Compute Capability Too Low

**Error**: `AssertionError: Compute capability must be >= 7.5`

**Fix**: AWQ requires NVIDIA GPUs with compute capability 7.5+ (Turing or newer):

- RTX 20xx series: 7.5 (supported)
- RTX 30xx series: 8.6 (supported)
- RTX 40xx series: 8.9 (supported)
- A100/H100: 8.0/9.0 (supported)

Older GPUs (GTX 10xx at 6.1, V100 at 7.0) are not supported.

### Transformers Version Conflict

**Error**: `ImportError: cannot import name 'AwqConfig'`

**Fix**: AutoAWQ may downgrade transformers during installation. Reinstall the correct version afterwards:

```bash
pip install autoawq
# Quote the requirement so the shell does not treat ">" as a redirect
pip install --upgrade "transformers>=4.45.0"
```

### Triton Not Found (Linux)

**Error**: `ModuleNotFoundError: No module named 'triton'`

**Fix**:

```bash
pip install triton

# Or install with kernels (quoted so the brackets are not expanded by the shell)
pip install "autoawq[kernels]"
```

## Quantization Issues

### CUDA Out of Memory During Quantization

**Error**: `torch.cuda.OutOfMemoryError: CUDA out of memory`

**Solutions**:

1. **Reduce calibration samples**:

   ```python
   model.quantize(
       tokenizer,
       quant_config=quant_config,
       max_calib_samples=64  # Reduce from 128
   )
   ```

2. **Use CPU offloading**:

   ```python
   model = AutoAWQForCausalLM.from_pretrained(
       model_path,
       low_cpu_mem_usage=True
   )
   ```

3. **Multi-GPU quantization**:

   ```python
   model = AutoAWQForCausalLM.from_pretrained(
       model_path,
       device_map="auto"
   )
   ```

### NaN in Weights After Quantization

**Error**: `AssertionError: NaN detected in weights`

**Cause**: Calibration data issues or numerical instability.

**Fix**:

```python
# Use more calibration samples
model.quantize(
    tokenizer,
    quant_config=quant_config,
    max_calib_samples=256,
    max_calib_seq_len=1024
)
```

### Empty Calibration Samples

**Error**: `ValueError: Calibration samples are empty`

**Fix**: Ensure the tokenizer produces valid output:

```python
# Check tokenizer
test = tokenizer("test", return_tensors="pt")
print(f"Token count: {test.input_ids.shape[1]}")

# Use explicit calibration data
calib_data = ["Your sample text here..."] * 128
model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
```

### Unsupported Model Architecture

**Error**: `TypeError: 'model_type' is not supported`

**Cause**: The model architecture is not in the AWQ registry.

**Check supported models**:

```python
from awq.models import AWQ_CAUSAL_LM_MODEL_MAP

print(list(AWQ_CAUSAL_LM_MODEL_MAP.keys()))
```

**Supported**: llama, mistral, qwen2, falcon, mpt, phi, gemma, etc.

## Inference Issues

### Slow Inference Speed

**Problem**: Inference is slower than expected.

**Solutions**:

1. **Enable layer fusion**:

   ```python
   model = AutoAWQForCausalLM.from_quantized(
       model_name,
       fuse_layers=True
   )
   ```

2. **Use the correct kernel for your batch size**:

   ```python
   # For batch_size=1
   quant_config = {"version": "GEMV"}

   # For batch_size>1
   quant_config = {"version": "GEMM"}
   ```

3. **Use Marlin on Ampere+ GPUs**:

   ```python
   from transformers import AwqConfig

   config = AwqConfig(bits=4, version="marlin")
   ```

### Wrong Output / Garbage Text

**Problem**: The model produces nonsensical output after quantization.

**Causes and fixes**:

1. **Poor calibration data**: Use domain-relevant data

   ```python
   calib_data = [
       "Relevant examples from your use case...",
   ]
   model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_data)
   ```

2. **Tokenizer mismatch**: Ensure the same tokenizer is used for quantization and inference

   ```python
   tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
   ```

3. **Check generation config**:

   ```python
   outputs = model.generate(
       **inputs,
       max_new_tokens=200,
       do_sample=True,
       temperature=0.7,
       pad_token_id=tokenizer.eos_token_id
   )
   ```

### FlashAttention2 Incompatibility

**Error**: `ValueError: Cannot use FlashAttention2 with fused modules`

**Fix**: Disable one or the other:

```python
# Option 1: Use fused modules (recommended for AWQ)
model = AutoAWQForCausalLM.from_quantized(model_name, fuse_layers=True)

# Option 2: Use FlashAttention2 without fusion
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
```

### AMD GPU Issues

**Error**: `RuntimeError: ROCm/HIP not found`

**Fix**: Use the ExLlama backend for AMD:

```python
from transformers import AutoModelForCausalLM, AwqConfig

config = AwqConfig(bits=4, version="exllama")
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=config
)
```

## Loading Issues

### Model Not Found

**Error**: `OSError: model_name is not a valid model identifier`

**Fix**: Check that the model exists on HuggingFace:

```text
# Search AWQ models
https://huggingface.co/models?library=awq

# Common AWQ model providers
TheBloke, teknium, Qwen, NousResearch
```

### Safetensors Error

**Error**: `safetensors_rust.SafetensorError: Error while deserializing`

**Fix**: Try loading without safetensors:

```python
model = AutoAWQForCausalLM.from_quantized(
    model_name,
    safetensors=False
)
```

### Device Map Conflicts

**Error**: `ValueError: You cannot use device_map with max_memory`

**Fix**: Use one or the other:

```python
# Auto device map
model = AutoAWQForCausalLM.from_quantized(model_name, device_map="auto")

# OR manual memory limits
model = AutoAWQForCausalLM.from_quantized(
    model_name,
    max_memory={0: "20GB", 1: "20GB"}
)
```

## vLLM Integration Issues

### Quantization Not Detected

**Problem**: vLLM loads the model in FP16 instead of quantized.

**Fix**: Explicitly specify quantization:

```python
from vllm import LLM

llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",  # Explicitly set
    dtype="half"
)
```

### Marlin Kernel Error in vLLM

**Error**: `RuntimeError: Marlin kernel not supported`

**Fix**: Check GPU compatibility:

```python
import torch
from vllm import LLM

print(torch.cuda.get_device_capability())  # Must be >= (8, 0)

# If not supported, use GEMM
llm = LLM(model="...", quantization="awq")  # Uses GEMM by default
```

## Performance Debugging

### Memory Usage Check

```python
import torch

def print_gpu_memory():
    # Report allocated and reserved memory for every visible GPU
    for i in range(torch.cuda.device_count()):
        allocated = torch.cuda.memory_allocated(i) / 1e9
        reserved = torch.cuda.memory_reserved(i) / 1e9
        print(f"GPU {i}: {allocated:.2f}GB allocated, {reserved:.2f}GB reserved")

print_gpu_memory()
```

### Profiling Inference

```python
import time

import torch

def benchmark_model(model, tokenizer, prompt, n_runs=5):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Warmup
    model.generate(**inputs, max_new_tokens=10)
    torch.cuda.synchronize()

    # Benchmark
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        outputs = model.generate(**inputs, max_new_tokens=100)
        torch.cuda.synchronize()  # Wait for the GPU before stopping the timer
        times.append(time.perf_counter() - start)

    # Count only newly generated tokens, excluding the prompt
    tokens = outputs.shape[1] - inputs.input_ids.shape[1]
    avg_time = sum(times) / len(times)
    print(f"Average: {tokens/avg_time:.2f} tokens/sec")
```

## Getting Help

1. **Check deprecation notice**: AutoAWQ is deprecated; use llm-compressor for new projects
2. **GitHub Issues**: https://github.com/casper-hansen/AutoAWQ/issues
3. **HuggingFace Forums**: https://discuss.huggingface.co/
4. **vLLM Discord**: For vLLM integration issues
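
When asking for help through any of the channels above, it usually helps to include the versions of the packages this guide mentions and the GPU's compute capability. The following is a minimal sketch of such a report; the function name `collect_env_report` and the default package list are illustrative assumptions, not part of AutoAWQ or vLLM.

```python
import importlib.metadata as md
import platform
import sys


def collect_env_report(packages=("autoawq", "transformers", "torch", "triton", "vllm")):
    """Return a dict of version details to paste into a bug report (hypothetical helper)."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }

    # Look up installed distribution versions without importing the packages
    for name in packages:
        try:
            report[name] = md.version(name)
        except md.PackageNotFoundError:
            report[name] = "not installed"

    # GPU details are only available when torch can be imported
    try:
        import torch
    except (ImportError, OSError):
        report["cuda"] = "torch not importable"
        return report

    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        report["cuda"] = torch.version.cuda
        report["gpu"] = torch.cuda.get_device_name(0)
        report["compute_capability"] = f"{major}.{minor}"
    else:
        report["cuda"] = "no CUDA device visible"
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Package versions are read from installed metadata rather than by importing each package, so the report still works when one of them fails to import, which is often the problem being reported.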

Source: claude-code-templates (MIT). See About Us for full credits.
