[ PROMPT_NODE_22877 ]

Post Training Torchforge – Troubleshooting

[ SKILL_DOCUMENTATION ]

# torchforge Troubleshooting Guide ## GPU Resource Issues ### Issue: Not Enough GPUs **Symptoms**: "Insufficient GPU resources" error **Solutions**: 1. **Reduce service requirements**: ```yaml services: generator: procs: 1 with_gpus: true trainer: procs: 1 with_gpus: true # Remove ref_model or use CPU ``` 2. **Use CPU for reference model**: ```yaml ref_model: with_gpus: false # Run on CPU ``` 3. **Share resources between services**: ```yaml services: generator: procs: 1 num_replicas: 1 colocate_with: trainer # Share GPU with trainer ``` ### Issue: Minimum GPU Requirements **Reference**: - SFT: 2+ GPUs (trainer + generator) - GRPO: 3+ GPUs (trainer + generator + ref_model) - Large models: 8+ GPUs with tensor parallelism ## Memory Issues ### Issue: OOM During Generation **Symptoms**: CUDA OOM in vLLM **Solutions**: 1. **Reduce batch size**: ```yaml grpo: n_samples: 4 # Reduce from 8 ``` 2. **Reduce sequence length**: ```yaml training: seq_len: 2048 # Reduce from 4096 ``` 3. **Reduce vLLM memory**: ```yaml generator: gpu_memory_utilization: 0.7 # Reduce from 0.9 ``` ### Issue: OOM During Training **Symptoms**: CUDA OOM in backward pass **Solutions**: 1. **Enable gradient checkpointing**: ```yaml training: gradient_checkpointing: true ``` 2. **Increase gradient accumulation**: ```yaml training: gradient_accumulation_steps: 8 # Increase from 4 ``` 3. **Reduce batch size**: ```yaml training: batch_size: 2 # Reduce from 4 ``` ## Weight Synchronization Issues ### Issue: Slow Weight Sync **Symptoms**: Long pauses between training and generation **Solutions**: 1. **Enable RDMA** (if available): ```bash export TORCHSTORE_USE_RDMA=1 ``` 2. **Reduce sync frequency**: ```yaml training: sync_interval: 10 # Sync every 10 steps ``` 3. **Use colocated services**: ```yaml services: generator: colocate_with: trainer ``` ### Issue: Weight Sync Failures **Symptoms**: Errors in weight transfer, stale weights **Solutions**: 1. **Check network connectivity**: ```bash ping other_node ``` 2. **Increase timeout**: ```yaml services: weight_sync_timeout: 600 # 10 minutes ``` 3. **Enable sync verification**: ```yaml training: verify_weight_sync: true ``` ## Training Stability Issues ### Issue: Policy Collapse **Symptoms**: Entropy drops to zero, reward stops improving **Solutions**: 1. **Increase KL penalty**: ```yaml grpo: beta: 0.2 # Increase from 0.1 ``` 2. **Add entropy bonus**: ```yaml training: entropy_coef: 0.01 ``` 3. **Reduce learning rate**: ```yaml training: learning_rate: 5e-7 # Reduce from 1e-6 ``` ### Issue: Loss Spikes **Symptoms**: Sudden loss increases, training instability **Solutions**: 1. **Enable gradient clipping**: ```yaml training: max_grad_norm: 1.0 ``` 2. **Reduce clip range**: ```yaml grpo: clip_low: 0.1 # Reduce from 0.2 clip_high: 0.18 # Reduce from 0.28 ``` 3. **Use learning rate warmup**: ```yaml training: warmup_steps: 100 ``` ### Issue: Divergent Training **Symptoms**: Loss becomes NaN, model outputs garbage **Solutions**: 1. **Check for data issues**: ```python # Verify no empty sequences for batch in dataset: assert batch.input_ids.numel() > 0 ``` 2. **Use BF16 instead of FP16**: ```yaml training: dtype: bfloat16 ``` 3. **Reduce learning rate significantly**: ```yaml training: learning_rate: 1e-7 ``` ## Service Issues ### Issue: Service Startup Failures **Symptoms**: Services fail to initialize **Solutions**: 1. **Check resource availability**: ```bash nvidia-smi # Verify GPU availability ``` 2. **Increase startup timeout**: ```yaml services: startup_timeout: 600 ``` 3. **Check model path**: ```python from transformers import AutoModelForCausalLM model = AutoModelForCausalLM.from_pretrained("model_path") # Verify accessible ``` ### Issue: Generator Not Responding **Symptoms**: Generation hangs, timeouts **Solutions**: 1. **Check vLLM status**: ```python # Add health check await generator.health_check.route() ``` 2. **Restart service**: ```python await generator.restart.fanout() ``` 3. **Reduce concurrent requests**: ```yaml generator: max_concurrent_requests: 10 ``` ## Monarch Issues ### Issue: Monarch Actor Failures **Symptoms**: Actor crashes, communication errors **Solutions**: 1. **Enable fault tolerance**: ```yaml monarch: fault_tolerance: true max_restarts: 3 ``` 2. **Increase actor memory**: ```yaml services: actor_memory_mb: 4096 ``` 3. **Check Monarch logs**: ```bash export MONARCH_LOG_LEVEL=DEBUG ``` ### Issue: Deadlock in Distributed Communication **Symptoms**: Training hangs, no progress **Solutions**: 1. **Check for blocking calls**: ```python # Use async/await correctly result = await service.method.route(args) # Correct # result = service.method.route(args).wait() # May deadlock ``` 2. **Add timeouts**: ```python result = await asyncio.wait_for( service.method.route(args), timeout=60.0 ) ``` ## Installation Issues ### Issue: PyTorch Version Mismatch **Symptoms**: Import errors, CUDA errors **Solutions**: 1. **Use provided install script**: ```bash ./scripts/install.sh ``` 2. **Verify versions**: ```python import torch print(torch.__version__) # Should be 2.9.0+ ``` 3. **Clean reinstall**: ```bash pip uninstall torch torchvision torchaudio ./scripts/install.sh ``` ### Issue: Monarch Installation Fails **Symptoms**: Cannot import monarch **Solutions**: 1. **Install from source**: ```bash git clone https://github.com/meta-pytorch/monarch cd monarch && pip install -e . ``` 2. **Check CUDA compatibility**: ```bash nvcc --version # Should match PyTorch CUDA ``` ## Debugging Tips ### Enable Verbose Logging ```bash export FORGE_DEBUG=1 export MONARCH_LOG_LEVEL=DEBUG ``` ### Profile Services ```python # Add profiling with torch.profiler.profile() as prof: result = await trainer.train_step.route(batch) prof.export_chrome_trace("trace.json") ``` ### Monitor GPU Utilization ```bash watch -n 1 nvidia-smi ``` ### Test Services Individually ```python # Test generator completions = await generator.generate.route( prompts=["Hello"], max_tokens=10, ) print(completions[0].text) # Test trainer result = await trainer.train_step.route(dummy_batch) print(result.loss) ``` ## Experimental Warning Both Monarch and torchforge are experimental. Expect: - API changes between versions - Incomplete features - Bugs in edge cases Check Discord for latest updates and workarounds. ## Resources - GitHub Issues: https://github.com/meta-pytorch/torchforge/issues - Discord: https://discord.gg/YsTYBh6PD9 - Monarch Issues: https://github.com/meta-pytorch/monarch/issues

Source: claude-code-templates (MIT). See About Us for full credits.

BAGUA AI