


# MoE Training Guide

Complete training guide based on DeepSpeed official documentation and production practices.

## Table of Contents

- DeepSpeed MoE Setup
- Training Parameters
- PR-MoE (Pyramid-Residual-MoE)
- Mixture-of-Students (MoS)
- Hyperparameter Tuning
- Production Training

## DeepSpeed MoE Setup

**Source**: DeepSpeed MoE Tutorial (https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/)

### Requirements

```bash
# Install DeepSpeed v0.6.0 or higher
# (quoted so the shell does not treat ">" as an output redirect)
pip install "deepspeed>=0.6.0"

# Clone Megatron-DeepSpeed
git clone https://github.com/microsoft/Megatron-DeepSpeed
cd Megatron-DeepSpeed
pip install -r requirements.txt
```

### Basic MoE Configuration

```json
{
  "train_batch_size": 256,
  "gradient_accumulation_steps": 1,
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16
  },
  "moe": {
    "enabled": true,
    "num_experts": 128,
    "expert_parallel_size": 8,
    "moe_loss_coeff": 0.01,
    "train_capacity_factor": 1.25,
    "eval_capacity_factor": 2.0,
    "min_capacity": 4,
    "drop_tokens": true
  },
  "zero_optimization": {
    "stage": 1
  }
}
```

## Training Parameters

### Core MoE Parameters

**From DeepSpeed documentation:**

1. **`--num-experts`**
   - Number of experts per MoE layer
   - Recommended: 128 experts
   - Range: 8-256 depending on scale

2. **`--moe-expert-parallel-size`**
   - Degree of expert parallelism
   - Distributes experts across GPUs
   - Example: 128 experts / 8 GPUs = 16 experts per GPU

3. **`--moe-loss-coeff`**
   - MoE auxiliary loss coefficient
   - Recommended: 0.01
   - Controls load balancing strength

4. **`--moe-train-capacity-factor`**
   - Training capacity multiplier
   - Value used here: 1.25
   - Formula: capacity = (tokens/num_experts) × capacity_factor

5. **`--moe-eval-capacity-factor`**
   - Evaluation capacity multiplier
   - Value used here: 2.0 (larger capacity, so fewer tokens are dropped during evaluation)

6. **`--moe-min-capacity`**
   - Minimum expert capacity
   - Default: 4
   - Lower bound on the number of tokens each expert can accept

7. **`--disable-moe-token-dropping`**
   - Remove expert capacity limits
   - All tokens processed (no dropping)
   - May increase memory usage

### Example Training Script

```bash
#!/bin/bash

deepspeed --num_gpus 8 pretrain_gpt_moe.py \
  --tensor-model-parallel-size 1 \
  --pipeline-model-parallel-size 1 \
  --num-layers 24 \
  --hidden-size 1024 \
  --num-attention-heads 16 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --micro-batch-size 4 \
  --global-batch-size 256 \
  --train-iters 500000 \
  --lr 0.0001 \
  --min-lr 0.00001 \
  --lr-decay-style cosine \
  --lr-warmup-iters 2000 \
  --clip-grad 1.0 \
  --weight-decay 0.1 \
  --num-experts 128 \
  --moe-expert-parallel-size 8 \
  --moe-loss-coeff 0.01 \
  --moe-train-capacity-factor 1.25 \
  --moe-eval-capacity-factor 2.0 \
  --moe-min-capacity 4 \
  --fp16 \
  --deepspeed \
  --deepspeed_config ds_config_moe.json \
  --data-path /path/to/data \
  --vocab-file /path/to/vocab.json \
  --merge-file /path/to/merges.txt \
  --save-interval 5000 \
  --eval-interval 1000 \
  --eval-iters 100
```

## PR-MoE: Pyramid-Residual-MoE

**Source**: DeepSpeed documentation - improves parameter efficiency by up to 3× over standard MoE

### Architecture

PR-MoE uses:

- Varying number of experts per layer (pyramid structure)
- Residual connections between expert layers
- Better parameter efficiency

### Configuration

```bash
# PR-MoE specific parameters
--num-experts "[128, 64, 32, 16]"  # Pyramid: different experts per layer
--mlp-type residual                # Use residual connections
--moe-expert-parallel-size 4
--moe-loss-coeff 0.01
```

### Full PR-MoE Training

```bash
# --num-experts "[128, 64, 32, 16]" sets the pyramid structure;
# --mlp-type residual selects Residual MoE.
deepspeed --num_gpus 8 pretrain_gpt_moe.py \
  --num-layers 24 \
  --hidden-size 1024 \
  --num-attention-heads 16 \
  --seq-length 2048 \
  --max-position-embeddings 2048 \
  --micro-batch-size 4 \
  --global-batch-size 256 \
  --num-experts "[128, 64, 32, 16]" \
  --mlp-type residual \
  --moe-expert-parallel-size 4 \
  --moe-loss-coeff 0.01 \
  --moe-train-capacity-factor 1.25 \
  --fp16 \
  --deepspeed \
  --deepspeed_config ds_config_moe.json \
  --data-path /path/to/data \
  --save-interval 5000
```

**Benefits**:

- Up to 3× better parameter efficiency vs standard MoE
- Fewer total parameters for same performance
- Better gradient flow with residual connections

## Mixture-of-Students (MoS)

**Source**: DeepSpeed documentation - knowledge distillation for MoE

### Overview

MoS = MoE + Knowledge Distillation

- Student: MoE model (being trained)
- Teacher: Dense model (pre-trained)
- Transfers knowledge from dense teacher to sparse MoE student

### Configuration

```bash
# MoS parameters
--mos                            # Enable MoS distillation
--load-teacher /path/to/teacher  # Teacher model checkpoint
--teacher-forward                # Enable teacher forward pass
--teacher-model-parallel-size 1
```

### Full MoS Training

```bash
# --mos enables MoS; --load-teacher points at the teacher checkpoint.
deepspeed --num_gpus 8 pretrain_gpt_moe.py \
  --num-layers 24 \
  --hidden-size 1024 \
  --num-attention-heads 16 \
  --num-experts 128 \
  --moe-expert-parallel-size 8 \
  --moe-loss-coeff 0.01 \
  --mos \
  --load-teacher /path/to/dense/teacher \
  --teacher-forward \
  --teacher-model-parallel-size 1 \
  --fp16 \
  --deepspeed \
  --deepspeed_config ds_config_moe.json \
  --data-path /path/to/data
```

### Staged Distillation

**Recommended**: Stop distillation early

```python
# In training loop
# if iteration [...] 2.0, increase moe_loss_coeff

# expert_counts, num_experts, total_tokens and processed_tokens are
# assumed to be collected by the training loop.

# Expert utilization
utilized_experts = sum(count > 0 for count in expert_counts)
utilization_rate = utilized_experts / num_experts
# Should be close to 1.0 (all experts used)

# Token dropping rate
dropped_tokens = total_tokens - processed_tokens
drop_rate = dropped_tokens / total_tokens
# Should be low (<5%) during training
```

## Troubleshooting

### Issue: Load Imbalance

**Symptoms**: Some experts get most tokens

**Solutions**:

1. Increase `moe_loss_coeff` (0.01 → 0.1)
2. Reduce `train_capacity_factor` (forces redistribution)
3. Add noise to router logits (gating network)

### Issue: High Memory Usage

**Solutions**:

1. Enable ZeRO Stage 1 or 2
2. Reduce `train_capacity_factor`
3. Enable `drop_tokens`
4. Increase `moe_expert_parallel_size`

### Issue: Unstable Training

**Solutions**:

1. Lower learning rate
2. Increase warmup steps
3. Use gradient clipping (`--clip-grad 1.0`)
4. Reduce router z-loss coefficient

## Resources

- **DeepSpeed MoE Tutorial**: https://www.deepspeed.ai/tutorials/mixture-of-experts-nlg/
- **Megatron-DeepSpeed**: https://github.com/microsoft/Megatron-DeepSpeed
- **Example Scripts**: `examples_deepspeed/MoE/`
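
### Worked Example: Expert Capacity

As a worked illustration of the capacity formula from Core MoE Parameters, capacity = (tokens/num_experts) × capacity_factor, the sketch below computes per-expert capacity and the resulting token drop rate. It is a minimal sketch, not DeepSpeed's implementation: the helper names `expert_capacity` and `token_drop_rate` are hypothetical, and rounding the result up and applying `min_capacity` as a floor are assumptions about how the limit is enforced.

```python
import math


def expert_capacity(tokens, num_experts, capacity_factor, min_capacity=4):
    """Tokens each expert may accept per batch (hypothetical helper).

    Assumption: the formula result is rounded up and floored at min_capacity.
    """
    capacity = math.ceil(tokens / num_experts * capacity_factor)
    return max(capacity, min_capacity)


def token_drop_rate(expert_counts, capacity):
    """Fraction of routed tokens that exceed their expert's capacity."""
    total = sum(expert_counts)
    if total == 0:
        return 0.0
    dropped = sum(max(count - capacity, 0) for count in expert_counts)
    return dropped / total


# One micro-batch from the example script: 4 sequences x 2048 tokens.
tokens = 4 * 2048
print(expert_capacity(tokens, 128, 1.25))  # 80  (training factor)
print(expert_capacity(tokens, 128, 2.0))   # 128 (evaluation factor)
print(expert_capacity(16, 128, 1.25))      # 4   (min_capacity floor applies)

# Four experts with capacity 80: only the first one overflows, by 20 tokens.
print(token_drop_rate([100, 80, 60, 80], 80))  # 0.0625
```

With the training settings used in this guide, a perfectly balanced router sends 64 tokens to each expert, so a factor of 1.25 leaves 16 tokens of headroom per expert before any are dropped.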

Source: claude-code-templates (MIT).
