# Training Recipes
Complete hyperparameter configurations and training recipes for GPT, LLaMA, and Mixtral models.
## GPT-3 Training Recipes
### GPT-3 15B Configuration
**Model architecture**:
```yaml
num-layers: 32
hidden-size: 6144
num-attention-heads: 48
ffn-hidden-size: 24576  # 4 × hidden-size
seq-length: 4096
max-position-embeddings: 4096
position-embedding-type: rope
squared-relu: true
group-query-attention: true
num-query-groups: 8
```
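As a rough sanity check on the shape above, a back-of-envelope parameter count (assuming a GPT-2 style vocabulary of 50,257 and ignoring the K/V savings from grouped-query attention) lands near the 15B name:

```python
# Rough decoder-only transformer parameter count for the config above.
# Illustrative only: vocab size is assumed, GQA and layer norms are ignored.

def approx_params(num_layers, hidden, ffn_hidden, vocab=50257):
    attn = 4 * hidden * hidden      # Q, K, V, and output projections
    mlp = 2 * hidden * ffn_hidden   # up- and down-projections
    embed = vocab * hidden
    return num_layers * (attn + mlp) + embed

print(f"{approx_params(32, 6144, 24576) / 1e9:.1f}B parameters")  # ≈ 14.8B
```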
**Training hyperparameters**:
```yaml
# Batch configuration
micro-batch-size: 4
global-batch-size: 1152
rampup-batch-size: [384, 384, 97656250]  # start, increment, ramp-up samples
# Learning rate schedule
lr: 4.5e-4
min-lr: 4.5e-5
lr-decay-style: cosine
lr-decay-samples: 1949218748
lr-warmup-samples: 3906252  # ~16B tokens at seq-length 4096
# Optimizer
optimizer: adam
adam-beta1: 0.9
adam-beta2: 0.95
weight-decay: 0.1
clip-grad: 1.0
# Precision
bf16: true
# Parallelism
tensor-model-parallel-size: 8
pipeline-model-parallel-size: 1
sequence-parallel: true
use-distributed-optimizer: true
overlap-grad-reduce: true
overlap-param-gather: true
```
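Converting these sample counts to tokens is a quick consistency check on the schedule; note that warmup plus decay samples sum to exactly the 1,953,125,000 total training samples:

```python
# Token-count arithmetic for the schedule above (seq-length 4096).
seq_len = 4096
warmup_samples = 3_906_252
decay_samples = 1_949_218_748
global_batch = 1152

warmup_tokens = warmup_samples * seq_len                    # LR warmup span
total_tokens = (warmup_samples + decay_samples) * seq_len   # full schedule
tokens_per_step = global_batch * seq_len                    # per optimizer step

print(f"warmup: {warmup_tokens / 1e9:.1f}B tokens")         # 16.0B
print(f"full schedule: {total_tokens / 1e12:.1f}T tokens")  # 8.0T
print(f"per step: {tokens_per_step / 1e6:.1f}M tokens")     # 4.7M
```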
**Command**:
```bash
# Multi-node runs also need --node_rank, --master_addr, and --master_port
# (or an rdzv backend) set per node.
torchrun --nproc_per_node=8 --nnodes=4 pretrain_gpt.py \
    --num-layers 32 \
    --hidden-size 6144 \
    --num-attention-heads 48 \
    --ffn-hidden-size 24576 \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --position-embedding-type rope \
    --squared-relu \
    --group-query-attention \
    --num-query-groups 8 \
    --micro-batch-size 4 \
    --global-batch-size 1152 \
    --rampup-batch-size 384 384 97656250 \
    --lr 4.5e-4 \
    --min-lr 4.5e-5 \
    --lr-decay-style cosine \
    --lr-decay-samples 1949218748 \
    --lr-warmup-samples 3906252 \
    --train-samples 1953125000 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --bf16 \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --sequence-parallel \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --data-path /path/to/data \
    --vocab-file /path/to/vocab.json \
    --merge-file /path/to/merges.txt \
    --save /checkpoints/gpt3-15b \
    --load /checkpoints/gpt3-15b \
    --save-interval 1000 \
    --eval-interval 100
```
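The parallel layout implied by this launch follows from the flag values; a quick sketch of how the 32 GPUs divide into data-parallel replicas and how the global batch is accumulated:

```python
# Parallelism layout for the launch above: 4 nodes × 8 GPUs,
# tensor parallel 8, pipeline parallel 1.
world_size = 4 * 8
tp, pp = 8, 1
dp = world_size // (tp * pp)  # Megatron derives data-parallel size as world / (TP × PP)

global_batch, micro_batch = 1152, 4
accum_steps = global_batch // (dp * micro_batch)  # gradient-accumulation steps per iteration

print(f"data-parallel replicas: {dp}")           # 4
print(f"gradient accumulation steps: {accum_steps}")  # 72
```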
### GPT-3 175B Configuration
**Model architecture**:
```yaml
num-layers: 96
hidden-size: 12288
num-attention-heads: 96
ffn-hidden-size: 49152
seq-length: 2048
max-position-embeddings: 2048
```
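This is the classic GPT-3 175B shape, where the transformer blocks (roughly 12·L·H² parameters) dominate the count:

```python
# Dominant parameter term for the 175B shape: 4H^2 attention + 8H^2 MLP per layer.
L, H = 96, 12288
block_params = 12 * L * H * H
print(f"{block_params / 1e9:.0f}B")  # 174B before embeddings
```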
**Training hyperparameters**:
```yaml
micro-batch-size: 1
global-batch-size: 1536
lr: 6e-5
min-lr: 6e-6
lr-decay-style: cosine
lr-warmup-iters: 2000
train-iters: 150000
adam-beta1: 0.9
adam-beta2: 0.95
weight-decay: 0.1
clip-grad: 1.0
bf16: true
# Parallelism for 512 GPUs
tensor-mod
```
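The iteration count and batch shape above translate to token throughput as follows (an upper bound, since any batch-size ramp-up would reduce the total):

```python
# Throughput arithmetic for the 175B schedule above.
tokens_per_iter = 1536 * 2048       # global batch × seq length
total_tokens = tokens_per_iter * 150_000
print(f"{tokens_per_iter / 1e6:.1f}M tokens/iter, "
      f"{total_tokens / 1e9:.0f}B tokens total")  # 3.1M tokens/iter, 472B tokens total
```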