# Training Recipes
Complete hyperparameter configurations and training recipes for GPT, LLaMA, and Mixtral models.
## GPT-3 Training Recipes
### GPT-3 15B Configuration
**Model architecture**:
```yaml
num-layers: 32
hidden-size: 6144
num-attention-heads: 48
ffn-hidden-size: 24576  # 4 × hidden-size
seq-length: 4096
max-position-embeddings: 4096
position-embedding-type: rope
squared-relu: true
group-query-attention: true
num-query-groups: 8
```
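As a rough sanity check on the shape above, a back-of-envelope parameter count (assuming a GPT-2 style vocabulary of 50,257 and ignoring the K/V savings from grouped-query attention) lands near the 15B name:

```python
# Rough decoder-only transformer parameter count for the config above.
# Illustrative only: vocab size is assumed, GQA and layer norms are ignored.

def approx_params(num_layers, hidden, ffn_hidden, vocab=50257):
    attn = 4 * hidden * hidden      # Q, K, V, and output projections
    mlp = 2 * hidden * ffn_hidden   # up- and down-projections
    embed = vocab * hidden
    return num_layers * (attn + mlp) + embed

print(f"{approx_params(32, 6144, 24576) / 1e9:.1f}B parameters")  # ≈ 14.8B
```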
**Training hyperparameters**:
```yaml
# Batch configuration
micro-batch-size: 4
global-batch-size: 1152
rampup-batch-size: [384, 384, 97656250]  # start, increment, ramp-up samples
# Learning rate schedule
lr: 4.5e-4
min-lr: 4.5e-5
lr-decay-style: cosine
lr-decay-samples: 1949218748
lr-warmup-samples: 3906252  # ~16B tokens at seq-length 4096
# Optimizer
optimizer: adam
adam-beta1: 0.9
adam-beta2: 0.95
weight-decay: 0.1
clip-grad: 1.0
# Precision
bf16: true
# Parallelism
tensor-model-parallel-size: 8
pipeline-model-parallel-size: 1
sequence-parallel: true
use-distributed-optimizer: true
overlap-grad-reduce: true
overlap-param-gather: true
```
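Converting these sample counts to tokens is a quick consistency check on the schedule; note that warmup plus decay samples sum to exactly the 1,953,125,000 total training samples:

```python
# Token-count arithmetic for the schedule above (seq-length 4096).
seq_len = 4096
warmup_samples = 3_906_252
decay_samples = 1_949_218_748
global_batch = 1152

warmup_tokens = warmup_samples * seq_len                    # LR warmup span
total_tokens = (warmup_samples + decay_samples) * seq_len   # full schedule
tokens_per_step = global_batch * seq_len                    # per optimizer step

print(f"warmup: {warmup_tokens / 1e9:.1f}B tokens")         # 16.0B
print(f"full schedule: {total_tokens / 1e12:.1f}T tokens")  # 8.0T
print(f"per step: {tokens_per_step / 1e6:.1f}M tokens")     # 4.7M
```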
**Command**:
```bash
# Multi-node runs also need --node_rank, --master_addr, and --master_port
# (or an rdzv backend) set per node.
torchrun --nproc_per_node=8 --nnodes=4 pretrain_gpt.py \
    --num-layers 32 \
    --hidden-size 6144 \
    --num-attention-heads 48 \
    --ffn-hidden-size 24576 \
    --seq-length 4096 \
    --max-position-embeddings 4096 \
    --position-embedding-type rope \
    --squared-relu \
    --group-query-attention \
    --num-query-groups 8 \
    --micro-batch-size 4 \
    --global-batch-size 1152 \
    --rampup-batch-size 384 384 97656250 \
    --lr 4.5e-4 \
    --min-lr 4.5e-5 \
    --lr-decay-style cosine \
    --lr-decay-samples 1949218748 \
    --lr-warmup-samples 3906252 \
    --train-samples 1953125000 \
    --adam-beta1 0.9 \
    --adam-beta2 0.95 \
    --weight-decay 0.1 \
    --clip-grad 1.0 \
    --bf16 \
    --tensor-model-parallel-size 8 \
    --pipeline-model-parallel-size 1 \
    --sequence-parallel \
    --use-distributed-optimizer \
    --overlap-grad-reduce \
    --overlap-param-gather \
    --data-path /path/to/data \
    --vocab-file /path/to/vocab.json \
    --merge-file /path/to/merges.txt \
    --save /checkpoints/gpt3-15b \
    --load /checkpoints/gpt3-15b \
    --save-interval 1000 \
    --eval-interval 100
```
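The parallel layout implied by this launch follows from the flag values; a quick sketch of how the 32 GPUs divide into data-parallel replicas and how the global batch is accumulated:

```python
# Parallelism layout for the launch above: 4 nodes × 8 GPUs,
# tensor parallel 8, pipeline parallel 1.
world_size = 4 * 8
tp, pp = 8, 1
dp = world_size // (tp * pp)  # Megatron derives data-parallel size as world / (TP × PP)

global_batch, micro_batch = 1152, 4
accum_steps = global_batch // (dp * micro_batch)  # gradient-accumulation steps per iteration

print(f"data-parallel replicas: {dp}")           # 4
print(f"gradient accumulation steps: {accum_steps}")  # 72
```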
### GPT-3 175B Configuration
**Model architecture**:
```yaml
num-layers: 96
hidden-size: 12288
num-attention-heads: 96
ffn-hidden-size: 49152
seq-length: 2048
max-position-embeddings: 2048
```
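This is the classic GPT-3 175B shape, where the transformer blocks (roughly 12·L·H² parameters) dominate the count:

```python
# Dominant parameter term for the 175B shape: 4H^2 attention + 8H^2 MLP per layer.
L, H = 96, 12288
block_params = 12 * L * H * H
print(f"{block_params / 1e9:.0f}B")  # 174B before embeddings
```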
**Training hyperparameters**:
```yaml
micro-batch-size: 1
global-batch-size: 1536
lr: 6e-5
min-lr: 6e-6
lr-decay-style: cosine
lr-warmup-iters: 2000
train-iters: 150000
adam-beta1: 0.9
adam-beta2: 0.95
weight-decay: 0.1
clip-grad: 1.0
bf16: true
# Parallelism for 512 GPUs
tensor-mod
```
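The iteration count and batch shape above translate to token throughput as follows (an upper bound, since any batch-size ramp-up would reduce the total):

```python
# Throughput arithmetic for the 175B schedule above.
tokens_per_iter = 1536 * 2048       # global batch × seq length
total_tokens = tokens_per_iter * 150_000
print(f"{tokens_per_iter / 1e6:.1f}M tokens/iter, "
      f"{total_tokens / 1e9:.0f}B tokens total")  # 3.1M tokens/iter, 472B tokens total
```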