# Megatron-LM Integration with Accelerate
## Overview
Accelerate supports Megatron-LM for large-scale model training via tensor parallelism and pipeline parallelism.
**Megatron features**:
- **Tensor Parallelism (TP)**: splits individual layers across GPUs
- **Pipeline Parallelism (PP)**: splits the model's depth across GPUs
- **Data Parallelism (DP)**: replicates the model across groups of GPUs
- **Sequence Parallelism**: splits sequences for long contexts
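These degrees compose multiplicatively: a job needs TP × PP × DP GPUs in total. A minimal sketch (plain Python; the helper name is illustrative, not part of any library):

```python
def world_size(tp: int, pp: int, dp: int) -> int:
    """Total GPUs required: each data-parallel replica is a
    tp x pp grid of model shards, replicated dp times."""
    return tp * pp * dp

# Example: TP=2, PP=2, DP=2 requires 8 GPUs.
print(world_size(2, 2, 2))  # -> 8
```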
## Setup
### Install Megatron-LM
```bash
# Clone the Megatron-LM repository
git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
pip install -e .

# Install Apex (NVIDIA optimizations)
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation \
  --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```
### Accelerate Configuration
```bash
accelerate config
```
**Prompts**:
```text
In which compute environment are you running?
> This machine
Which type of machine are you using?
> Multi-GPU
How many different machines will you use?
> 1
Do you want to use DeepSpeed/FSDP?
> No
Do you want to use Megatron-LM?
> Yes
What is the Tensor Parallelism degree? [1-8]
> 2
Do you want to enable Sequence Parallelism?
> No
What is the Pipeline Parallelism degree? [1-8]
> 2
What is the Data Parallelism degree? [1-8]
> 2
Where to perform activation checkpointing? ['SELECTIVE', 'FULL', 'NONE']
> SELECTIVE
Where to perform activation partitioning? ['SEQUENTIAL', 'UNIFORM']
> SEQUENTIAL
```
**Generated configuration** (`~/.cache/huggingface/accelerate/default_config.yaml`):
```yaml
compute_environment: LOCAL_MACHINE
distributed_type: MEGATRON_LM
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
megatron_lm_config:
  megatron_lm_gradient_clipping: 1.0
  megatron_lm_learning_rate_decay_iters: 320000
  megatron_lm_num_micro_batches: 1
  megatron_lm_pp_degree: 2
  megatron_lm_recompute_activations: true
  megatron_lm_sequence_parallelism: false
  megatron_lm_tp_degree: 2
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
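A config is only consistent when the parallel degrees divide the world size: here TP × PP = 4 leaves DP = 2 for the 8 processes. A small sanity check you could run against such a file (plain Python with naive line-based parsing, to avoid a PyYAML dependency; the key names follow the config above):

```python
def check_parallel_degrees(config_text: str) -> bool:
    """Return True if tp_degree * pp_degree divides num_processes,
    leaving at least one data-parallel replica."""
    values = {}
    for line in config_text.splitlines():
        key, sep, val = line.strip().partition(":")
        if sep:
            values[key.strip()] = val.strip()
    tp = int(values["megatron_lm_tp_degree"])
    pp = int(values["megatron_lm_pp_degree"])
    procs = int(values["num_processes"])
    # Data parallelism fills whatever remains of the world size.
    dp, remainder = divmod(procs, tp * pp)
    return remainder == 0 and dp >= 1

sample = """
megatron_lm_pp_degree: 2
megatron_lm_tp_degree: 2
num_processes: 8
"""
print(check_parallel_degrees(sample))  # -> True
```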
## Parallelism Strategies
### Tensor Parallelism (TP)
**Splits each Transformer layer across multiple GPUs**:
```python
# Layer split across 2 GPUs
# GPU 0: first half of the attention heads
# GPU 1: second half of the attention heads
# Each GPU computes a partial output
# An all-reduce combines the results
```
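The partial-output-plus-all-reduce pattern can be simulated on one machine. This sketch (plain Python, no real GPUs; the split is over a weight matrix's rows, as in a row-parallel linear layer) shows that summing the two partial products reproduces the full matmul:

```python
def matmul(a, b):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

# Full weight matrix W (4 x 2) and input X (1 x 4).
X = [[1.0, 2.0, 3.0, 4.0]]
W = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0],
     [2.0, 0.0]]

# "GPU 0" holds the top half of W's rows and the matching input slice;
# "GPU 1" holds the bottom half. Each computes a partial output.
partial0 = matmul([X[0][:2]], W[:2])
partial1 = matmul([X[0][2:]], W[2:])

# The "all-reduce": element-wise sum of the partial outputs.
Y = [[p0 + p1 for p0, p1 in zip(partial0[0], partial1[0])]]

assert Y == matmul(X, W)  # matches the single-device result
print(Y)  # -> [[12.0, 5.0]]
```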