# Accelerate Performance Tuning
## Profiling
### Basic Profiling
```python
from accelerate import Accelerator
import time

accelerator = Accelerator()

# Warmup: run a few steps so CUDA kernels and caches are initialized before timing
warmup_iter = iter(dataloader)
for _ in range(10):
    batch = next(warmup_iter)
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

# Time the training loop
start = time.time()
total_batches = 100
for i, batch in enumerate(dataloader):
    if i >= total_batches:
        break
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
accelerator.wait_for_everyone()  # synchronize all processes before stopping the clock
elapsed = time.time() - start

# Metrics (batch_size is the per-device batch size)
batches_per_sec = total_batches / elapsed
samples_per_sec = (total_batches * batch_size * accelerator.num_processes) / elapsed
print(f"Throughput: {samples_per_sec:.2f} samples/sec")
print(f"Batches/sec: {batches_per_sec:.2f}")
```
### PyTorch Profiler Integration
```python
from torch.profiler import profile, ProfilerActivity

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    for i, batch in enumerate(dataloader):
        if i >= 10:  # profile the first 10 batches
            break
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

# Print profiling results
print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=20
))

# Export a Chrome trace; open it at chrome://tracing
prof.export_chrome_trace("trace.json")
```
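Profiling every batch of a long run is expensive. As a sketch, `torch.profiler` also accepts a `schedule` plus a TensorBoard trace handler, so only a few sampled steps are recorded (the `./profiler_logs` directory name is an arbitrary choice here):

```python
from torch.profiler import (
    profile, ProfilerActivity, schedule, tensorboard_trace_handler
)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    # per cycle: skip 1 step, warm up for 1, record 3; repeat for 2 cycles
    schedule=schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
) as prof:
    for i, batch in enumerate(dataloader):
        if i >= 10:
            break
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        prof.step()  # advance the profiler schedule once per batch
```

View the traces with `tensorboard --logdir ./profiler_logs` (this requires the `torch-tb-profiler` plugin).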
## Memory Optimization
### 1. Gradient Accumulation
**Problem**: Large batch sizes cause out-of-memory (OOM) errors.
**Solution**: Accumulate gradients across several micro-batches before each optimizer step.
```python
accelerator = Accelerator(gradient_accumulation_steps=8)

# Effective batch size = batch_size × accumulation_steps × num_gpus
# Example: 4 × 8 × 8 = 256

for batch in dataloader:
    with accelerator.accumulate(model):  # handles the accumulation logic
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        optimizer.step()  # a real step happens only at accumulation boundaries
        optimizer.zero_grad()
```
**Memory savings**: With 8 accumulation steps, activation memory is roughly 8× lower than running the same effective batch size in a single forward pass.
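One subtlety when combining accumulation with gradient clipping: the full gradient only exists on the micro-batch that closes an accumulation window. A sketch using `accelerator.sync_gradients` to clip only at those boundaries (`max_norm=1.0` is an arbitrary example value):

```python
for batch in dataloader:
    with accelerator.accumulate(model):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)
        # sync_gradients is True only on the last micro-batch of a window
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()
```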
### 2. Gradient Checkpointing
**Enable it on the model**:
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "gpt2",
    use_cache=False,  # the KV cache is incompatible with gradient checkpointing
)
model.gradient_checkpointing_enable()
```
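For custom models without a built-in `gradient_checkpointing_enable()`, the same trade (recomputing activations during backward instead of storing them) is available directly via `torch.utils.checkpoint`. A minimal sketch with a hypothetical toy module:

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Hypothetical toy model; each block's activations are recomputed in backward."""

    def __init__(self, dim: int = 1024, depth: int = 4):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(depth)
        )

    def forward(self, x):
        for block in self.blocks:
            # Store only the block input; intermediates are recomputed on backward
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

The saving is in activation memory; the cost is an extra forward pass through each checkpointed block during the backward pass.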