simpo-training
[ SKILL_DOCUMENTATION ]
# SimPO - Simple Preference Optimization
## Quick Start
SimPO is a reference-free preference optimization method: it outperforms DPO while eliminating the reference model entirely.
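The idea can be sketched in plain Python: SimPO scores each response by its length-normalized log-probability under the policy (no reference model needed) and penalizes pairs where the chosen response does not beat the rejected one by a target margin. This is an illustrative sketch, not code from the alignment-handbook repo; `simpo_loss` and its per-token log-probability inputs are hypothetical names.

```python
import math

def simpo_loss(chosen_logps, rejected_logps, beta=2.0, gamma_beta_ratio=0.5):
    """Sigmoid SimPO loss for one preference pair.

    chosen_logps / rejected_logps: per-token log-probabilities of the
    chosen and rejected responses under the policy being trained.
    """
    # Length-normalized implicit rewards -- no reference model involved
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    gamma = beta * gamma_beta_ratio  # target reward margin
    margin = r_chosen - r_rejected - gamma
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the chosen response is clearly more likely, the loss is small
loss = simpo_loss([-0.1, -0.2, -0.1], [-1.5, -2.0])
```

Note the length normalization: dividing by the response length keeps long and short responses comparable, which is what lets SimPO drop the reference model that DPO uses for the same purpose.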
**Installation**:
```bash
# Create the environment
conda create -n simpo python=3.10 && conda activate simpo

# Install PyTorch 2.2.2
# See: https://pytorch.org/get-started/locally/

# Install alignment-handbook
git clone https://github.com/huggingface/alignment-handbook.git
cd alignment-handbook
python -m pip install .

# Install Flash Attention 2
python -m pip install flash-attn --no-build-isolation
```
**Training** (Mistral 7B):
```bash
ACCELERATE_LOG_LEVEL=info accelerate launch \
  --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py \
  training_configs/mistral-7b-base-simpo.yaml
```
## Common Workflows
### Workflow 1: Training from a Base Model (Mistral 7B)
**Config** (`mistral-7b-base-simpo.yaml`):
```yaml
# Model
model_name_or_path: mistralai/Mistral-7B-v0.1
torch_dtype: bfloat16

# Dataset
dataset_mixer:
  HuggingFaceH4/ultrafeedback_binarized: 1.0
dataset_splits:
  - train_prefs
  - test_prefs

# SimPO hyperparameters
beta: 2.0                # reward scaling (2.0-10.0)
gamma_beta_ratio: 0.5    # target margin (0-1)
loss_type: sigmoid       # sigmoid or hinge
sft_weight: 0.0          # optional SFT regularization

# Training
learning_rate: 5e-7      # critical: 3e-7 to 1e-6
num_train_epochs: 1
per_device_train_batch_size: 1
gradient_accumulation_steps: 8

# Output
output_dir: ./outputs/mistral-7b-simpo
```
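The batch settings above determine the effective global batch size only together with the GPU count; the 8 GPUs below are an assumption about the node, not something the config specifies:

```python
# Values from the config above
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 8  # assumption: a single 8-GPU node; adjust to your hardware

# Preference pairs seen per optimizer step across all devices
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_gpus)
print(effective_batch_size)  # 64 with the assumed 8 GPUs
```

If you change the GPU count, adjust `gradient_accumulation_steps` to keep the effective batch size roughly constant, since the learning-rate recommendation assumes it.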
**Launch training**:
```bash
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml
```
### Workflow 2: Fine-tuning an Instruct Model (Llama 3 8B)
**Config** (`llama3-8b-instruct-simpo.yaml`):
```yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
dataset_mixer:
  argilla/ultrafeedback-binarized-preferences-cleaned: 1.0
beta: 2.5
gamma_beta_ratio: 0.5
learning_rate: 5e-7
sft_weight: 0.1          # add an SFT loss to preserve capabilities
num_train_epochs: 1
per_device_train_batch_size: 2
gradient_accumulation_steps: 4
output_dir: ./outputs/llama3-8b-simpo
```
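Setting `sft_weight: 0.1` mixes a standard SFT term (cross-entropy on the chosen response) into the preference objective, which helps an instruct model keep its existing capabilities. A minimal sketch of the combined loss, assuming per-token log-probabilities as inputs; the function and argument names are illustrative, not from the training script:

```python
import math

def combined_loss(chosen_logps, rejected_logps,
                  beta=2.5, gamma_beta_ratio=0.5, sft_weight=0.1):
    """SimPO loss plus an SFT regularizer weighted by sft_weight."""
    # Length-normalized implicit rewards, as in plain SimPO
    r_chosen = beta * sum(chosen_logps) / len(chosen_logps)
    r_rejected = beta * sum(rejected_logps) / len(rejected_logps)
    margin = r_chosen - r_rejected - beta * gamma_beta_ratio
    simpo = -math.log(1.0 / (1.0 + math.exp(-margin)))

    # SFT term: mean negative log-likelihood of the chosen response
    sft_nll = -sum(chosen_logps) / len(chosen_logps)
    return simpo + sft_weight * sft_nll
```

With `sft_weight: 0.0` this reduces to the plain SimPO loss of Workflow 1; the 0.1 weight keeps the pull toward the chosen responses' token distribution small relative to the preference signal.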
**Launch**:
```bash
accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \
  scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml
```
### Workflow 3: Reasoning-heavy Tasks (Lower Learning Rate)
**For math/code tasks**:
```yaml
model_name_or_path: deepseek-ai/d
```