# Ray Train Multi-Node Setup
## Ray Cluster Architecture
Ray Train runs on a **Ray cluster** consisting of one head node and multiple worker nodes.
**Components**:
- **Head node**: coordinates the worker nodes and runs scheduling
- **Worker nodes**: execute the training tasks
- **Object store**: shared memory across nodes (backed by Apache Arrow/Plasma); see the sketch below
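As a concrete illustration of the object store, the following minimal sketch (assuming a running cluster reachable via `ray.init(address='auto')`) puts a NumPy array into shared memory once and lets a remote task read it without each task receiving its own serialized copy:
```python
import ray
import numpy as np

ray.init(address='auto')  # connect to the running cluster

# Store the array once in the distributed object store
array_ref = ray.put(np.zeros((1000, 1000)))

@ray.remote
def mean_of(arr):
    # ObjectRef arguments are resolved automatically, so `arr`
    # arrives here as the NumPy array, read from shared memory
    return arr.mean()

print(ray.get(mean_of.remote(array_ref)))  # 0.0
```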
## Local Multi-Node Setup
### Manual Cluster Setup
**Head node**:
```bash
# Start the Ray head node
ray start --head --port=6379 --dashboard-host=0.0.0.0

# Output:
# Started Ray on this node with:
#   - Head node IP: 192.168.1.100
#   - Dashboard: http://192.168.1.100:8265
```
**Worker nodes**:
```bash
# Connect to the head node
ray start --address=192.168.1.100:6379

# Output:
# Started Ray on this node.
# Connected to Ray cluster.
```
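To tear down a manually started node (head or worker) later, Ray ships a matching stop command:
```bash
# Stop all Ray processes on the current node
ray stop
```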
**Training script**:
```python
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

# Connect to the cluster
ray.init(address='auto')  # Auto-detect the running cluster

# Train across all nodes (train_func is sketched below)
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=16,               # Total number of training workers across all nodes
        use_gpu=True,
        placement_strategy="SPREAD",  # Spread workers across the nodes
    ),
)
result = trainer.fit()
```
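The `train_func` above is left undefined in this snippet; as a rough sketch, a per-worker training function could look like the following (the linear model, random data, and epoch count are placeholders, not part of the original example):
```python
import torch
import torch.nn as nn
import ray.train
import ray.train.torch

def train_func():
    # Placeholder model; prepare_model wraps it in DistributedDataParallel
    # and moves it to this worker's device
    model = ray.train.torch.prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    device = ray.train.torch.get_device()

    for epoch in range(2):                         # placeholder epoch count
        inputs = torch.randn(32, 10).to(device)    # placeholder random data
        targets = torch.randn(32, 1).to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Report per-worker metrics back to Ray Train
        ray.train.report({"epoch": epoch, "loss": loss.item()})
```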
### Checking Cluster Status
```bash
# View cluster status
ray status

# Output:
# ======== Cluster Status ========
# Nodes: 4
# Total CPUs: 128
# Total GPUs: 32
# Total memory: 512 GB
```
**Python API**:
```python
import ray

ray.init(address='auto')

# Get total cluster resources
print(ray.cluster_resources())
# {'CPU': 128.0, 'GPU': 32.0, 'memory': 549755813888, 'node:192.168.1.100': 1.0, ...}

# Get currently available (unreserved) resources
print(ray.available_resources())
```
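For a per-node view (useful for confirming that every machine actually joined), `ray.nodes()` returns one metadata dict per node:
```python
import ray

ray.init(address='auto')

# One entry per node: its address, liveness flag, and contributed resources
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])
```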
## Cloud Deployment
### AWS EC2 Cluster
**Cluster config** (`cluster.yaml`):
```yaml
cluster_name: ray-train-cluster
max_workers: 3  # 3 worker nodes

provider:
  type: aws
  region: us-west-2
  availability_zone: us-west-2a

auth:
  ssh_user: ubuntu

head_node_type: head_node

available_node_types:
  head_node:
    node_config:
      InstanceType: p3.2xlarge        # V100 GPU
      ImageId: ami-0a2363a9cff180a64  # Deep Learning AMI
    resources: {"CPU": 8, "GPU": 1}
    min_workers: 0
    max_workers: 0
  worker_node:
    node_config:
      InstanceType: p3.8xlarge        # 4× V100
      ImageId: ami-0a2363a9cff180a64
    resources: {"CPU": 32, "GPU": 4}
    min_workers: 3
    max_workers: 3

setup_commands:
  - pip install -U "ray[train]" torch transformers

head_setup_commands:
  - pip install -U "ray[default]"
```
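With the config saved as `cluster.yaml`, the Ray cluster launcher CLI handles the full lifecycle; `train.py` below is a placeholder for your own training script:
```bash
# Provision the head and worker nodes on EC2
ray up cluster.yaml -y

# Run a training script on the cluster
ray submit cluster.yaml train.py

# SSH into the head node
ray attach cluster.yaml

# Terminate all nodes when finished
ray down cluster.yaml -y
```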