# Ray Train Multi-Node Setup
## Ray Cluster Architecture
Ray Train runs on a **Ray cluster** consisting of one head node and multiple worker nodes.
**Components**:
- **Head node**: coordinates the worker nodes and runs scheduling
- **Worker nodes**: execute the training tasks
- **Object store**: shared memory across nodes (backed by Apache Arrow/Plasma); see the sketch below
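As a concrete illustration of the object store, the following minimal sketch (assuming a running cluster reachable via `ray.init(address='auto')`) puts a NumPy array into shared memory once and lets a remote task read it without each task receiving its own serialized copy:
```python
import ray
import numpy as np

ray.init(address='auto')  # connect to the running cluster

# Store the array once in the distributed object store
array_ref = ray.put(np.zeros((1000, 1000)))

@ray.remote
def mean_of(arr):
    # ObjectRef arguments are resolved automatically, so `arr`
    # arrives here as the NumPy array, read from shared memory
    return arr.mean()

print(ray.get(mean_of.remote(array_ref)))  # 0.0
```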
## Local Multi-Node Setup
### Manual Cluster Setup
**Head node**:
```bash
# Start the Ray head node
ray start --head --port=6379 --dashboard-host=0.0.0.0

# Output:
# Started Ray on this node with:
#   - Head node IP: 192.168.1.100
#   - Dashboard: http://192.168.1.100:8265
```
**Worker nodes**:
```bash
# Connect to the head node
ray start --address=192.168.1.100:6379

# Output:
# Started Ray on this node.
# Connected to Ray cluster.
```
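To tear down a manually started node (head or worker) later, Ray ships a matching stop command:
```bash
# Stop all Ray processes on the current node
ray stop
```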
**Training script**:
```python
import ray
from ray.train.torch import TorchTrainer
from ray.train import ScalingConfig

# Connect to the cluster
ray.init(address='auto')  # Auto-detect the running cluster

# Train across all nodes (train_func is sketched below)
trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(
        num_workers=16,               # Total number of training workers across all nodes
        use_gpu=True,
        placement_strategy="SPREAD",  # Spread workers across the nodes
    ),
)
result = trainer.fit()
```
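The `train_func` above is left undefined in this snippet; as a rough sketch, a per-worker training function could look like the following (the linear model, random data, and epoch count are placeholders, not part of the original example):
```python
import torch
import torch.nn as nn
import ray.train
import ray.train.torch

def train_func():
    # Placeholder model; prepare_model wraps it in DistributedDataParallel
    # and moves it to this worker's device
    model = ray.train.torch.prepare_model(nn.Linear(10, 1))
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.MSELoss()
    device = ray.train.torch.get_device()

    for epoch in range(2):                         # placeholder epoch count
        inputs = torch.randn(32, 10).to(device)    # placeholder random data
        targets = torch.randn(32, 1).to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        # Report per-worker metrics back to Ray Train
        ray.train.report({"epoch": epoch, "loss": loss.item()})
```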
### Checking Cluster Status
```bash
# View cluster status
ray status

# Output:
# ======== Cluster Status ========
# Nodes: 4
# Total CPUs: 128
# Total GPUs: 32
# Total memory: 512 GB
```
**Python API**:
```python
import ray

ray.init(address='auto')

# Get total cluster resources
print(ray.cluster_resources())
# {'CPU': 128.0, 'GPU': 32.0, 'memory': 549755813888, 'node:192.168.1.100': 1.0, ...}

# Get currently available (unreserved) resources
print(ray.available_resources())
```
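For a per-node view (useful for confirming that every machine actually joined), `ray.nodes()` returns one metadata dict per node:
```python
import ray

ray.init(address='auto')

# One entry per node: its address, liveness flag, and contributed resources
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"], node["Resources"])
```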
## Cloud Deployment
### AWS EC2 Cluster
**Cluster config** (`cluster.yaml`):
```yaml
cluster_name: ray-train-cluster
max_workers: 3  # 3 worker nodes

provider:
  type: aws
  region: us-west-2
  availability_zone: us-west-2a

auth:
  ssh_user: ubuntu

head_node_type: head_node

available_node_types:
  head_node:
    node_config:
      InstanceType: p3.2xlarge        # V100 GPU
      ImageId: ami-0a2363a9cff180a64  # Deep Learning AMI
    resources: {"CPU": 8, "GPU": 1}
    min_workers: 0
    max_workers: 0
  worker_node:
    node_config:
      InstanceType: p3.8xlarge        # 4× V100
      ImageId: ami-0a2363a9cff180a64
    resources: {"CPU": 32, "GPU": 4}
    min_workers: 3
    max_workers: 3

setup_commands:
  - pip install -U "ray[train]" torch transformers

head_setup_commands:
  - pip install -U "ray[default]"
```
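With the config saved as `cluster.yaml`, the Ray cluster launcher CLI handles the full lifecycle; `train.py` below is a placeholder for your own training script:
```bash
# Provision the head and worker nodes on EC2
ray up cluster.yaml -y

# Run a training script on the cluster
ray submit cluster.yaml train.py

# SSH into the head node
ray attach cluster.yaml

# Terminate all nodes when finished
ray down cluster.yaml -y
```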