[ PROMPT_NODE_22507 ]

Infrastructure Skypilot – Advanced Usage

[ SKILL_DOCUMENTATION ]

# SkyPilot Advanced Usage Guide ## Multi-Cloud Strategies ### Cloud selection patterns ```yaml # Prefer specific clouds in order resources: accelerators: A100:8 any_of: - cloud: gcp region: us-central1 - cloud: aws region: us-west-2 - cloud: azure region: westus2 ``` ### Wildcard regions ```yaml resources: cloud: aws region: us-* # Any US region accelerators: A100:8 ``` ### Kubernetes + Cloud fallback ```yaml resources: accelerators: A100:8 any_of: - cloud: kubernetes - cloud: aws - cloud: gcp ``` ## Advanced Resource Configuration ### Instance type constraints ```yaml resources: instance_type: p4d.24xlarge # Specific instance # OR cpus: 32+ memory: 128+ accelerators: A100:8 ``` ### Disk configuration ```yaml resources: disk_size: 500 # GB disk_tier: best # low, medium, high, ultra, best ``` ### Network tier ```yaml resources: network_tier: best # High-performance networking ``` ## Production Managed Jobs ### Job configuration ```yaml name: production-training resources: accelerators: H100:8 use_spot: true spot_recovery: FAILOVER # Retry configuration max_restarts_on_errors: 3 ``` ### Controller scaling For large-scale deployments (hundreds of jobs): ```bash # Increase controller memory sky jobs launch --controller-resources memory=32 ``` ### Static credentials Use non-expiring credentials for controllers: ```bash # AWS: Use IAM role or long-lived access keys # GCP: Use service account JSON key # Azure: Use service principal ``` ## Advanced File Mounts ### Git repository workdir ```yaml workdir: url: https://github.com/user/repo.git ref: main # For private repos, set GIT_TOKEN env var ``` ### Multiple storage backends ```yaml file_mounts: /data/s3: source: s3://my-bucket/data mode: MOUNT /data/gcs: source: gs://my-bucket/data mode: MOUNT /outputs: name: training-outputs store: s3 mode: MOUNT_CACHED ``` ### Rsync exclude patterns ```yaml workdir: . # Use .skyignore or .gitignore for excludes ``` Create `.skyignore`: ``` __pycache__/ *.pyc .git/ .env node_modules/ ``` ## Distributed Training Patterns ### PyTorch DDP ```yaml num_nodes: 4 resources: accelerators: A100:8 run: | torchrun --nnodes=$SKYPILOT_NUM_NODES --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE --node_rank=$SKYPILOT_NODE_RANK --master_addr=$(echo "$SKYPILOT_NODE_IPS" | head -n1) --master_port=12355 train.py ``` ### DeepSpeed ```yaml num_nodes: 4 resources: accelerators: A100:8 setup: | pip install deepspeed run: | # Create hostfile echo "$SKYPILOT_NODE_IPS" | awk '{print $1 " slots=8"}' > /tmp/hostfile deepspeed --hostfile=/tmp/hostfile --num_nodes=$SKYPILOT_NUM_NODES --num_gpus=$SKYPILOT_NUM_GPUS_PER_NODE train.py --deepspeed ds_config.json ``` ### Ray Train ```yaml num_nodes: 4 resources: accelerators: A100:8 run: | # Head node starts Ray head if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then ray start --head --port=6379 # Wait for workers sleep 30 python train_ray.py else ray start --address=$(echo "$SKYPILOT_NODE_IPS" | head -n1):6379 fi ``` ## Sky Serve Advanced ### Multi-replica serving ```yaml service: readiness_probe: path: /health initial_delay_seconds: 60 period_seconds: 10 replica_policy: min_replicas: 2 max_replicas: 20 target_qps_per_replica: 5.0 upscale_delay_seconds: 60 downscale_delay_seconds: 300 load_balancing_policy: round_robin # or least_connections ``` ### Blue-green deployment ```bash # Deploy new version sky serve up -n my-service-v2 service_v2.yaml # Test new version curl https://my-service-v2.skypilot.cloud/health # Switch traffic (update DNS/load balancer) # Then terminate old version sky serve down my-service-v1 ``` ### Service with multiple accelerator options ```yaml service: replica_policy: min_replicas: 1 max_replicas: 5 resources: accelerators: L40S: 1 A100: 1 A10G: 1 any_of: - cloud: aws - cloud: gcp ``` ## Cost Optimization ### Spot instance strategies ```yaml resources: accelerators: A100:8 use_spot: true spot_recovery: FAILOVER # or FAILOVER_NO_WAIT # Always checkpoint for spot jobs file_mounts: /checkpoints: name: spot-checkpoints store: s3 mode: MOUNT_CACHED ``` ### Reserved instance hints ```yaml resources: accelerators: A100:8 # SkyPilot considers reserved instances in cost calculation ``` ### Budget constraints ```bash # Dry run to see cost estimate sky launch task.yaml --dryrun # Set max cluster cost (future feature) # sky launch task.yaml --max-cost-per-hour 50 ``` ## Kubernetes Integration ### Using existing clusters ```bash # Configure kubeconfig export KUBECONFIG=~/.kube/config # Verify sky check kubernetes ``` ### Pod configuration ```yaml resources: cloud: kubernetes accelerators: A100:1 config: kubernetes: pod_config: spec: runtimeClassName: nvidia tolerations: - key: "nvidia.com/gpu" operator: "Exists" effect: "NoSchedule" ``` ### Multi-cluster ```yaml resources: any_of: - cloud: kubernetes infra: cluster1 - cloud: kubernetes infra: cluster2 - cloud: aws ``` ## API Server Deployment ### Team setup ```bash # Start API server sky api serve --host 0.0.0.0 --port 8000 # Connect clients sky api login --endpoint https://your-server:8000 ``` ### Authentication ```bash # Create service account sky api create-service-account my-service # Use token in CI/CD export SKYPILOT_API_TOKEN=... sky launch task.yaml ``` ## Advanced CLI Patterns ### Parallel cluster operations ```bash # Launch multiple clusters in parallel for i in {1..10}; do sky launch -c cluster-$i task.yaml --detach & done wait ``` ### Batch job submission ```bash # Submit many jobs for config in configs/*.yaml; do name=$(basename $config .yaml) sky jobs launch -n $name $config done # Monitor all jobs sky jobs queue ``` ### Conditional execution ```yaml run: | # Only run on head node if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then python main.py else python worker.py fi ``` ## Environment Management ### Environment variables ```yaml envs: WANDB_PROJECT: my-project HF_TOKEN: $HF_TOKEN # Inherit from local CUDA_VISIBLE_DEVICES: "0,1,2,3" # Secrets (hidden in logs) secrets: - WANDB_API_KEY - HF_TOKEN ``` ### Config overrides ```yaml config: # Override global config jobs: controller: resources: memory: 32 ``` ## Monitoring and Observability ### Log streaming ```bash # Stream logs sky logs mycluster # Follow specific job sky logs mycluster 1 # Managed job logs sky jobs logs my-job --follow ``` ### Integration with W&B/MLflow ```yaml envs: WANDB_API_KEY: $WANDB_API_KEY WANDB_PROJECT: my-project run: | wandb login $WANDB_API_KEY python train.py --wandb ``` ## Debugging ### SSH access ```bash # SSH to head node ssh mycluster # SSH to worker node ssh mycluster-worker1 # Port forwarding ssh -L 8080:localhost:8080 mycluster ``` ### Interactive debugging ```bash # Launch interactive cluster sky launch -c debug --gpus A100:1 # SSH and debug ssh debug ``` ### Job inspection ```bash # View job queue sky queue mycluster # Cancel specific job sky cancel mycluster 1 # View job details sky logs mycluster 1 ```

Source: claude-code-templates (MIT). See About Us for full credits.

BAGUA AI