# Execution Backends

NeMo Evaluator supports three execution backends: Local (Docker), Slurm (HPC), and Lepton (Cloud). Each backend implements the same interface but has different configuration requirements.

## Backend Architecture

```
┌──────────────────────────────────────────────────────────────┐
│                   nemo-evaluator-launcher                    │
│                                                              │
│  ┌────────────────┐  ┌────────────────┐  ┌────────────────┐  │
│  │ LocalExecutor  │  │ SlurmExecutor  │  │ LeptonExecutor │  │
│  │    (Docker)    │  │  (SSH+sbatch)  │  │  (Cloud API)   │  │
│  └────────┬───────┘  └────────┬───────┘  └────────┬───────┘  │
│           │                   │                   │          │
└───────────┼───────────────────┼───────────────────┼──────────┘
            │                   │                   │
            ▼                   ▼                   ▼
      ┌───────────┐       ┌───────────┐       ┌───────────┐
      │  Docker   │       │   Slurm   │       │ Lepton AI │
      │  Engine   │       │  Cluster  │       │ Platform  │
      └───────────┘       └───────────┘       └───────────┘
```

## Local Executor (Docker)

The local executor runs evaluation containers on your local machine using Docker.

### Prerequisites

- Docker installed and running
- `docker` command available in PATH
- GPU drivers and nvidia-container-toolkit for GPU tasks

### Configuration

```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results
  mode: sequential  # or parallel

  # Docker-specific options
  docker_args:
    - "--gpus=all"
    - "--shm-size=16g"

  # Container resource limits
  memory_limit: "64g"
  cpus: 8
```

### How It Works

1. Launcher reads `mapping.toml` to find container image for task
2. Creates run configuration and mounts volumes
3. Executes `docker run` via subprocess
4. Monitors stage files (`stage.pre-start`, `stage.running`, `stage.exit`)
5. Collects results from mounted output directory

### Example Usage

```bash
# Simple local evaluation
nemo-evaluator-launcher run --config-dir . --config-name local_config

# With GPU allocation
nemo-evaluator-launcher run --config-dir . --config-name local_config -o 'execution.docker_args=["--gpus=all"]'
```

### Status Tracking

Status is tracked via file markers in the output directory:

| File | Meaning |
|------|---------|
| `stage.pre-start` | Container starting |
| `stage.running` | Evaluation in progress |
| `stage.exit` | Evaluation complete |

## Slurm Executor

The Slurm executor submits evaluation jobs to HPC clusters via SSH.

### Prerequisites

- SSH access to cluster head node
- Slurm commands available (`sbatch`, `squeue`, `sacct`)
- NGC containers accessible from compute nodes
- Shared filesystem for results

### Configuration

```yaml
defaults:
  - execution: slurm
  - deployment: vllm  # or sglang, nim, none
  - _self_

execution:
  # SSH connection settings
  hostname: cluster.example.com
  username: myuser  # Optional, uses SSH config
  ssh_key_path: ~/.ssh/id_rsa

  # Slurm job settings
  account: my_account
  partition: gpu
  qos: normal
  nodes: 1
  gpus_per_node: 8
  cpus_per_task: 32
  memory: "256G"
  walltime: "04:00:00"

  # Output settings
  output_dir: /shared/nfs/results

  # Container settings
  container_mounts:
    - "/shared/data:/data:ro"
    - "/shared/models:/models:ro"
```

### Deployment Options

When running on Slurm, you can deploy models alongside evaluation:

```yaml
# vLLM deployment
deployment:
  type: vllm
  checkpoint_path: /models/llama-3.1-8b
  tensor_parallel_size: 4
  max_model_len: 8192
  gpu_memory_utilization: 0.9

# SGLang deployment
deployment:
  type: sglang
  checkpoint_path: /models/llama-3.1-8b
  tensor_parallel_size: 4

# NVIDIA NIM deployment
deployment:
  type: nim
  nim_model_name: meta/llama-3.1-8b-instruct
```

### Job Submission Flow

```
┌─────────────────┐
│  Launcher CLI   │
└────────┬────────┘
         │ SSH
         ▼
┌─────────────────┐
│  Cluster Head   │
│      Node       │
└────────┬────────┘
         │ sbatch
         ▼
┌─────────────────┐
│  Compute Node   │
│                 │
│ ┌─────────────┐ │
│ │ Deployment  │ │
│ │ Container   │ │
│ └─────────────┘ │
│        │        │
│        ▼        │
│ ┌─────────────┐ │
│ │ Evaluation  │ │
│ │ Container   │ │
│ └─────────────┘ │
└─────────────────┘
```

### Status Queries

The Slurm executor queries job status via `sacct`:

```bash
# Status command checks these Slurm states
sacct -j --format=JobID,State,ExitCode

# Mapped to ExecutionState:
# PENDING -> pending
# RUNNING -> running
# COMPLETED -> completed
# FAILED -> failed
# CANCELLED -> cancelled
```

### Long-Running Jobs

For long-running evaluations on Slurm, consider:

```yaml
execution:
  walltime: "24:00:00"  # Extended walltime

# Use caching to resume from interruptions
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: caching
          config:
            cache_dir: "/shared/cache"
            reuse_cached_responses: true
```

The caching interceptor helps resume interrupted evaluations by reusing previous API responses.

## Lepton Executor

The Lepton executor runs evaluations on Lepton AI's cloud platform.

### Prerequisites

- Lepton AI account
- `LEPTON_API_TOKEN` environment variable set
- `leptonai` Python package (auto-installed)

### Configuration

```yaml
defaults:
  - execution: lepton
  - deployment: none
  - _self_

execution:
  # Lepton job settings
  resource_shape: gpu.a100-80g
  num_replicas: 1

  # Environment
  env_vars:
    NGC_API_KEY: NGC_API_KEY
    HF_TOKEN: HF_TOKEN
```

### How It Works

1. Launcher creates Lepton job specification
2. Submits job via Lepton API
3. Optionally creates endpoint for model serving
4. Polls job status via API
5. Retrieves results when complete

### Endpoint Management

For evaluating Lepton-hosted models:

```yaml
target:
  api_endpoint:
    type: lepton
    deployment_name: my-llama-deployment
    # URL auto-generated from deployment
```

## Backend Selection Guide

| Use Case | Recommended Backend |
|----------|-------------------|
| Quick local testing | Local |
| Large-scale batch evaluation | Slurm |
| CI/CD pipeline | Local or Lepton |
| Multi-model comparison | Slurm (parallel jobs) |
| Cloud-native workflow | Lepton |
| Self-hosted model evaluation | Local or Slurm |

## Execution Database

All backends share the `ExecutionDB` for tracking jobs:

```
┌─────────────────────────────────────────────┐
│            ExecutionDB (SQLite)             │
│                                             │
│ invocation_id │ job_id │ status  │ backend  │
│ ──────────────────────────────────────────  │
│ inv_abc123    │ 12345  │ running │ slurm    │
│ inv_def456    │ cont_1 │ done    │ local    │
└─────────────────────────────────────────────┘
```

Query via CLI:

```bash
# List all invocations
nemo-evaluator-launcher ls runs

# Get specific invocation
nemo-evaluator-launcher info
```

## Troubleshooting

### Local Executor

**Issue: Docker permission denied**

```bash
sudo usermod -aG docker $USER
newgrp docker
```

**Issue: GPU not available in container**

```bash
# Install nvidia-container-toolkit
sudo apt-get install nvidia-container-toolkit
sudo systemctl restart docker
```

### Slurm Executor

**Issue: SSH connection fails**

```bash
# Test SSH connection
ssh -v cluster.example.com

# Check SSH key permissions
chmod 600 ~/.ssh/id_rsa
```

**Issue: Job stuck in pending**

```bash
# Check queue status
squeue -u $USER

# Check account limits
sacctmgr show associations user=$USER
```

### Lepton Executor

**Issue: API token invalid**

```bash
# Verify token
curl -H "Authorization: Bearer $LEPTON_API_TOKEN" https://api.lepton.ai/v1/jobs
```

**Issue: Resource shape unavailable**

```bash
# List available shapes
lepton shape list
```
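
The file-marker status tracking described for the local executor can be illustrated with a short Python sketch. It is not the launcher's own code: `STAGE_MARKERS` and `read_stage` are hypothetical names, and the sketch assumes that markers accumulate in the output directory (earlier markers are not deleted), so it checks for the latest stage first.

```python
from pathlib import Path

# Marker files from the Status Tracking table, ordered from latest to earliest stage.
# Assumption: earlier markers may still be present once a later one is written.
STAGE_MARKERS = [
    ("stage.exit", "Evaluation complete"),
    ("stage.running", "Evaluation in progress"),
    ("stage.pre-start", "Container starting"),
]


def read_stage(output_dir):
    """Return the meaning of the most advanced marker found, or None if no marker exists."""
    directory = Path(output_dir)
    for filename, meaning in STAGE_MARKERS:
        if (directory / filename).exists():
            return meaning
    return None
```

Calling `read_stage("./results")` against the `output_dir` from the local configuration would report the current stage of that run, provided the markers sit directly in that directory, which is also an assumption here.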
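
The Slurm state mapping listed under Status Queries can be sketched in the same way. This is a minimal illustration, not the launcher's implementation: it assumes `sacct` was run with `--parsable2 --noheader` and `--format=JobID,State,ExitCode`, so each line has three pipe-separated fields. The helper name `parse_sacct_states` and the `unknown` fallback for unlisted states are hypothetical.

```python
# Mapping taken from the Status Queries section of this document.
SLURM_TO_EXECUTION_STATE = {
    "PENDING": "pending",
    "RUNNING": "running",
    "COMPLETED": "completed",
    "FAILED": "failed",
    "CANCELLED": "cancelled",
}


def parse_sacct_states(sacct_output):
    """Map each JobID in pipe-separated sacct output to an execution state string."""
    states = {}
    for line in sacct_output.strip().splitlines():
        job_id, raw_state, _exit_code = line.split("|")
        # sacct may append detail such as "CANCELLED by 1000"; keep only the first word.
        words = raw_state.split()
        state_word = words[0] if words else ""
        # Assumption: states outside the documented mapping are reported as "unknown".
        states[job_id] = SLURM_TO_EXECUTION_STATE.get(state_word, "unknown")
    return states
```

Slurm reports more states than the five listed (for example `TIMEOUT`), which is why the sketch needs a fallback; how the launcher itself treats those states is not stated in this document.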

Source: claude-code-templates (MIT).
