# Execution Backends
NeMo Evaluator supports three execution backends: Local (Docker), Slurm (HPC), and Lepton (Cloud). Each backend implements the same interface but has different configuration requirements.
## Backend Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                   nemo-evaluator-launcher                   │
│                                                             │
│  ┌───────────────┐  ┌───────────────┐  ┌────────────────┐   │
│  │ LocalExecutor │  │ SlurmExecutor │  │ LeptonExecutor │   │
│  │   (Docker)    │  │ (SSH+sbatch)  │  │  (Cloud API)   │   │
│  └───────┬───────┘  └───────┬───────┘  └───────┬────────┘   │
│          │                  │                  │            │
└──────────┼──────────────────┼──────────────────┼────────────┘
           │                  │                  │
           ▼                  ▼                  ▼
      ┌─────────┐       ┌───────────┐      ┌────────────┐
      │ Docker  │       │   Slurm   │      │ Lepton AI  │
      │ Engine  │       │  Cluster  │      │  Platform  │
      └─────────┘       └───────────┘      └────────────┘
```
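The idea of a shared interface can be sketched in Python as follows. This is a minimal illustration only: `BaseExecutor`, `submit`, `status`, and `InMemoryExecutor` are hypothetical names chosen for this example and are not taken from the launcher's source. The state values mirror the `ExecutionState` names listed under Status Queries.

```python
from abc import ABC, abstractmethod
from enum import Enum


class ExecutionState(Enum):
    """States named in this document's Slurm status mapping."""
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


class BaseExecutor(ABC):
    """Hypothetical common interface; not the launcher's real class."""

    @abstractmethod
    def submit(self, config: dict) -> str:
        """Start a run and return a backend-specific job ID."""

    @abstractmethod
    def status(self, job_id: str) -> ExecutionState:
        """Report the current state of a job."""


class InMemoryExecutor(BaseExecutor):
    """Toy backend that only records jobs, used to exercise the interface."""

    def __init__(self) -> None:
        self._jobs: dict[str, ExecutionState] = {}

    def submit(self, config: dict) -> str:
        job_id = f"job_{len(self._jobs) + 1}"
        self._jobs[job_id] = ExecutionState.PENDING
        return job_id

    def status(self, job_id: str) -> ExecutionState:
        return self._jobs[job_id]
```

Code that depends only on `BaseExecutor` would not need to know whether a job runs in Docker, on Slurm, or on Lepton; only the configuration passed to `submit` would differ.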
## Local Executor (Docker)
The local executor runs evaluation containers on your local machine using Docker.
### Prerequisites
- Docker installed and running
- `docker` command available in PATH
- GPU drivers and nvidia-container-toolkit for GPU tasks
### Configuration
```yaml
defaults:
  - execution: local
  - deployment: none
  - _self_
execution:
  output_dir: ./results
  mode: sequential  # or parallel
  # Docker-specific options
  docker_args:
    - "--gpus=all"
    - "--shm-size=16g"
  # Container resource limits
  memory_limit: "64g"
  cpus: 8
```
### How It Works
1. The launcher reads `mapping.toml` to find the container image for the task
2. It creates the run configuration and mounts volumes
3. It executes `docker run` via a subprocess
4. It monitors stage files (`stage.pre-start`, `stage.running`, `stage.exit`)
5. It collects results from the mounted output directory
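As a rough sketch of steps 2 and 3, the function below assembles a `docker run` argument list from the execution settings shown in the configuration above. The function name, the `/results` mount point inside the container, and the image name in the usage example are assumptions for illustration; the launcher's actual command may differ. Only standard `docker run` flags (`--rm`, `--memory`, `--cpus`, `-v`) are used.

```python
import os


def build_docker_command(image, output_dir, docker_args=(),
                         memory_limit=None, cpus=None):
    """Return an argv list for `docker run` (hypothetical helper)."""
    cmd = ["docker", "run", "--rm"]
    # Pass user-supplied options such as --gpus=all straight through.
    cmd.extend(docker_args)
    if memory_limit is not None:
        cmd.extend(["--memory", str(memory_limit)])
    if cpus is not None:
        cmd.extend(["--cpus", str(cpus)])
    # Bind mounts need an absolute host path; /results is an assumed target.
    host_dir = os.path.abspath(output_dir)
    cmd.extend(["-v", f"{host_dir}:/results"])
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    argv = build_docker_command(
        "example/eval:latest",  # placeholder image name
        "./results",
        docker_args=["--gpus=all", "--shm-size=16g"],
        memory_limit="64g",
        cpus=8,
    )
    print(" ".join(argv))
```

Returning a list rather than a shell string lets the caller pass it to `subprocess.run(argv)` without shell quoting concerns.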
### Example Usage
```bash
# Simple local evaluation
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name local_config
# With GPU allocation
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name local_config \
  -o 'execution.docker_args=["--gpus=all"]'
```
### Status Tracking
Status is tracked via file markers in the output directory:
| File | Meaning |
|------|---------|
| `stage.pre-start` | Container starting |
| `stage.running` | Evaluation in progress |
| `stage.exit` | Evaluation complete |
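A small helper can turn these markers into a status string. The sketch below assumes that markers accumulate in the output directory as the run progresses, so the most advanced marker present wins; that assumption, the function name, and the `"unknown"` fallback are illustrative and not confirmed by the launcher.

```python
from pathlib import Path

# Ordered from most to least advanced stage; labels follow the table above.
STAGE_MARKERS = (
    ("stage.exit", "Evaluation complete"),
    ("stage.running", "Evaluation in progress"),
    ("stage.pre-start", "Container starting"),
)


def read_stage(output_dir):
    """Return the meaning of the most advanced marker file found."""
    out = Path(output_dir)
    for marker, meaning in STAGE_MARKERS:
        if (out / marker).exists():
            return meaning
    return "unknown"  # no marker written yet (assumed fallback)
```

Calling `read_stage("./results")` on a directory that contains both `stage.pre-start` and `stage.running` would report the run as in progress.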
## Slurm Executor
The Slurm executor submits evaluation jobs to HPC clusters via SSH.
### Prerequisites
- SSH access to cluster head node
- Slurm commands available (`sbatch`, `squeue`, `sacct`)
- NGC containers accessible from compute nodes
- Shared filesystem for results
### Configuration
```yaml
defaults:
  - execution: slurm
  - deployment: vllm  # or sglang, nim, none
  - _self_
execution:
  # SSH connection settings
  hostname: cluster.example.com
  username: myuser  # Optional, uses SSH config
  ssh_key_path: ~/.ssh/id_rsa
  # Slurm job settings
  account: my_account
  partition: gpu
  qos: normal
  nodes: 1
  gpus_per_node: 8
  cpus_per_task: 32
  memory: "256G"
  walltime: "04:00:00"
  # Output settings
  output_dir: /shared/nfs/results
  # Container settings
  container_mounts:
    - "/shared/data:/data:ro"
    - "/shared/models:/models:ro"
```
### Deployment Options
When running on Slurm, you can deploy models alongside evaluation:
```yaml
# vLLM deployment
deployment:
  type: vllm
  checkpoint_path: /models/llama-3.1-8b
  tensor_parallel_size: 4
  max_model_len: 8192
  gpu_memory_utilization: 0.9
# SGLang deployment
deployment:
  type: sglang
  checkpoint_path: /models/llama-3.1-8b
  tensor_parallel_size: 4
# NVIDIA NIM deployment
deployment:
  type: nim
  nim_model_name: meta/llama-3.1-8b-instruct
```
### Job Submission Flow
```
┌─────────────────┐
│  Launcher CLI   │
└────────┬────────┘
         │ SSH
         ▼
┌─────────────────┐
│  Cluster Head   │
│      Node       │
└────────┬────────┘
         │ sbatch
         ▼
┌─────────────────┐
│  Compute Node   │
│                 │
│ ┌─────────────┐ │
│ │ Deployment  │ │
│ │  Container  │ │
│ └─────────────┘ │
│        │        │
│        ▼        │
│ ┌─────────────┐ │
│ │ Evaluation  │ │
│ │  Container  │ │
│ └─────────────┘ │
└─────────────────┘
```
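To make the `sbatch` step more concrete, the sketch below maps the Slurm job settings from the configuration section onto standard `#SBATCH` directives. The function name and the key-to-flag table are assumptions for illustration; the script the launcher actually generates is not shown in this document and will contain more than a header.

```python
# Config key -> standard sbatch option. The pairing is an assumption
# about how the launcher's settings would translate.
SBATCH_FLAGS = {
    "account": "--account",
    "partition": "--partition",
    "qos": "--qos",
    "nodes": "--nodes",
    "gpus_per_node": "--gpus-per-node",
    "cpus_per_task": "--cpus-per-task",
    "memory": "--mem",
    "walltime": "--time",
}


def sbatch_header(execution):
    """Render the #SBATCH header for the given execution settings."""
    lines = ["#!/bin/bash"]
    for key, flag in SBATCH_FLAGS.items():
        if key in execution:
            lines.append(f"#SBATCH {flag}={execution[key]}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(sbatch_header({
        "account": "my_account",
        "partition": "gpu",
        "nodes": 1,
        "gpus_per_node": 8,
        "walltime": "04:00:00",
    }))
```

Keys that are absent from the configuration are simply skipped, so cluster defaults apply for anything left unset.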
### Status Queries
The Slurm executor queries job status via `sacct`:
```bash
# Status command checks these Slurm states
sacct -j <job_id> --format=JobID,State,ExitCode
# Mapped to ExecutionState:
#   PENDING   -> pending
#   RUNNING   -> running
#   COMPLETED -> completed
#   FAILED    -> failed
#   CANCELLED -> cancelled
```
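The mapping above can be sketched as a small parser. It assumes `sacct` is invoked with the additional `--parsable2 --noheader` options so that fields are separated by `|`; the function name and the `"unknown"` fallback are hypothetical and not taken from the launcher.

```python
SLURM_TO_EXECUTION_STATE = {
    "PENDING": "pending",
    "RUNNING": "running",
    "COMPLETED": "completed",
    "FAILED": "failed",
    "CANCELLED": "cancelled",
}


def parse_sacct(output):
    """Map the first top-level job row of sacct output to a state string.

    Expects lines such as `12345|RUNNING|0:0` (JobID|State|ExitCode).
    """
    for line in output.splitlines():
        fields = line.strip().split("|")
        if len(fields) < 2:
            continue
        job_id, raw_state = fields[0], fields[1]
        # Skip job steps such as `12345.batch`; only the job row matters here.
        if "." in job_id:
            continue
        words = raw_state.split()
        if not words:
            continue
        # Slurm may report `CANCELLED by <uid>`; keep only the first word.
        return SLURM_TO_EXECUTION_STATE.get(words[0], "unknown")
    return "unknown"
```

Slurm reports more states than the five listed here (for example `TIMEOUT`); this sketch returns `"unknown"` for those rather than guessing how the launcher treats them.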
### Long-Running Jobs
For long-running evaluations on Slurm, consider:
```yaml
execution:
  walltime: "24:00:00"  # Extended walltime
# Use caching to resume from interruptions
target:
  api_endpoint:
    adapter_config:
      interceptors:
        - name: caching
          config:
            cache_dir: "/shared/cache"
            reuse_cached_responses: true
```
The caching interceptor helps resume interrupted evaluations by reusing previous API responses.
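The general technique can be illustrated with a small disk-backed cache keyed by a hash of the request. This is a conceptual sketch, not the interceptor's implementation: `ResponseCache`, `get_or_call`, and the one-JSON-file-per-request layout are assumptions made for this example.

```python
import hashlib
import json
from pathlib import Path


class ResponseCache:
    """Store one JSON file per distinct request under cache_dir."""

    def __init__(self, cache_dir):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, request):
        # Sort keys so logically equal requests hash identically.
        payload = json.dumps(request, sort_keys=True).encode("utf-8")
        return self.cache_dir / f"{hashlib.sha256(payload).hexdigest()}.json"

    def get_or_call(self, request, call):
        """Return the cached response, or call the API and cache the result."""
        path = self._path(request)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        response = call(request)
        path.write_text(json.dumps(response), encoding="utf-8")
        return response
```

After an interruption, a rerun that issues the same requests would read the stored responses from disk and only call the endpoint for requests it has not seen, which is why placing `cache_dir` on a shared filesystem matters on Slurm.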
## Lepton Executor
The Lepton executor runs evaluations on Lepton AI's cloud platform.
### Prerequisites
- Lepton AI account
- `LEPTON_API_TOKEN` environment variable set
- `leptonai` Python package (auto-installed)
### Configuration
```yaml
defaults:
  - execution: lepton
  - deployment: none
  - _self_
execution:
  # Lepton job settings
  resource_shape: gpu.a100-80g
  num_replicas: 1
  # Environment
  env_vars:
    NGC_API_KEY: NGC_API_KEY
    HF_TOKEN: HF_TOKEN
```
### How It Works
1. Launcher creates Lepton job specification
2. Submits job via Lepton API
3. Optionally creates endpoint for model serving
4. Polls job status via API
5. Retrieves results when complete
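Step 4 amounts to a polling loop. The sketch below shows one way to write it with the status lookup injected as a callable, so no Lepton API call is assumed; `wait_for_job`, its parameters, and the set of terminal states are hypothetical.

```python
import time

# States after which polling stops (assumed to match ExecutionState names).
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def wait_for_job(get_status, job_id, interval=10.0, timeout=3600.0,
                 sleep=time.sleep):
    """Poll get_status(job_id) until a terminal state or the timeout.

    get_status stands in for whatever call queries the backend.
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_status(job_id)
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {state!r} after {timeout}s")
        sleep(interval)
```

Injecting `get_status` and `sleep` keeps the loop testable without network access or real delays.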
### Endpoint Management
For evaluating Lepton-hosted models:
```yaml
target:
  api_endpoint:
    type: lepton
    deployment_name: my-llama-deployment
    # URL auto-generated from deployment
```
## Backend Selection Guide
| Use Case | Recommended Backend |
|----------|-------------------|
| Quick local testing | Local |
| Large-scale batch evaluation | Slurm |
| CI/CD pipeline | Local or Lepton |
| Multi-model comparison | Slurm (parallel jobs) |
| Cloud-native workflow | Lepton |
| Self-hosted model evaluation | Local or Slurm |
## Execution Database
All backends share the `ExecutionDB` for tracking jobs:
```
┌────────────────────────────────────────────────┐
│              ExecutionDB (SQLite)              │
│                                                │
│  invocation_id │ job_id │ status  │ backend    │
│  ──────────────┼────────┼─────────┼──────────  │
│  inv_abc123    │ 12345  │ running │ slurm      │
│  inv_def456    │ cont_1 │ done    │ local      │
└────────────────────────────────────────────────┘
```
Query via CLI:
```bash
# List all invocations
nemo-evaluator-launcher ls runs
# Get specific invocation
nemo-evaluator-launcher info <invocation_id>
```
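The four columns drawn in the ExecutionDB diagram could be modeled with Python's built-in `sqlite3` module as shown below. This is only a sketch of the idea: the table name `jobs`, the composite primary key, and the helper functions are assumptions, and the real `ExecutionDB` schema is not documented here.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (assumed) jobs table if missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS jobs (
               invocation_id TEXT NOT NULL,
               job_id        TEXT NOT NULL,
               status        TEXT NOT NULL,
               backend       TEXT NOT NULL,
               PRIMARY KEY (invocation_id, job_id)
           )"""
    )
    return conn


def upsert_job(conn, invocation_id, job_id, status, backend):
    """Insert a job row, or overwrite the existing row on a status change."""
    conn.execute(
        "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?)",
        (invocation_id, job_id, status, backend),
    )
    conn.commit()


def jobs_for(conn, invocation_id):
    """Return (job_id, status, backend) rows for one invocation."""
    cur = conn.execute(
        "SELECT job_id, status, backend FROM jobs "
        "WHERE invocation_id = ? ORDER BY job_id",
        (invocation_id,),
    )
    return cur.fetchall()
```

Because every backend would write rows of the same shape, a single query path can serve Local, Slurm, and Lepton jobs alike.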
## Troubleshooting
### Local Executor
**Issue: Docker permission denied**
```bash
sudo usermod -aG docker $USER
newgrp docker
```
**Issue: GPU not available in container**
```bash
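# Install nvidia-container-toolkit (Debian/Ubuntu)
sudo apt-get install nvidia-container-toolkit
# Register the NVIDIA runtime with Docker, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker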
```
### Slurm Executor
**Issue: SSH connection fails**
```bash
# Test SSH connection
ssh -v cluster.example.com
# Check SSH key permissions
chmod 600 ~/.ssh/id_rsa
```
**Issue: Job stuck in pending**
```bash
# Check queue status
squeue -u $USER
# Check account limits
sacctmgr show associations user=$USER
```
### Lepton Executor
**Issue: API token invalid**
```bash
# Verify token
curl -H "Authorization: Bearer $LEPTON_API_TOKEN" \
  https://api.lepton.ai/v1/jobs
```
**Issue: Resource shape unavailable**
```bash
# List available shapes
lepton shape list
```
Source: claude-code-templates (MIT).