# Custom Benchmark Integration
NeMo Evaluator supports adding custom benchmarks through Framework Definition Files (FDFs) and custom containers.
## Overview
Custom benchmarks are added through two mechanisms:
1. **Framework Definition Files (FDFs)**: YAML files that define evaluation tasks, commands, and output parsing
2. **Custom Containers**: Package your framework with nemo-evaluator for reproducible execution
> **Note**: NeMo Evaluator does not currently support programmatic harness APIs or custom metric implementations via Python classes. Customization is done through FDFs and containers.
## Framework Definition Files (FDFs)
FDFs are the primary way to add custom evaluations. An FDF declares framework metadata, default commands, and evaluation tasks.
### FDF Structure
```yaml
# framework_def.yaml
framework:
  name: my-custom-framework
  package_name: my_custom_eval

defaults:
  command: "python -m my_custom_eval.run --model-id {model_id} --task {task} --output-dir {output_dir}"

evaluations:
  - name: custom_task_1
    defaults:
      temperature: 0.0
      max_new_tokens: 512
      extra:
        custom_param: value
  - name: custom_task_2
    defaults:
      temperature: 0.7
      max_new_tokens: 1024
```
### Key FDF Components
**Framework section**:
- `name`: Human-readable name for your framework
- `package_name`: Python package name
**Defaults section**:
- `command`: The command template to execute your evaluation
- Placeholders: `{model_id}`, `{task}`, `{output_dir}` are substituted at runtime
**Evaluations section**:
- List of tasks with their default parameters
- Each task can override the framework defaults
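The placeholder substitution described above can be pictured with Python's `str.format`. This is only an illustration of the idea, using the command template from the example FDF; the sample values and the `render_command` helper are hypothetical, and NeMo Evaluator's own rendering may work differently.

```python
import shlex

# Command template copied from the example FDF above.
COMMAND_TEMPLATE = (
    "python -m my_custom_eval.run "
    "--model-id {model_id} --task {task} --output-dir {output_dir}"
)


def render_command(template: str, **values: str) -> list[str]:
    """Fill the placeholders and split the result into an argv list.

    Hypothetical helper for illustration; not part of NeMo Evaluator.
    """
    # Quote each value so spaces or shell metacharacters stay in one argument.
    quoted = {key: shlex.quote(value) for key, value in values.items()}
    return shlex.split(template.format(**quoted))


argv = render_command(
    COMMAND_TEMPLATE,
    model_id="my-model",
    task="custom_task_1",
    output_dir="./results/custom task",
)
print(argv)
```

Quoting each value before formatting keeps a path that contains spaces as a single argument once the command is split.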
### Output Parser
When creating a custom FDF, you need an output parser function that translates your framework's results into NeMo Evaluator's standard schema:
```python
# my_custom_eval/parser.py
import json
from pathlib import Path


def parse_output(output_dir: str) -> dict:
    """
    Parse evaluation results from output_dir.
    Returns dict with metrics in NeMo Evaluator format.
    """
    # Read your framework's output files
    results_file = Path(output_dir) / "results.json"
    with open(results_file) as f:
        raw_results = json.load(f)

    # Transform to standard schema
    return {
        "metrics": {
            "accuracy": raw_results["score"],
            "total_samples": raw_results["num_samples"]
        }
    }
```
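A quick way to check a parser like the one above is to run it against a throwaway results file. The sketch below repeats the parser so that it runs on its own; the sample values are made up, and the `results.json` layout (`score`, `num_samples`) is the one assumed by the example rather than a format NeMo Evaluator prescribes.

```python
import json
import tempfile
from pathlib import Path


def parse_output(output_dir: str) -> dict:
    """Same logic as the example parser above, repeated so this runs standalone."""
    with open(Path(output_dir) / "results.json") as f:
        raw_results = json.load(f)
    return {
        "metrics": {
            "accuracy": raw_results["score"],
            "total_samples": raw_results["num_samples"],
        }
    }


# Write a fake results file (made-up values) and parse it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = {"score": 0.75, "num_samples": 200}
    (Path(tmp_dir) / "results.json").write_text(json.dumps(sample))
    parsed = parse_output(tmp_dir)

print(parsed)
```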
## Custom Container Creation
Package your custom framework as a container for reproducibility.
### Dockerfile Example
```dockerfile
# Dockerfile
FROM python:3.10-slim
# Install nemo-evaluator
RUN pip install nemo-evaluator
# Install your custom framework
COPY my_custom_eval/ /opt/my_custom_eval/
RUN pip install /opt/my_custom_eval/
# Copy framework definition
COPY framework_def.yaml /opt/framework_def.yaml
# Set working directory
WORKDIR /opt
ENTRYPOINT ["python", "-m", "nemo_evaluator"]
```
### Build and Push
```bash
docker build -t my-registry/custom-eval:1.0 .
docker push my-registry/custom-eval:1.0
```
### Register in mapping.toml
Add your custom container to the task registry:
```toml
# Add to mapping.toml
[my-custom-framework]
container = "my-registry/custom-eval:1.0"
[my-custom-framework.tasks.chat.custom_task_1]
required_env_vars = []
[my-custom-framework.tasks.chat.custom_task_2]
required_env_vars = ["CUSTOM_API_KEY"]
```
## Using Custom Datasets
### Dataset Mounting
Mount proprietary datasets at runtime rather than baking them into containers:
```yaml
# config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

evaluation:
  tasks:
    - name: custom_task_1
      dataset_dir: /path/to/local/data
      dataset_mount_path: /data  # Optional, defaults to /datasets
```
The launcher mounts the dataset directory into the container and sets the `NEMO_EVALUATOR_DATASET_DIR` environment variable.
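Inside your framework, the mounted data can then be located through that variable. The sketch below assumes `NEMO_EVALUATOR_DATASET_DIR` holds the in-container mount path and falls back to `/datasets`, the default noted in the config comment; `resolve_dataset_dir` and the file name `custom.jsonl` are hypothetical.

```python
import os
from pathlib import Path


def resolve_dataset_dir(env=None) -> Path:
    """Return the dataset directory the launcher mounted into the container.

    Assumes NEMO_EVALUATOR_DATASET_DIR holds the in-container mount path and
    falls back to /datasets, the default mount path noted in the config.
    """
    env = os.environ if env is None else env
    return Path(env.get("NEMO_EVALUATOR_DATASET_DIR", "/datasets"))


dataset_dir = resolve_dataset_dir()
# "custom.jsonl" is a hypothetical file name used only for illustration.
data_file = dataset_dir / "custom.jsonl"
print(data_file)
```

Accepting the environment as an argument keeps the lookup easy to test without touching the real process environment.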
### Task-Specific Environment Variables
Pass environment variables to specific tasks:
```yaml
evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps to $HF_TOKEN from host
    - name: custom_task
      env_vars:
        CUSTOM_API_KEY: MY_CUSTOM_KEY
        DATA_PATH: /data/custom.jsonl
```
## Parameter Overrides
Override evaluation parameters at multiple levels:
### Global Overrides
Apply to all tasks:
```yaml
evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0
        max_new_tokens: 512
        parallelism: 4
        request_timeout: 300
```
### Task-Specific Overrides
Override for individual tasks:
```yaml
evaluation:
  tasks:
    - name: humaneval
      nemo_evaluator_config:
        config:
          params:
            temperature: 0.8
            max_new_tokens: 1024
            n_samples: 200  # Task-specific parameter
```
### CLI Overrides
Override at runtime:
```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10
```
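One way to reason about the three levels is as dictionaries layered on top of each other. The sketch below assumes that the more specific level wins (global, then task-specific, then CLI); that precedence is an assumption made for illustration and should be confirmed against the launcher's behavior, and `effective_params` is a hypothetical helper.

```python
def effective_params(global_params: dict, task_params: dict, cli_params: dict) -> dict:
    """Layer parameter dictionaries; later arguments override earlier ones.

    Assumed precedence (not confirmed by this guide): global < task < CLI.
    """
    return {**global_params, **task_params, **cli_params}


# Values taken from the examples in this section.
global_params = {
    "temperature": 0.0,
    "max_new_tokens": 512,
    "parallelism": 4,
    "request_timeout": 300,
}
task_params = {"temperature": 0.8, "max_new_tokens": 1024, "n_samples": 200}
cli_params = {"limit_samples": 10}

params = effective_params(global_params, task_params, cli_params)
print(params)
```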
## Testing Custom Benchmarks
### Dry Run
Validate configuration without execution:
```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name custom_config \
  --dry-run
```
### Limited Sample Testing
Test with a small subset first:
```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name custom_config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=5
```
### Check Results
```bash
# View results
cat results///artifacts/results.json
# Check logs
cat results///artifacts/logs/eval.log
```
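When several runs or tasks have written output, searching the output directory can be quicker than typing each path. This is a small sketch that collects every `results.json` below a directory; it makes no assumption about the intermediate directory names or the file's schema, and `collect_results` is a hypothetical helper.

```python
import json
from pathlib import Path


def collect_results(output_dir: str) -> dict:
    """Map each results.json found under output_dir to its parsed content."""
    found = {}
    for path in sorted(Path(output_dir).rglob("results.json")):
        with open(path) as f:
            found[str(path)] = json.load(f)
    return found


# "./results" matches execution.output_dir in the earlier config example.
for path, content in collect_results("./results").items():
    print(path)
    print(json.dumps(content, indent=2))
```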
## Best Practices
1. **Use FDFs**: Define custom benchmarks via Framework Definition Files
2. **Containerize**: Package frameworks as containers for reproducibility
3. **Mount data**: Use volume mounts for datasets instead of baking them into images
4. **Test incrementally**: Use `limit_samples` for quick validation
5. **Version containers**: Tag containers with semantic versions
6. **Document parameters**: Include clear documentation in your FDF
## Limitations
Currently **not supported**:
- Custom Python metric classes via plugin system
- Programmatic harness registration via Python API
- Runtime metric injection via configuration
Custom scoring logic must be implemented within your evaluation framework and exposed through the FDF's output parser.
## Example: Complete Custom Setup
```yaml
# custom_eval_config.yaml
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./custom_results

target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""

evaluation:
  nemo_evaluator_config:
    config:
      params:
        parallelism: 4
        request_timeout: 300
  tasks:
    - name: custom_task_1
      dataset_dir: /data/benchmarks
      env_vars:
        DATA_VERSION: v2
      nemo_evaluator_config:
        config:
          params:
            temperature: 0.0
            max_new_tokens: 256
```
Run with:
```bash
nemo-evaluator-launcher run \
  --config-dir . \
  --config-name custom_eval_config
```
Source: claude-code-templates (MIT).