# Run your first eval
End-to-end walkthrough of `reval run` with real flags and outputs.
Once your install is working, this page walks through the simplest possible full benchmark run. You'll pick a model, run a slice of the dataset against it, and explore the HTML report.
## Pick a model
REVAL uses friendly CLI handles defined in `evals/config.yaml`. Every handle maps to a `{provider, model_id}` pair. A non-exhaustive sample:
| Handle | Provider | Use case |
|---|---|---|
| `claude-haiku-3-5` | bedrock | Cheap default target |
| `claude-sonnet-4` | anthropic | Higher-capability target |
| `gpt-4o-mini` | openai | OpenAI reference |
| `nova-lite` | bedrock | Default judge |
| `titan-v2` | bedrock | Default embeddings |
| `gemma4-e2b-local` | ollama | Fully local, no cloud credentials |
Any entry in the catalog can also be used as the judge or embeddings backend via `--judge-model` / `--embeddings-model`. Roles are determined by the flag, not by where the entry sits in the YAML.
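As a rough illustration, a catalog entry in `evals/config.yaml` might look like the sketch below. Only the handle-to-`{provider, model_id}` mapping is documented above; the exact key names, nesting, and model ID strings here are assumptions, not REVAL's actual schema.

```yaml
# Hypothetical sketch of the model catalog in evals/config.yaml.
# Layout and model_id values are assumed; check your own config.
models:
  claude-haiku-3-5:
    provider: bedrock
    model_id: anthropic.claude-3-5-haiku   # assumed ID
  nova-lite:
    provider: bedrock
    model_id: amazon.nova-lite             # assumed ID
```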
## Run a small slice
A full run hits 54 evals × 5 categories × up to 3 provider calls each (up to 810 calls), which burns a non-trivial amount of credit. Start with a one-country, one-category slice:
```shell
reval run --model claude-haiku-3-5 --country us --category issue_framing
```

What the flags do:

- `--model` names the target model — the system under test.
- `--country` filters the dataset by country. Omit to run both.
- `--category` filters by eval category. Omit to run all five.
- `--judge-model` (optional) overrides the scoring judge. Defaults to `nova-lite` from `evals/config.yaml`.
- `--embeddings-model` (optional) overrides the embedding backend used for similarity-based scoring. Defaults to `titan-v2`.
Every run creates a timestamped directory under `results/`:
```text
results/claude-haiku-3-5_2026-04-14T12-00-00Z/
├── results.json   # full structured run data
├── report.html    # interactive dashboard (sortable table, charts)
└── report.md      # GitHub-renderable summary
```

## Read the report
Open `report.html` in a browser. The dashboard has four panels:
- Overall score — weighted mean across all completed evals.
- Per-category breakdown — scores for each of the five eval categories, with bands color-coded via the thresholds in `evals/config.yaml` (see Config reference).
- Result cards — one per eval, expandable to show the prompt, the model's response, the judge's reasoning (for rubric-scored categories), and the per-criterion rubric breakdown.
- Metadata footer — git SHA, judge model, embeddings model, and timestamp so runs are reproducible.
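The overall score panel is described above as a weighted mean across completed evals. A minimal sketch of that kind of aggregation is below; the record shape (`status`, `score`, `weight` keys) is hypothetical, not REVAL's actual results schema.

```python
# Minimal sketch of a weighted mean over completed evals only.
# Field names are assumptions for illustration.
def overall_score(evals):
    completed = [e for e in evals if e["status"] == "completed"]
    total_weight = sum(e["weight"] for e in completed)
    if total_weight == 0:
        return None  # nothing completed, so no score
    return sum(e["score"] * e["weight"] for e in completed) / total_weight

sample = [
    {"status": "completed", "score": 0.75, "weight": 1.0},
    {"status": "completed", "score": 0.25, "weight": 1.0},
    {"status": "error",     "score": 0.0,  "weight": 1.0},  # excluded
]
print(overall_score(sample))  # 0.5
```

Excluding incomplete evals from the denominator keeps a failed provider call from dragging the mean toward zero.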
`report.md` is a leaner Markdown version of the same data — useful for pasting into PRs or issues.

`results.json` is the machine-readable source of truth: every field in the HTML and Markdown reports is derived from it.
## View past runs in the leaderboard
REVAL ships a static leaderboard site that aggregates results across every run in `showcase/`. To preview it locally:
```shell
# Copy a completed run into showcase/ to make it visible to the
# leaderboard build.
cp -r results/claude-haiku-3-5_2026-04-14T12-00-00Z showcase/

# Rebuild the static site into public/
reval leaderboard build

# Serve public/ on a local HTTP server
python -m http.server --directory public 8000
```

Open http://localhost:8000 — you'll see the leaderboard table with your new run, and http://localhost:8000/docs is this tab.
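To picture what the leaderboard build aggregates, here is a small sketch that scans one `results.json` per run directory and reduces it to a ranked table. The directory layout and the `model` / `overall_score` field names are assumptions; the sample runs are fabricated for illustration only.

```python
# Hypothetical sketch of leaderboard-style aggregation over showcase/.
# Schema and layout are assumed, not REVAL's actual implementation.
import json
import tempfile
from pathlib import Path

showcase = Path(tempfile.mkdtemp())  # stand-in for showcase/

# Fabricated sample runs, for illustration only.
for name, score in [("claude-haiku-3-5_run1", 0.71), ("gpt-4o-mini_run1", 0.78)]:
    run_dir = showcase / name
    run_dir.mkdir()
    (run_dir / "results.json").write_text(
        json.dumps({"model": name.split("_")[0], "overall_score": score})
    )

# Collect one row per run, then rank highest score first.
rows = []
for run_dir in sorted(showcase.iterdir()):
    data = json.loads((run_dir / "results.json").read_text())
    rows.append((data["model"], data["overall_score"]))
rows.sort(key=lambda r: r[1], reverse=True)

for model, score in rows:
    print(f"{model}\t{score:.2f}")
```

Because each run directory is self-contained, copying a run into `showcase/` is all it takes to make it visible to a build step like this.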
See Viewing reports for a deeper walkthrough
of the HTML report anatomy.