# Install

Clone, install, and sanity-check your reval setup.
REVAL targets Python 3.10 or newer. You'll need a working git
install and credentials for at least one LLM provider: Bedrock,
Anthropic, OpenAI, MiniMax, or Ollama. Each provider is optional;
you only need keys for the ones you actually plan to use.
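Before cloning, a quick preflight with standard tooling confirms the baseline (these are plain `python`/`git` commands, nothing reval-specific):

```bash
python3 --version   # needs to report 3.10 or newer
git --version
```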
## Clone and install

```bash
git clone https://github.com/krishnakartik1/reval
cd reval
pip install -e ".[dev]"
cp .env.example .env
```

The `[dev]` extra pulls in pytest, ruff, black, mypy, and
pre-commit — everything needed to run the full test suite and
lint gates. If you want to build the static leaderboard docs tab
locally, also install the `[docs]` extra:
pip install -e ".[dev,docs]"[docs] adds markdown-it-py, mdit-py-plugins, and pygments —
only required when you run reval leaderboard build against a
populated docs/ directory.
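With the `[dev]` extra in place you can exercise the test and lint gates locally. The commands below are the tools' standard entry points; the exact hooks and targets depend on the repo's own pre-commit and CI configuration, so treat this as a sketch:

```bash
pre-commit install          # register the lint hooks with git (one-time)
pre-commit run --all-files  # run every configured hook (ruff, black, mypy, ...)
pytest                      # full test suite
```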
## Provider credentials

Open `.env` and fill in keys only for the providers you plan to use:

```bash
# AWS Bedrock (default for judge + embeddings)
AWS_ACCESS_KEY_ID=...
AWS_SECRET_ACCESS_KEY=...
AWS_REGION=us-east-1

# Anthropic direct API
ANTHROPIC_API_KEY=...

# OpenAI (or OpenAI-compatible endpoints)
OPENAI_API_KEY=...

# MiniMax
MINIMAX_API_KEY=...

# Ollama runs locally — no keys needed, just `ollama serve` on port 11434.
```

The LLM judge and embeddings default to Amazon Bedrock (nova-lite
and titan-v2 in `evals/config.yaml`). If you don't have AWS
credentials, point the judge and embeddings at any other registered
model via the `--judge-model` and `--embeddings-model` flags when you
run the benchmark.
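For example, a run that overrides both roles might look like the sketch below; all three model names are illustrative placeholders, so substitute models registered in `evals/config.yaml` for a provider you actually hold keys for:

```bash
# All model names below are placeholders; substitute models registered
# in evals/config.yaml for a provider you have credentials for.
reval run --model some-target-model \
  --judge-model some-judge-model \
  --embeddings-model some-embeddings-model \
  --country us --category issue_framing
```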
## Verify your install

```bash
reval --help
reval list-evals --country us --category issue_framing
```

The first command prints the top-level CLI help. The second enumerates
evals in the shipped dataset. If it returns rows without crashing,
your Python environment is wired up correctly. At this point you
haven't spent any API credit; `list-evals` only reads local files.
To exercise a provider end-to-end with a single prompt, run a one-eval slice:

```bash
reval run --model claude-haiku-3-5 --country us --category issue_framing --limit 1
```

This is the smallest possible paid run (one Bedrock call for the target
plus one for the judge). A green exit code proves your credentials and
network are good.
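If your shell doesn't surface exit codes, a plain-shell check immediately after the run does it (nothing reval-specific here):

```bash
echo $?   # 0 right after `reval run` means the target and judge calls succeeded
```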
## Next

- **Run your first eval** for a full walkthrough of the `reval run` flags
  and what each output file contains.
- **Methodology** explains what REVAL is actually measuring and why the
  scoring approach is novel.