Evaluation framework for evolving agents
Fully local with SQLite. Auto-capture LLM calls with one import. 50+ built-in metrics. GEPA-powered calibration. One command to run it all.
Fully Local
All data stays on your machine. SQLite storage with zero cloud dependencies.
50+ Metrics
Built-in metric bank with 30 objective and 22 LLM judge metrics.
Auto Calibration
Align LLM judges with human feedback through automatic prompt optimization.
One Command
Run the entire evaluation pipeline with a single CLI invocation.
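To make the local-first idea concrete, here is a minimal sketch of scoring outputs with an objective metric and keeping the results in SQLite, using only Python's standard library. The `scores` table, its columns, and the `exact_match`, `open_store`, and `record_score` helpers are hypothetical names chosen for illustration; they are not the framework's actual schema or API.

```python
import json
import sqlite3


def exact_match(output: str, expected: str) -> float:
    """Objective metric: 1.0 if the normalized strings are equal, else 0.0."""
    return float(output.strip().lower() == expected.strip().lower())


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open a local SQLite store (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        "trace_id TEXT, metric TEXT, value REAL, payload TEXT)"
    )
    return conn


def record_score(conn, trace_id, metric, value, payload):
    """Persist one metric value together with the data it was computed from."""
    conn.execute(
        "INSERT INTO scores VALUES (?, ?, ?, ?)",
        (trace_id, metric, value, json.dumps(payload)),
    )
    conn.commit()


conn = open_store()  # pass a file path instead of ":memory:" to keep data on disk
examples = [("t1", "Paris", "paris "), ("t2", "Lyon", "Paris")]
for trace_id, output, expected in examples:
    record_score(
        conn,
        trace_id,
        "exact_match",
        exact_match(output, expected),
        {"output": output, "expected": expected},
    )

# Aggregate the metric across all stored traces.
mean_score = conn.execute(
    "SELECT AVG(value) FROM scores WHERE metric = ?", ("exact_match",)
).fetchone()[0]
print(mean_score)
```

Because everything lives in a single SQLite file, the results can be inspected with any SQLite client without a network connection.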
The Evaluation Pipeline
From trace collection to continuous improvement.
Auto-capture traces with simple decorators and OpenTelemetry integration. Zero config needed.
Get metric suggestions based on your code, then choose from 50+ built-in metrics, both objective and LLM-judge.
Human-in-the-loop feedback at both trace and span level. Build ground truth from real usage.
The GEPA algorithm automatically optimizes LLM judge prompts to align with human preferences.
Generate synthetic datasets and run agent simulations at scale to expand evaluation coverage.
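The capture and calibration steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the framework's API: `traced`, `TRACES`, `answer`, and `agreement_rate` are hypothetical names, and the code does not implement GEPA. It only shows how a decorator might record calls, and the kind of judge-versus-human agreement signal that a prompt optimizer could try to maximize.

```python
import functools
import time

TRACES = []  # hypothetical in-memory trace buffer


def traced(fn):
    """Record the name, inputs, output, and duration of each call to fn."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACES.append(
            {
                "name": fn.__name__,
                "args": args,
                "kwargs": kwargs,
                "output": result,
                "duration_s": time.perf_counter() - start,
            }
        )
        return result

    return wrapper


@traced
def answer(question: str) -> str:
    # Stand-in for a real LLM call.
    return question.upper()


def agreement_rate(judge_labels, human_labels):
    """Fraction of examples where the LLM judge matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


answer("hello")
answer("world")

# Made-up pass/fail verdicts for four traces, for illustration only.
judge = [True, False, True, True]
human = [True, True, True, False]
rate = agreement_rate(judge, human)
print(len(TRACES), rate)
```

A calibration loop would revise the judge prompt, re-score the same human-labeled traces, and keep the revision only if the agreement rate improves.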