Local-first & Privacy-focused

Evaluation framework for evolving agents

Fully local with SQLite. Auto-capture LLM calls with one import. 50+ built-in metrics. GEPA-powered calibration. One command to run it all.

# Add @eval to auto-trace LLM calls to SQLite
from evalyn_sdk import eval

@eval(project="myproj", version="v1")
def my_agent(user_input: str) -> str:
    # call_llm is your existing LLM wrapper; its calls are traced automatically
    response = call_llm(user_input)
    return response

# All LLM calls auto-captured with OpenTelemetry
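
Calling the decorated function works exactly as before; the decorator just records a trace for each invocation. A minimal usage sketch based on the snippet above (the prompt string is illustrative):

# Invoke the agent as usual; @eval records the call to the local SQLite store
answer = my_agent("Summarize today's error logs")
print(answer)
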
# List recent traced LLM calls from local SQLite
$ evalyn list-calls --project myproj --limit 5
id | function | project | version | status | duration_ms
----------------------------------------------------------------
a3f2.. | my_agent | myproj | v1 | OK | 142.35
b7c1.. | my_agent | myproj | v1 | OK | 89.21
d4e8.. | my_agent | myproj | v1 | OK | 210.54
Hint: evalyn show-call --id a3f2..
# Auto-suggest metrics from 50+ built-in templates
$ evalyn suggest-metrics --project myproj --mode basic
Found 24 traces for project 'myproj'
- latency_ms [objective] :: Measure execution latency
- llm_call_count [objective] :: Count LLM API calls
- response_relevance [subjective] :: Relevance of the answer to the query
- helpfulness_accuracy [subjective] :: Helpfulness and accuracy of the response
Hint: evalyn run-eval --dataset data/myproj/
# Run evaluation on dataset with selected metrics
$ evalyn run-eval --dataset data/myproj/
Results:
Metric                 Type    Count   Avg Score   Pass Rate
--------------------------------------------------------------
latency_ms             [obj]      50      142.30           -
response_relevance     [sub]      50        0.87         87%
helpfulness_accuracy   [sub]      50        0.92         92%
--------------------------------------------------------------
# Full 7-step pipeline with one command
$ evalyn one-click --project myproj
[1/7] Building Dataset
[2/7] Suggesting Metrics
[3/7] Running Initial Evaluation
[4/7] Human Annotation (interactive)
[5/7] Calibrate LLM Judges (GEPA)
[6/7] Re-evaluate with Calibrated Prompts
Pipeline complete! data/myproj-20250113/
01 Fully Local

All data stays on your machine. SQLite storage with zero cloud dependencies.
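
Because everything lives in a local SQLite file, you can inspect it with any SQLite client. A minimal sketch using Python's built-in sqlite3 module; the database filename here is an assumption, not Evalyn's documented path:

import sqlite3

# Illustrative only: the actual database path and table names may differ
conn = sqlite3.connect("evalyn.db")
for (table_name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(table_name)  # list the tables created locally
conn.close()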

02 50+ Metrics

Built-in metric bank with 30 objective and 22 LLM judge metrics.

03 Auto Calibration

Align LLM judges with human feedback through automatic prompt optimization.

04 One Command

Run the entire evaluation pipeline with a single CLI invocation.

The Evaluation Pipeline

From trace collection to continuous improvement.

01 Collect

Auto-capture traces with simple decorators and OpenTelemetry integration. Zero config needed.
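
Under the hood, decorator-based capture typically wraps the function in an OpenTelemetry span. A conceptual sketch using the opentelemetry-api package; this is how such a decorator generally works, not Evalyn's actual implementation:

import functools
from opentelemetry import trace

tracer = trace.get_tracer("eval-demo")

def traced(project: str, version: str):
    """Conceptual stand-in for @eval: wrap each call in an OpenTelemetry span."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(fn.__name__) as span:
                span.set_attribute("project", project)
                span.set_attribute("version", version)
                return fn(*args, **kwargs)
        return wrapper
    return decorator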

02 Evaluate

Smart metric suggestion based on your code. Choose from 50+ built-in metrics, spanning objective checks and LLM judges.
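
Objective metrics boil down to plain functions over captured traces. A hypothetical example; the trace fields shown are illustrative, not Evalyn's stored schema:

def latency_ms(trace: dict) -> float:
    # Assumes start/end timestamps in seconds (illustrative trace shape)
    return (trace["end_time"] - trace["start_time"]) * 1000.0

def llm_call_count(trace: dict) -> int:
    # Count spans tagged as LLM calls within one trace
    return sum(1 for span in trace.get("spans", []) if span.get("kind") == "llm")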

03 Annotate

Human-in-the-loop feedback at both trace and span level. Build ground truth from real usage.
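
An annotation pairs a trace, or an individual span within it, with a human label. A minimal illustrative record, not Evalyn's stored schema:

from dataclasses import dataclass
from typing import Optional

@dataclass
class Annotation:
    trace_id: str
    span_id: Optional[str]   # None means feedback on the whole trace
    label: str               # e.g. "helpful" / "not_helpful"
    comment: str = ""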

04 Calibrate

The GEPA algorithm automatically optimizes LLM judge prompts to align with human preferences.
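
The quantity calibration improves is agreement between the judge and human annotations. A small sketch of that agreement rate; the data shapes are illustrative:

def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    # Fraction of examples where the LLM judge matches the human annotation
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)

# Prompt optimization rewrites the judge prompt until this number improves
# on the annotated set.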

05 Simulate

Generate synthetic datasets and run agent simulations at scale to expand evaluation coverage.
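
Synthetic coverage usually starts from a handful of seed inputs that get paraphrased or perturbed. A hedged sketch, where call_llm is the same placeholder LLM wrapper used in the first snippet and the seeds are illustrative:

seed_inputs = [
    "Summarize this support ticket",
    "What changed between v1 and v2 of the agent?",
]

def expand_dataset(seeds: list[str], variants_per_seed: int = 3) -> list[str]:
    cases = []
    for seed in seeds:
        for _ in range(variants_per_seed):
            # Ask the LLM for a rephrased variant of each seed request
            cases.append(call_llm(f"Rewrite this request in a different style: {seed}"))
    return cases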