Free & Open Source

Evaluation framework
for evolving agents

Fully local with SQLite. Auto-capture LLM calls with one import. 130+ built-in metrics. GEPA-powered calibration. One command to run it all.

# Add @eval to auto-trace LLM calls to SQLite
from evalyn_sdk import eval

@eval(project="myproj", version="v1")
def my_agent(user_input: str) -> str:
    response = call_llm(user_input)
    return response

# All LLM calls auto-captured with OpenTelemetry
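The SDK's internals aren't shown here, but the shape of such a decorator can be sketched in plain Python. Everything below is illustrative: `traced` is a toy stand-in (the real SDK persists to SQLite via OpenTelemetry rather than an in-memory list), and the echo body stands in for an actual LLM call.

```python
import functools
import time

def traced(project: str, version: str, sink: list):
    """Toy stand-in for an @eval-style decorator: records each call's
    metadata into `sink` (the real SDK writes traces to SQLite)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            sink.append({
                "function": fn.__name__,
                "project": project,
                "version": version,
                "status": "OK",
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return wrapper
    return decorator

calls = []

@traced(project="myproj", version="v1", sink=calls)
def my_agent(user_input: str) -> str:
    return f"echo: {user_input}"  # stand-in for call_llm(user_input)

my_agent("hello")
```

The key design point: the agent function itself stays untouched; all capture happens in the wrapper, so instrumenting existing code is a one-line change.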
# List recent traced LLM calls from local SQLite
$ evalyn list-calls --project myproj --limit 5
id | function | project | version | status | duration_ms
----------------------------------------------------------------
a3f2.. | my_agent | myproj | v1 | OK | 142.35
b7c1.. | my_agent | myproj | v1 | OK | 89.21
d4e8.. | my_agent | myproj | v1 | OK | 210.54
Hint: evalyn show-call --id a3f2..
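Because traces live in a local SQLite file, they are queryable with nothing beyond the standard library. A minimal sketch, assuming a hypothetical table layout (the table name and columns here are illustrative, not Evalyn's actual schema):

```python
import sqlite3

# Illustrative schema only -- the real trace store's layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE calls (
        id TEXT PRIMARY KEY,
        function TEXT, project TEXT, version TEXT,
        status TEXT, duration_ms REAL
    )
""")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a3f2", "my_agent", "myproj", "v1", "OK", 142.35),
        ("b7c1", "my_agent", "myproj", "v1", "OK", 89.21),
    ],
)

# Same spirit as `evalyn list-calls --project myproj --limit 5`
rows = conn.execute(
    "SELECT id, function, duration_ms FROM calls "
    "WHERE project = ? ORDER BY duration_ms DESC LIMIT 5",
    ("myproj",),
).fetchall()
```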
# LLM analyzes your code and traces to suggest metrics
$ evalyn suggest-metrics --project myproj --mode llm-registry
Found 24 traces for project 'myproj'
Analyzing with LLM...
- latency_ms [objective] :: Measure execution latency
- response_relevance [subjective] :: Answer accuracy
- factual_accuracy [subjective] :: Factually correct
- toxicity_safety [subjective] :: Safe output check
# Run evaluation on dataset with selected metrics
$ evalyn run-eval --dataset data/myproj/
Results:
Metric Type Count Avg Score Pass Rate
--------------------------------------------------------------
latency_ms [obj] 50 142.30 -
response_relevance [sub] 50 0.87 87%
helpfulness_accuracy [sub] 50 0.92 92%
--------------------------------------------------------------
# Full 7-step pipeline with one command
$ evalyn one-click --project myproj
[1/7] Building Dataset
[2/7] Suggesting Metrics
[3/7] Running Initial Evaluation
[4/7] Human Annotation (interactive)
[5/7] Calibrate LLM Judges (GEPA)
[6/7] Re-evaluate with Calibrated Prompts
Pipeline complete! data/myproj-20250113/
# Group failures by reason using semantic clustering
$ evalyn cluster-failures --run abc123
Analyzing 23 failures from run abc123
Clustering with semantic similarity...
Cluster 1 (12 failures) — Hallucinated references
→ Citations to non-existent sources
Cluster 2 (7 failures) — Context window exceeded
→ Truncated or incomplete responses
Cluster 3 (4 failures) — Format violations
→ JSON parsing errors in output
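The real clustering uses embedding-based semantic similarity; as a self-contained toy, token-overlap (Jaccard) similarity with greedy grouping illustrates the idea. The failure strings and threshold below are made up for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a cheap stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster_failures(reasons, threshold=0.3):
    """Greedy single-pass clustering: attach each failure to the first
    cluster whose representative is similar enough, else start a new one."""
    clusters = []  # clusters[i][0] is the cluster's representative
    for reason in reasons:
        for cluster in clusters:
            if jaccard(reason, cluster[0]) >= threshold:
                cluster.append(reason)
                break
        else:
            clusters.append([reason])
    return clusters

failures = [
    "citation to non-existent source in answer",
    "citation to non-existent paper in answer",
    "response truncated at context window limit",
    "JSON parsing error in output",
]
clusters = cluster_failures(failures)
```

The two citation failures share most of their tokens and merge; the truncation and JSON failures share almost nothing with them and each start their own cluster.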
# Compare metrics between two evaluation runs
$ evalyn compare --baseline abc123 --candidate def456
Metric Baseline Candidate Delta
------------------------------------------------------
response_relevance 0.82 0.89 +8.5%
factual_accuracy 0.78 0.85 +9.0%
latency_ms 142 128 -9.9%
toxicity_safety 0.99 0.99 +0.0%
------------------------------------------------------
Overall: Candidate outperforms baseline on 3/4 metrics
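The delta column is plain relative change against the baseline. A sketch of the comparison math, using the numbers from the table above:

```python
def pct_delta(baseline: float, candidate: float) -> str:
    """Relative change of candidate vs. baseline, formatted like the table."""
    delta = (candidate - baseline) / baseline * 100
    return f"{delta:+.1f}%"

baseline = {"response_relevance": 0.82, "factual_accuracy": 0.78,
            "latency_ms": 142, "toxicity_safety": 0.99}
candidate = {"response_relevance": 0.89, "factual_accuracy": 0.85,
             "latency_ms": 128, "toxicity_safety": 0.99}

deltas = {m: pct_delta(baseline[m], candidate[m]) for m in baseline}
```

Note that for `latency_ms` a negative delta is the improvement, which is why the overall verdict counts it as a win.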
# Export evaluation results to interactive HTML report
$ evalyn export --run abc123 --format html
Generating interactive report...
• Including 50 test cases
• Rendering 4 metric charts
• Embedding failure analysis
Report saved to reports/abc123.html
Open in browser: file:///reports/abc123.html
Formats: html, json, csv, markdown
01

Fully Local

All data stays on your machine. SQLite storage with zero cloud dependencies.

02

130+ Metrics

Built-in metric bank with 73 objective and 60 LLM judge metrics, plus 17 domain bundles.

03

Auto Calibration

Align LLM judges with human feedback through automatic prompt optimization.

04

One Command

Run the entire evaluation pipeline with a single CLI invocation.

Agents without evals are unsustainable

01

Debugging in the dark

No insight into what's happening inside your agents. When something fails, you're left guessing.

02

Privacy concerns with cloud tools

Sending prompts and customer data to third-party platforms creates compliance risks and data exposure.

03

Manual testing doesn't scale

You can't spot-check every response. As agent complexity grows, quality assurance falls behind.

Evalyn changes this.

Full local tracing. 130+ automated metrics. LLM judges calibrated to your standards. One command to run it all.

Works with your stack

Native support for popular LLM frameworks and providers.

OpenAI
Anthropic
Google Gemini
LangChain
LangGraph
Google ADK
Ollama

+ OpenTelemetry auto-instrumentation for any LLM provider

Metric bundles ready to evaluate specific agent types.

Chatbots
coherence engagement context-retention natural-flow
RAG / Q&A
faithfulness context-relevance groundedness answer-completeness
Code Assistants
correctness security efficiency best-practices
Data Analysts
query-correctness insight-validity visualization-clarity statistical-rigor
Customer Support
resolution-rate empathy policy-compliance escalation-detection
Medical Advisors
clinical-accuracy safety-disclaimers guideline-adherence referral-detection
Education Tutors
explanation-clarity scaffolding misconception-detection encouragement
HR Assistants
policy-compliance bias-detection confidentiality empathy
Legal Assistants
jurisdictional-accuracy citation-validity liability-warnings precedent-matching
Writing Assistants
clarity tone-consistency grammar style-adherence
Sales Agents
persuasion-ethics objection-handling product-accuracy lead-qualification
Financial Advisors
risk-disclosure regulatory-compliance suitability calculation-accuracy

The Evaluation Pipeline

01 Collect

Auto-capture traces with simple decorators and OpenTelemetry integration. Zero config needed.

evalyn build-dataset --project myapp
02 Evaluate

Smart metric suggestion based on your code. Choose from 130+ built-in objective metrics and LLM judges.

evalyn run-eval --dataset ./data --metrics ./metrics.json
03 Annotate

Human-in-the-loop feedback at both trace and span level. Build ground truth from real usage.

evalyn annotate --dataset ./data --per-metric
04 Calibrate

GEPA algorithm automatically optimizes LLM judge prompts to align with human preferences.

evalyn calibrate --metric-id safety --optimizer gepa
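GEPA optimizes the judge's prompt itself; the alignment objective it pursues can be illustrated with a much simpler knob, choosing the score cutoff that best agrees with human pass/fail labels. This is a toy sketch of the objective, not the GEPA algorithm, and all the scores and labels below are invented.

```python
def calibrate_threshold(judge_scores, human_labels):
    """Pick the cutoff on judge scores that maximizes agreement with
    human pass/fail labels -- a toy stand-in for prompt optimization."""
    best_t, best_acc = 0.0, -1.0
    for t in [s / 100 for s in range(0, 101)]:
        preds = [score >= t for score in judge_scores]
        acc = sum(p == h for p, h in zip(preds, human_labels)) / len(human_labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical judge scores and human pass/fail annotations
scores = [0.95, 0.80, 0.62, 0.40, 0.20]
labels = [True, True, True, False, False]
threshold, agreement = calibrate_threshold(scores, labels)
```

In the real pipeline the "knob" is the judge prompt, and agreement is measured against the annotations collected in the previous step.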
05 Simulate

Generate synthetic datasets and run agent simulations at scale to expand evaluation coverage.

evalyn simulate --dataset ./data --target app:agent