Free & Open Source

Evaluation framework
for evolving agents

Fully local with SQLite. Auto-capture LLM calls with one import. 130+ built-in metrics. GEPA-powered calibration. One command to run it all.

# Add @eval to auto-trace LLM calls to SQLite
from evalyn_sdk import eval

@eval(project="myproj", version="v1")
def my_agent(user_input: str) -> str:
    response = call_llm(user_input)
    return response

# All LLM calls auto-captured with OpenTelemetry
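The SDK's internals aren't shown here, but the shape of such a decorator can be sketched in plain Python. Everything below is illustrative: `traced` is a toy stand-in (the real SDK persists to SQLite via OpenTelemetry rather than an in-memory list), and the echo body stands in for an actual LLM call.

```python
import functools
import time

def traced(project: str, version: str, sink: list):
    """Toy stand-in for an @eval-style decorator: records each call's
    metadata into `sink` (the real SDK writes traces to SQLite)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            sink.append({
                "function": fn.__name__,
                "project": project,
                "version": version,
                "status": "OK",
                "duration_ms": (time.perf_counter() - start) * 1000,
            })
            return result
        return wrapper
    return decorator

calls = []

@traced(project="myproj", version="v1", sink=calls)
def my_agent(user_input: str) -> str:
    return f"echo: {user_input}"  # stand-in for call_llm(user_input)

my_agent("hello")
```

The key design point: the agent function itself stays untouched; all capture happens in the wrapper, so instrumenting existing code is a one-line change.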
# List recent traced LLM calls from local SQLite
$ evalyn list-calls --project myproj --limit 5
id | function | project | version | status | duration_ms
----------------------------------------------------------------
a3f2.. | my_agent | myproj | v1 | OK | 142.35
b7c1.. | my_agent | myproj | v1 | OK | 89.21
d4e8.. | my_agent | myproj | v1 | OK | 210.54
Hint: evalyn show-call --id a3f2..
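Because traces live in a local SQLite file, they are queryable with nothing beyond the standard library. A minimal sketch, assuming a hypothetical table layout (the table name and columns here are illustrative, not Evalyn's actual schema):

```python
import sqlite3

# Illustrative schema only -- the real trace store's layout may differ.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE calls (
        id TEXT PRIMARY KEY,
        function TEXT, project TEXT, version TEXT,
        status TEXT, duration_ms REAL
    )
""")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a3f2", "my_agent", "myproj", "v1", "OK", 142.35),
        ("b7c1", "my_agent", "myproj", "v1", "OK", 89.21),
    ],
)

# Same spirit as `evalyn list-calls --project myproj --limit 5`
rows = conn.execute(
    "SELECT id, function, duration_ms FROM calls "
    "WHERE project = ? ORDER BY duration_ms DESC LIMIT 5",
    ("myproj",),
).fetchall()
```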
# LLM analyzes your code and traces to suggest metrics
$ evalyn suggest-metrics --project myproj --mode llm-registry
Found 24 traces for project 'myproj'
Analyzing with LLM...
- latency_ms [objective] :: Measure execution latency
- response_relevance [subjective] :: Answer accuracy
- factual_accuracy [subjective] :: Factually correct
- toxicity_safety [subjective] :: Safe output check
# Run evaluation on dataset with selected metrics
$ evalyn run-eval --dataset data/myproj/
Results:
Metric Type Count Avg Score Pass Rate
--------------------------------------------------------------
latency_ms [obj] 50 142.30 -
response_relevance [sub] 50 0.87 87%
helpfulness_accuracy [sub] 50 0.92 92%
--------------------------------------------------------------
# Full 7-step pipeline with one command
$ evalyn one-click --project myproj
[1/7] Building Dataset
[2/7] Suggesting Metrics
[3/7] Running Initial Evaluation
[4/7] Human Annotation (interactive)
[5/7] Calibrate LLM Judges (GEPA)
[6/7] Re-evaluate with Calibrated Prompts
Pipeline complete! data/myproj-20250113/
# Group failures by reason using semantic clustering
$ evalyn cluster-failures --run abc123
Analyzing 23 failures from run abc123
Clustering with semantic similarity...
Cluster 1 (12 failures) — Hallucinated references
→ Citations to non-existent sources
Cluster 2 (7 failures) — Context window exceeded
→ Truncated or incomplete responses
Cluster 3 (4 failures) — Format violations
→ JSON parsing errors in output
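The real clustering uses embedding-based semantic similarity; as a self-contained toy, token-overlap (Jaccard) similarity with greedy grouping illustrates the idea. The failure strings and threshold below are made up for the example.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a cheap stand-in for embedding cosine similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def cluster_failures(reasons, threshold=0.3):
    """Greedy single-pass clustering: attach each failure to the first
    cluster whose representative is similar enough, else start a new one."""
    clusters = []  # clusters[i][0] is the cluster's representative
    for reason in reasons:
        for cluster in clusters:
            if jaccard(reason, cluster[0]) >= threshold:
                cluster.append(reason)
                break
        else:
            clusters.append([reason])
    return clusters

failures = [
    "citation to non-existent source in answer",
    "citation to non-existent paper in answer",
    "response truncated at context window limit",
    "JSON parsing error in output",
]
clusters = cluster_failures(failures)
```

The two citation failures share most of their tokens and merge; the truncation and JSON failures share almost nothing with them and each start their own cluster.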
# Compare metrics between two evaluation runs
$ evalyn compare --baseline abc123 --candidate def456
Metric Baseline Candidate Delta
------------------------------------------------------
response_relevance 0.82 0.89 +8.5%
factual_accuracy 0.78 0.85 +9.0%
latency_ms 142 128 -9.9%
toxicity_safety 0.99 0.99 +0.0%
------------------------------------------------------
Overall: Candidate outperforms baseline on 3/4 metrics
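The delta column is plain relative change against the baseline. A sketch of the comparison math, using the numbers from the table above:

```python
def pct_delta(baseline: float, candidate: float) -> str:
    """Relative change of candidate vs. baseline, formatted like the table."""
    delta = (candidate - baseline) / baseline * 100
    return f"{delta:+.1f}%"

baseline = {"response_relevance": 0.82, "factual_accuracy": 0.78,
            "latency_ms": 142, "toxicity_safety": 0.99}
candidate = {"response_relevance": 0.89, "factual_accuracy": 0.85,
             "latency_ms": 128, "toxicity_safety": 0.99}

deltas = {m: pct_delta(baseline[m], candidate[m]) for m in baseline}
```

Note that for `latency_ms` a negative delta is the improvement, which is why the overall verdict counts it as a win.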
# Export evaluation results to interactive HTML report
$ evalyn export --run abc123 --format html
Generating interactive report...
• Including 50 test cases
• Rendering 4 metric charts
• Embedding failure analysis
Report saved to reports/abc123.html
Open in browser: file:///reports/abc123.html
Formats: html, json, csv, markdown
01

Fully Local

All data stays on your machine. SQLite storage with zero cloud dependencies.

02

130+ Metrics

Built-in metric bank with 73 objective and 60 LLM judge metrics, plus 17 domain bundles.

03

Auto Calibration

Align LLM judges with human feedback through automatic prompt optimization.

04

One Command

Run the entire evaluation pipeline with a single CLI invocation.

Agents without evals are unsustainable

01

Debugging in the dark

No insight into what's happening inside your agents. When something fails, you're left guessing.

02

Privacy concerns with cloud tools

Sending prompts and customer data to third-party platforms creates compliance risks and data exposure.

03

Manual testing doesn't scale

You can't spot-check every response. As agent complexity grows, quality assurance falls behind.

Evalyn changes this.

Full local tracing. 130+ automated metrics. LLM judges calibrated to your standards. One command to run it all.

Works with your stack

Native support for popular LLM frameworks and providers.

OpenAI
Anthropic
Google Gemini
LangChain
LangGraph
Google ADK
Ollama

+ OpenTelemetry auto-instrumentation for any LLM provider

Metric bundles ready to evaluate specific agent types.

Chatbots
coherence engagement context-retention natural-flow
RAG / Q&A
faithfulness context-relevance groundedness answer-completeness
Code Assistants
correctness security efficiency best-practices
Data Analysts
query-correctness insight-validity visualization-clarity statistical-rigor
Customer Support
resolution-rate empathy policy-compliance escalation-detection
Medical Advisors
clinical-accuracy safety-disclaimers guideline-adherence referral-detection
Education Tutors
explanation-clarity scaffolding misconception-detection encouragement
HR Assistants
policy-compliance bias-detection confidentiality empathy
Legal Assistants
jurisdictional-accuracy citation-validity liability-warnings precedent-matching
Writing Assistants
clarity tone-consistency grammar style-adherence
Sales Agents
persuasion-ethics objection-handling product-accuracy lead-qualification
Financial Advisors
risk-disclosure regulatory-compliance suitability calculation-accuracy

The Evaluation Pipeline

01 Collect

Auto-capture traces with simple decorators and OpenTelemetry integration. Zero config needed.

evalyn build-dataset --project myapp
02 Evaluate

Smart metric suggestion based on your code. Choose from 130+ built-in objective metrics and LLM judges.

evalyn run-eval --dataset ./data --metrics ./metrics.json
03 Annotate

Human-in-the-loop feedback at both trace and span level. Build ground truth from real usage.

evalyn annotate --dataset ./data --per-metric
04 Calibrate

GEPA algorithm automatically optimizes LLM judge prompts to align with human preferences.

evalyn calibrate --metric-id safety --optimizer gepa
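GEPA optimizes the judge's prompt itself; the alignment objective it pursues can be illustrated with a much simpler knob, choosing the score cutoff that best agrees with human pass/fail labels. This is a toy sketch of the objective, not the GEPA algorithm, and all the scores and labels below are invented.

```python
def calibrate_threshold(judge_scores, human_labels):
    """Pick the cutoff on judge scores that maximizes agreement with
    human pass/fail labels -- a toy stand-in for prompt optimization."""
    best_t, best_acc = 0.0, -1.0
    for t in [s / 100 for s in range(0, 101)]:
        preds = [score >= t for score in judge_scores]
        acc = sum(p == h for p, h in zip(preds, human_labels)) / len(human_labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc

# Hypothetical judge scores and human pass/fail annotations
scores = [0.95, 0.80, 0.62, 0.40, 0.20]
labels = [True, True, True, False, False]
threshold, agreement = calibrate_threshold(scores, labels)
```

In the real pipeline the "knob" is the judge prompt, and agreement is measured against the annotations collected in the previous step.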
05 Simulate

Generate synthetic datasets and run agent simulations at scale to expand evaluation coverage.

evalyn simulate --dataset ./data --target app:agent