Your AI agent is getting worse. Prove it isn't.

Open-source agent evaluation framework. Define expectations. Catch regressions before customers do. Enterprise-ready for regulated teams.

$ pip install rigr
$ rigr init
$ rigr test --agent my_agent.py

═══ Rigr Eval Report ═══
  5/5 cases | 23/24 fields | 95.8% 
  Baseline comparison: 0 new errors, 0 resolved
  ✓ PASS

<1s per test run

0 code changes in your agent

100% open source (Apache 2.0)

Agents break silently

Every model update, prompt change, or retrieval tweak risks degrading your agent. LLM eval tools test chat quality — they don't test whether your support agent still calculates refunds correctly or whether your compliance bot still catches policy violations.

Rigr tests what your agent does, not how it sounds.

How it works

Define expectations

JSON schema for what your agent must output. Field-level constraints. No ambiguous "looks good to me."
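
As a rough illustration of field-level constraints, the sketch below checks an agent's output against a small JSON-Schema-like spec. It is not Rigr's actual schema format or API: the field names (`refund_amount`, `currency`, `approved`), the helper functions, and the supported keywords (`type`, `minimum`, `enum`) are all assumptions chosen for a hypothetical refund agent.

```python
import json

# Hypothetical expectations for a refund agent (a simplified,
# JSON-Schema-like subset; not Rigr's real schema format).
SCHEMA = json.loads("""
{
  "refund_amount": {"type": "number", "minimum": 0},
  "currency": {"type": "string", "enum": ["USD", "EUR", "GBP"]},
  "approved": {"type": "boolean"}
}
""")

TYPES = {"number": (int, float), "string": str, "boolean": bool}


def field_ok(value, rule):
    """Check one value against one field rule."""
    if rule["type"] == "number" and isinstance(value, bool):
        return False  # bool is a subclass of int in Python, so reject it explicitly
    if not isinstance(value, TYPES[rule["type"]]):
        return False
    if "minimum" in rule and value < rule["minimum"]:
        return False
    if "enum" in rule and value not in rule["enum"]:
        return False
    return True


def check_fields(output):
    """Return {field: passed?} for every field named in SCHEMA."""
    return {
        name: name in output and field_ok(output[name], rule)
        for name, rule in SCHEMA.items()
    }


result = check_fields({"refund_amount": 42.5, "currency": "USD", "approved": "yes"})
print(result)
```

Here `approved` fails because the agent returned the string "yes" instead of a boolean, which is the kind of mismatch a field-level check reports explicitly.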

Write test cases

Inputs with expected outputs. Version-controlled, reviewable. The same cases run every time.
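
A minimal sketch of what "inputs with expected outputs" can look like in practice, assuming a hypothetical refund agent. The case layout, `stub_agent`, and `run_cases` are illustrative names, not Rigr's test-case format or runner.

```python
# Hypothetical test cases: each pairs an input with the fields the agent must return.
CASES = [
    {"input": {"order_total": 100.0, "days_since_purchase": 10},
     "expected": {"refund_amount": 100.0, "approved": True}},
    {"input": {"order_total": 80.0, "days_since_purchase": 45},
     "expected": {"refund_amount": 0.0, "approved": False}},
]


def stub_agent(inp):
    """Stand-in for a real agent: full refund within 30 days, otherwise none."""
    eligible = inp["days_since_purchase"] <= 30
    return {
        "refund_amount": inp["order_total"] if eligible else 0.0,
        "approved": eligible,
    }


def run_cases(agent, cases):
    """Run every case and count passing cases and passing fields."""
    cases_passed = fields_passed = fields_total = 0
    for case in cases:
        actual = agent(case["input"])
        matches = [actual.get(k) == v for k, v in case["expected"].items()]
        fields_passed += sum(matches)
        fields_total += len(matches)
        cases_passed += all(matches)  # a case passes only if every field matches
    return cases_passed, fields_passed, fields_total


passed, ok, total = run_cases(stub_agent, CASES)
print(f"{passed}/{len(CASES)} cases | {ok}/{total} fields")
```

Because the cases are plain data, they can live in version control next to the agent and be reviewed like any other change.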

Freeze baselines

Lock known-good results. Every future run compares against them. Regressions are caught in the test run, not discovered in production.
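
The baseline idea can be sketched as a comparison between two sets of failing fields. This is an assumption about the general technique, not Rigr's storage format or implementation; `failing_fields`, `compare_to_baseline`, and the result layout (case id mapped to per-field pass/fail) are hypothetical.

```python
import json


def failing_fields(results):
    """results maps case id -> {field: passed?}; return failing (case, field) pairs."""
    return {
        (case, field)
        for case, fields in results.items()
        for field, ok in fields.items()
        if not ok
    }


def compare_to_baseline(baseline, current):
    """Report failures that are new since the baseline and failures that went away."""
    before, now = failing_fields(baseline), failing_fields(current)
    return {"new_errors": sorted(now - before), "resolved": sorted(before - now)}


# "Freezing" here is just serializing a known-good run; a file on disk would work the same way.
frozen = json.dumps({
    "case_1": {"refund_amount": True, "approved": True},
    "case_2": {"refund_amount": False, "approved": True},
}, sort_keys=True)

current_run = {
    "case_1": {"refund_amount": True, "approved": False},
    "case_2": {"refund_amount": True, "approved": True},
}

diff = compare_to_baseline(json.loads(frozen), current_run)
print(diff)
```

In this example the run introduces one new error (`case_1`, `approved`) and resolves one old failure (`case_2`, `refund_amount`), even though the total number of failing fields is unchanged.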

Generate audit reports

Per-field accuracy, plus a changelog of what broke and what was fixed. Compliance-ready evidence for your team.
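
Per-field accuracy can be computed by aggregating pass/fail results across cases. The sketch below assumes the same hypothetical result layout (case id mapped to per-field pass/fail); the function name and output format are illustrative and do not reproduce Rigr's report.

```python
from collections import defaultdict


def per_field_accuracy(results):
    """Return {field: fraction of cases in which that field passed}."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for fields in results.values():
        for field, ok in fields.items():
            total[field] += 1
            passed[field] += bool(ok)
    return {field: passed[field] / total[field] for field in total}


# Hypothetical results from one run over two cases.
run_results = {
    "case_1": {"refund_amount": True, "approved": True},
    "case_2": {"refund_amount": False, "approved": True},
}

accuracy = per_field_accuracy(run_results)
for field, value in sorted(accuracy.items()):
    print(f"{field}: {value:.1%}")
```

Reporting accuracy per field rather than per case shows which part of the output degraded, which a single aggregate score would hide.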

Not another LLM eval tool

Capability	DeepEval	Evidently	Rigr
LLM output quality	✓	✓	✗
Agent task verification	✗	✗	✓
Frozen baseline comparison	✗	✗	✓
Per-field regression detection	✗	✗	✓
Audit-ready compliance reports	✗	✗	✓
Zero agent code changes	✓	✓	✓

Enterprise

For teams deploying agents in regulated environments. SSO, audit logs, SOC 2, on-prem deployment, priority support. Your compliance team will thank you.

Book a call