Your AI agent is getting worse. Prove it isn't.

Open-source agent evaluation framework. Define expectations. Catch regressions before customers do. Enterprise-ready for regulated teams.

$ pip install rigr
$ rigr init
$ rigr test --agent my_agent.py

═══ Rigr Eval Report ═══
  5/5 cases | 23/24 fields | 95.8% 
  Baseline comparison: 0 new errors, 0 resolved
  ✓ PASS
<1s per test run
0 code changes in your agent
100% open source (Apache 2.0)

Agents break silently

Every model update, prompt change, or retrieval tweak risks degrading your agent. LLM eval tools test chat quality — they don't test whether your support agent still calculates refunds correctly or whether your compliance bot still catches policy violations.

Rigr tests what your agent does, not how it sounds.

How it works

1. Define expectations

JSON schema for what your agent must output. Field-level constraints. No ambiguous "looks good to me."

2. Write test cases

Inputs with expected outputs. Version-controlled, reviewable. The same cases run every time.

3. Freeze baselines

Lock known-good results. Every future run compares against them. Regressions are caught in the test run, not discovered by customers.

4. Generate audit reports

Per-field accuracy, plus a changelog of what broke and what was fixed. Compliance-ready evidence for your team.
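To make the four steps concrete, here is a minimal sketch of the underlying idea in plain Python. It is not Rigr's API or file format: every name in it (`CASES`, `good_agent`, `regressed_agent`, `run_cases`, `diff_against_baseline`, `field_accuracy`) is hypothetical, and exact-match expected values stand in for real schema constraints. It only illustrates per-field checks, a frozen baseline, and a report of new versus resolved errors.

```python
# Illustrative only: hypothetical names, not Rigr's actual API.

# Steps 1 and 2: test cases pair an input with the fields the agent must output.
CASES = [
    {"id": "refund-01",
     "input": {"price": 25.0, "within_window": True},
     "expected": {"refund_amount": 25.0, "currency": "USD", "approved": True}},
    {"id": "refund-02",
     "input": {"price": 40.0, "within_window": False},
     "expected": {"refund_amount": 0.0, "currency": "USD", "approved": False}},
]


def good_agent(inp):
    """Stand-in for a known-good agent version."""
    approved = inp["within_window"]
    return {"refund_amount": inp["price"] if approved else 0.0,
            "currency": "USD", "approved": approved}


def regressed_agent(inp):
    """Stand-in for a later version that refunds even when not approved."""
    return {"refund_amount": inp["price"],
            "currency": "USD", "approved": inp["within_window"]}


def run_cases(agent, cases):
    """Return {case_id: {field: passed}} by comparing each expected field."""
    results = {}
    for case in cases:
        actual = agent(case["input"])
        results[case["id"]] = {
            field: actual.get(field) == value
            for field, value in case["expected"].items()
        }
    return results


def failing_fields(results):
    """Collect (case_id, field) pairs that did not match."""
    return {(cid, field)
            for cid, fields in results.items()
            for field, ok in fields.items() if not ok}


def diff_against_baseline(baseline, current):
    """Step 3: compare a run with the frozen baseline."""
    old, new = failing_fields(baseline), failing_fields(current)
    return {"new_errors": sorted(new - old), "resolved": sorted(old - new)}


def field_accuracy(results):
    """Step 4: fraction of checked fields that matched."""
    checks = [ok for fields in results.values() for ok in fields.values()]
    return sum(checks) / len(checks)


baseline = run_cases(good_agent, CASES)      # frozen known-good results
current = run_cases(regressed_agent, CASES)  # a later run to compare
diff = diff_against_baseline(baseline, current)

print(f"{field_accuracy(current):.1%} fields correct")
print("new errors:", diff["new_errors"], "| resolved:", diff["resolved"])
```

In this sketch the regression shows up as a single failing field, `refund_amount` on case `refund-02`, even though the other fields in that case still pass. That is the sense in which a per-field comparison is narrower than a pass/fail check on the whole output.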

Not another LLM eval tool

Capability | DeepEval | Evidently | Rigr
LLM output quality
Agent task verification
Frozen baseline comparison
Per-field regression detection
Audit-ready compliance reports
Zero agent code changes

Enterprise

For teams deploying agents in regulated environments. SSO, audit logs, SOC 2, on-prem deployment, priority support. Your compliance team will thank you.

Book a call