Tag

open source

2 articles

Mar 23, 2026 4 min

Your model migration passed. Here's what the aggregate didn't show.

75% of AI agents break working behavior over time — including across model upgrades. Dashboards show the aggregate. Statistical comparison shows what moved underneath.

Mar 19, 2026 5 min

Aggregate metrics are a blind spot in agent evaluation

Why aggregate eval metrics hide AI agent regressions, and how statistical testing catches what aggregates miss.

AI agents evaluation open source testing