Tag

open source

2 articles

Your model migration passed. Here's what the aggregate didn't show.

75% of AI agents break working behavior over time — including across model upgrades. Dashboards show the aggregate. Statistical comparison shows what moved underneath.

Aggregate metrics are a blind spot in agent evaluation

Why aggregate eval metrics hide AI agent regressions, and how statistical testing catches what aggregates miss.