Tag: evaluation

5 articles

Jun 13, 2026 8 min

Your quality gate has two answers. The question underneath has three.

A CI gate has to answer merge-or-block, but the evidence underneath has three states: worse, fine, and not-enough-data-to-tell. Fusing the last two is how gates quietly lie, and separating them is the design shift I'm making in Kalibra.

AI agents, evaluation, statistics

Mar 23, 2026 4 min

Your model migration passed. Here's what the aggregate didn't show.

75% of AI agents break working behavior over time — including across model upgrades. Dashboards show the aggregate. Statistical comparison shows what moved underneath.

AI agents, evaluation, model migration, open source

Mar 20, 2026 9 min

When agent trace metrics lie: the span tree double-counting problem

When agent traces are trees, naive aggregation of cost, tokens, and step counts produces wrong numbers. Here's the problem, what major platforms do about it, and the concrete approaches that work.

AI agents, observability, OpenTelemetry, OpenInference, evaluation

Mar 19, 2026 5 min