Tag

AI

2 articles

How to detect benchmark contamination in LLMs

A model scores 92% on MMLU — but did it learn the concepts or memorize the answers? Four detection strategies, from first principles.

LLM-as-a-judge: how to evaluate AI without fooling yourself

LLM-as-a-judge from first principles: when to use it, how to design rubrics, the three biases that skew scores, and when something simpler will do.