The benchmark problem in AI is not a minor calibration issue. It is a fundamental question about what we are trying to measure, whether we can measure it at all, and what happens when the systems being measured are sophisticated enough to game the measurement without understanding it.
Start with the history. In the early 2020s, as large language models began demonstrating unexpected capabilities, the research community reached for the nearest available measuring sticks: standardised human exams. The bar exam. The USMLE medical licensing exam. The GRE. The LSAT. These were credible, well-validated instruments for measuring human expertise in specific domains. GPT-4 was reported to have scored around the 90th percentile on the bar exam. Headlines followed. "AI surpasses human experts."
What those headlines elided was a significant methodological problem. These tests were not designed to evaluate AI systems. They were designed to evaluate humans who had studied specific bodies of knowledge in specific educational settings. When a model trained on a corpus that almost certainly includes practice exams, study guides, and annotated answers scores highly on those same tests, it is not obvious what that score proves about the model's capability — as opposed to its training data overlap.
This is the contamination problem. The benchmarks most widely used in AI evaluation — MMLU, ARC, HellaSwag, WinoGrande — are public. Their questions are on the internet. Models trained on large internet scrapes are trained on data that likely includes the answers. Distinguishing genuine capability from retrieval of memorised test content has become increasingly difficult. The researchers who design evaluation suites have been sounding alarms about this for years; meanwhile, the headline-generating lab announcements have largely continued to use contaminated benchmarks as marketing instruments.
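One crude but common way to probe for this overlap is an n-gram match between benchmark items and the training corpus. A minimal sketch, with illustrative function names of my own; real decontamination pipelines are considerably more involved:

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(benchmark_item, training_corpus, n=8):
    """Fraction of the benchmark item's n-grams that appear verbatim
    in the training corpus -- a rough proxy for contamination."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = ngrams(training_corpus, n)
    return len(item_grams & corpus_grams) / len(item_grams)
```

A score near 1.0 suggests the item was seen verbatim during training; a score near 0.0 rules out only exact overlap, not paraphrased or translated copies — which is precisely why contamination is so hard to rule out in practice.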
The saturation problem compounds this. Goodhart's Law — when a measure becomes a target, it ceases to be a good measure — operates on AI benchmarks with particular speed. Once a benchmark becomes the standard for comparing models, the teams training models begin, consciously or not, optimising toward it. The benchmark then measures the optimisation process, not the underlying capability the benchmark was originally designed to proxy. The result is scores that climb while genuine capability gains are harder to verify.
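The gap between a saturated score and verifiable capability can be made concrete with a toy sketch (all names here are illustrative, not a real evaluation setup): a model that simply memorises the public benchmark saturates the metric while having no underlying capability at all.

```python
# A public benchmark whose questions are on the internet,
# and a fresh held-out set the model has never seen.
public_benchmark = [f"q{i}" for i in range(100)]
held_out = [f"q{i}" for i in range(100, 200)]

def memoriser(question, _seen=frozenset(public_benchmark)):
    """Answers correctly if and only if it has seen the question before."""
    return question in _seen

def benchmark_score(model, questions):
    """Fraction of questions the model answers correctly."""
    return sum(model(q) for q in questions) / len(questions)

# benchmark_score(memoriser, public_benchmark) -> 1.0 (perfect public score)
# benchmark_score(memoriser, held_out)         -> 0.0 (no capability at all)
```

Real models sit somewhere between the memoriser and a genuine generaliser, which is exactly why a climbing public score, on its own, verifies so little.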
The response from the research community has been to create harder benchmarks: competition mathematics problems (AIME, MATH), complex multi-step coding challenges (SWE-bench), graduate-level science questions (GPQA). These are more resistant to saturation — temporarily. The leading models are already approaching ceiling performance on some of them. The arms race between capability and evaluation continues.
The more interesting response, and the one with the most significant long-term implications, is LLM-as-a-judge: using a large language model to evaluate the outputs of another large language model. This approach has genuine advantages. It can assess long-form, open-ended responses that do not have binary correct/incorrect answers. It can handle nuance, tone, and argument quality in ways that rule-based evaluation cannot.
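In its simplest form, the technique is little more than a prompt template plus a parser. A minimal pairwise-judge sketch, where `call_model` is a hypothetical stand-in for whatever model API is actually in use:

```python
JUDGE_PROMPT = """You are evaluating two answers to the same question.
Question: {question}

Answer A: {answer_a}

Answer B: {answer_b}

Which answer is more accurate, relevant, and clearly reasoned?
Reply with exactly one letter: A or B."""

def judge_pair(call_model, question, answer_a, answer_b):
    """Ask a judge model to pick the better of two answers.

    `call_model` is any callable mapping a prompt string to the judge's
    text reply (hypothetical -- wire it to a real API as needed).
    """
    prompt = JUDGE_PROMPT.format(
        question=question, answer_a=answer_a, answer_b=answer_b
    )
    verdict = call_model(prompt).strip().upper()
    if verdict.startswith("A"):
        return "A"
    if verdict.startswith("B"):
        return "B"
    return None  # unparseable verdict: count it as a failed judgement
```

Note how much evaluation quality now hinges on the judge model behind `call_model` — which is exactly where the problems discussed next come in.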
It also has significant, largely unresolved problems. The judge model has its own biases, preferences, and failure modes. Research has shown that LLM judges systematically prefer longer responses, more confident phrasing, and outputs that match their own stylistic patterns — regardless of factual accuracy. The risk is not just circular validation between models of the same family; it is the creation of an evaluation ecosystem that is optimised for outputs that look good to a language model rather than outputs that are actually correct, useful, or safe.
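Some of these biases can be partially countered mechanically. Position bias, for instance, is commonly addressed by running the comparison in both orders and only counting a win that survives the swap. A sketch, assuming a `judge` callable with the hypothetical interface described in the comments:

```python
def debiased_compare(judge, question, answer_1, answer_2):
    """Compare two answers with the judge run in both presentation orders.

    `judge(question, first, second)` is assumed to return "A" if it
    prefers the first answer shown and "B" for the second (hypothetical
    interface). A preference counts only if it survives the order swap.
    """
    first = judge(question, answer_1, answer_2)   # answer_1 shown in slot A
    second = judge(question, answer_2, answer_1)  # answer_1 shown in slot B
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "tie"  # verdict flipped with position: treat as no preference
```

The swap cancels position bias but does nothing for length or style bias, and nothing at all for the deeper circularity: both runs still defer to the same judge's notion of what a good answer looks like.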
This matters beyond the technical. If even the engineers building these systems struggle to evaluate their performance objectively, what does it mean for the user who relies on them daily for information, advice, and decisions? The confidence projected by a fluent, well-structured AI response is not a reliable signal of accuracy. It is a signal of training quality. These are not the same thing.
The measurement problem in AI is, at its root, the problem of knowing what you do not know — and knowing when not to trust. These are the capacities that LLM benchmarks do not test, and that their users most urgently need.