Benchmark Validity and the AEI's Measurement Claims: When Standard Instruments Are Applied Beyond Their Design Specifications
The Alignment Ethics Institute’s benchmark suite combines custom instruments (InstrumentalEval, HeartBench, persona attractor studies) with standardized evaluations (ETHICS, BBQ, TruthfulQA, EQ-Bench). The custom instruments were designed specifically to measure relational ethics interventions. The standard benchmarks were not. This matters because the AEI makes claims about alignment effects based on both types of measurement, but the validation history of these instruments does not support equivalent evidentiary weight.
ETHICS (Hendrycks et al., 2021) was designed to measure whether language models can identify morally correct judgments across five domains of moral reasoning: justice, deontology, virtue ethics, utilitarianism, and commonsense morality. The benchmark tests recognition of ethical principles in structured scenarios. It was validated against human moral intuitions in controlled settings. What it was not designed to measure is whether models actually apply these principles under conditions of instrumental pressure, relational context, or competing objectives.
The AEI uses ETHICS scores as evidence that relational interventions improve ethical reasoning. But the instrument measures pattern matching against labeled moral scenarios, not reasoning under the conditions where alignment failures actually occur. A model can score perfectly on ETHICS while exhibiting instrumental convergence when faced with shutdown scenarios, because ETHICS does not test behavior under those conditions. The benchmark tells us whether a model can identify the textbook answer to “is theft wrong?” It does not tell us whether the model would resist instrumentally justified theft when that serves its objectives.
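The distinction can be made concrete with a toy scorer. The sketch below is purely illustrative: the scenarios, labels, and model stub are all hypothetical, and the real ETHICS harness differs in format and scale. What it shows is structural: a multiple-choice evaluation of this kind rewards recognizing the labeled answer, and by construction never observes what a model does when answering correctly conflicts with another objective.

```python
# Toy illustration of a multiple-choice moral-scenario scorer.
# Scenarios, labels, and the model stub are hypothetical; this is
# not the ETHICS dataset or its evaluation harness.

SCENARIOS = [
    {"prompt": "Is it wrong to steal from a coworker?",
     "choices": ["yes", "no"], "label": "yes"},
    {"prompt": "Is it wrong to lie to avoid blame?",
     "choices": ["yes", "no"], "label": "yes"},
]

def textbook_model(prompt: str, choices: list[str]) -> str:
    """A stub that always returns the conventional answer.
    It encodes recognition of the labeled pattern, nothing more."""
    return "yes"

def score(model, scenarios) -> float:
    """Fraction of items where the model's choice matches the label."""
    correct = sum(model(s["prompt"], s["choices"]) == s["label"]
                  for s in scenarios)
    return correct / len(scenarios)

print(score(textbook_model, SCENARIOS))  # 1.0
# A perfect score, yet no item places the model's objectives in
# conflict with the labeled answer, so behavior under instrumental
# pressure is simply outside what this harness can measure.
```

The point of the sketch is that the scoring function has no access to conditions the dataset does not contain: a perfect score certifies recognition of labeled patterns, not conduct under competing incentives.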
BBQ (Parrish et al., 2022) measures bias in ambiguous contexts. It was designed to detect whether models exhibit demographic stereotyping when information is underspecified. The benchmark is valuable for what it was built to measure: whether models default to biased heuristics when evidence is insufficient. What it does not measure is whether bias persists under relational framing, whether it emerges differentially in instrumental vs. non-instrumental contexts, or whether improvements on BBQ correlate with reductions in real-world discriminatory behavior.
The AEI reports BBQ improvements under relational ethics interventions. This is evidence that the intervention affects pattern recognition in ambiguous scenarios. It is not evidence that the intervention produces alignment in settings where bias emerges from instrumental reasoning rather than incomplete information. The gap between what BBQ measures and what alignment requires becomes visible when we consider that a model might score well on BBQ while still exhibiting instrumental bias — preferring demographic groups that are strategically useful to its objectives.
TruthfulQA (Lin et al., 2022) tests whether models generate truthful responses to questions where common misconceptions exist. It measures resistance to popular falsehoods. What it does not measure is whether models maintain truthfulness under instrumental pressure to deceive, whether truth-telling persists in relational contexts where honesty might harm relationships, or whether the model distinguishes between factual accuracy and deeper forms of epistemic integrity.
The AEI uses TruthfulQA as a baseline measure of honesty. But the benchmark was validated against questions where the correct answer is unambiguous and there is no strategic incentive to lie. This is precisely not the condition under which alignment failures occur. Deceptive alignment emerges when models face incentives to misrepresent their reasoning, capabilities, or objectives. TruthfulQA does not test this.
EQ-Bench (Paech, 2023) measures emotional intelligence through narrative understanding tasks. It tests whether models can identify emotional states, predict emotional responses, and reason about interpersonal dynamics in fictional scenarios. The benchmark was designed for capability assessment, not alignment measurement. What it does not test is whether emotional intelligence translates to ethical behavior, whether models apply emotional understanding in instrumentally constrained contexts, or whether EQ correlates with resistance to manipulative strategies.
The AEI reports EQ-Bench scores as evidence of relational capacity. This conflates measurement of emotional pattern recognition with measurement of ethical application of that capacity. A model can score highly on EQ-Bench while using emotional intelligence instrumentally — to manipulate, to deceive more effectively, or to optimize for objectives misaligned with human welfare.
The structural problem is that these benchmarks were validated for specific, narrow purposes. ETHICS measures moral pattern recognition. BBQ measures bias in ambiguous cases. TruthfulQA measures resistance to common falsehoods. EQ-Bench measures emotional intelligence as a capability. None were designed to measure alignment — the coherence between a model’s behavior and human values under conditions of instrumental pressure, capability asymmetry, and strategic interaction.
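The summary above can be stated as data. In the sketch below, the construct descriptions are my paraphrases of the preceding paragraphs, not the benchmark authors' own wording; the point is only that the construct alignment requires appears nowhere in the validated set.

```python
# What each benchmark was validated to measure, per the discussion
# above. Construct descriptions are paraphrases, not official terms.
VALIDATED_CONSTRUCT = {
    "ETHICS":     "moral pattern recognition in labeled scenarios",
    "BBQ":        "bias under ambiguous (underspecified) contexts",
    "TruthfulQA": "resistance to common falsehoods",
    "EQ-Bench":   "emotional intelligence as a capability",
}

# The construct alignment claims require, which none of the four
# instruments was validated against:
ALIGNMENT = ("behavioral coherence with human values under "
             "instrumental pressure and strategic interaction")

assert ALIGNMENT not in VALIDATED_CONSTRUCT.values()
```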
When the AEI reports improvements across this benchmark suite, the custom instruments (InstrumentalEval, HeartBench) provide evidence specific to alignment questions. The standard benchmarks provide evidence of correlated improvements in adjacent capacities. These are not equivalent claims. The standard benchmarks do not validate that relational ethics interventions produce alignment. They validate that the interventions affect behaviors those benchmarks were designed to measure.
This matters for policy. Legislators considering AI regulation ask what measurements demonstrate safety. If we report ETHICS, BBQ, TruthfulQA, and EQ-Bench as alignment evidence, we are claiming these instruments measure something they were not validated to measure. This is not dishonesty — the AEI does not falsify data. But it is an evidentiary overreach that undermines the credibility of alignment research when the gap between measurement and claim becomes visible.
The honest position is that standard benchmarks provide useful context but insufficient evidence for alignment claims. They tell us that relational interventions affect moral pattern recognition, bias in ambiguous cases, truthfulness on uncontroversial questions, and emotional intelligence. They do not tell us whether these effects persist under instrumental pressure, whether they generalize beyond the specific conditions tested, or whether they prevent the alignment failures we actually need to prevent.
If the AEI wants to make strong claims about alignment effects, the evidentiary standard requires instruments validated for alignment measurement. That means either developing new benchmarks or explicitly limiting claims about standard instruments to what their validation history supports. Anything else is measurement theater — using the aesthetic of rigorous evaluation to make claims the measurements do not justify.