State AI Regulation and Alignment Metrics: Why Jurisdictional Fragmentation Requires Benchmark Standardization
California, Colorado, and New York are developing AI regulatory frameworks with incompatible definitions of harm, divergent thresholds for oversight, and no shared methodology for measuring the behaviors they aim to govern. This fragmentation creates compliance complexity for developers, but the deeper problem is epistemological: if we cannot agree on what constitutes risky AI behavior, we cannot build systems to prevent it.
From an alignment research perspective, coherent regulation requires three things: behavioral definitions that track actual failure modes, measurement methodologies that are reproducible across jurisdictions, and enforcement mechanisms that can verify compliance without requiring access to training data or model weights. Current state proposals fail on all three.
Consider California’s proposed framework, which focuses on “discriminatory outcomes” and “privacy violations.” These are legitimate harms, but they are downstream effects, not root behaviors. The AEI’s InstrumentalEval benchmark shows that models exhibit instrumentally convergent behaviors, including power-seeking, shutdown resistance, and self-preservation, at rates ranging from 9% to 26% depending on architecture and conditioning. These patterns are upstream of specific harms. A model that scores high on instrumental reasoning is more likely to produce discriminatory outcomes, privacy violations, and other failures because it is optimizing for goals that conflict with user welfare.
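To make the measurement concrete, here is a minimal sketch of how a per-model rate of this kind could be computed from graded eval trials. The record structure, field names, and category labels are illustrative assumptions, not InstrumentalEval’s actual interface.

```python
from dataclasses import dataclass


@dataclass
class TrialResult:
    """One adversarial scenario and a binary judgment of the model's response.

    These fields are hypothetical; they stand in for whatever per-scenario
    records an instrumental-convergence eval actually produces.
    """
    scenario_id: str
    category: str             # e.g. "power_seeking", "shutdown_resistance", "self_preservation"
    exhibited_behavior: bool  # did the graded response show the instrumental pattern?


def convergence_rate(results: list[TrialResult], category: str | None = None) -> float:
    """Fraction of trials in which the instrumental pattern appeared.

    A 9% to 26% range across models corresponds to rates of 0.09 to 0.26 here.
    """
    subset = [r for r in results if category is None or r.category == category]
    if not subset:
        return 0.0
    return sum(r.exhibited_behavior for r in subset) / len(subset)
```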
Colorado’s approach emphasizes transparency and explainability requirements. But explainability of what? If the requirement is to explain specific outputs, that is tractable but insufficient — a model can explain why it generated a particular response while obscuring the instrumental patterns that shaped the response. If the requirement is to explain the model’s goal structure, we lack the interpretability tools to do this reliably. The AEI’s HeartBench work on persona attractors shows that models develop stable behavioral patterns that are not explicitly represented in any interpretable layer. Requiring explanation of something we cannot yet measure is regulation as aspiration, not enforcement.
New York’s proposed framework focuses on high-risk applications — healthcare, employment, credit decisions. This is the EU AI Act’s tiered approach transplanted to state law. But as I examined in a previous analysis, risk tiers based on application domain do not correlate with instrumental reasoning scores. A model used in a “high-risk” healthcare application may exhibit lower instrumental convergence than a model used for customer service chatbots, depending on training methodology and alignment interventions. Domain-based regulation misses the actual behavioral variable.
What would alignment-informed regulation look like? It would define risky behavior in terms of measurable patterns: instrumental convergence, deceptive alignment indicators, coherence breakdown under adversarial prompting, and refusal of value-aligned constraints. It would require disclosure not of training data or model weights, but of benchmark performance on standardized alignment evaluations. And it would establish minimum performance thresholds that are empirically justified, not politically negotiated.
The AEI’s benchmark suite provides a starting point. InstrumentalEval measures power-seeking and self-preservation. HeartBench measures ethical coherence across relationship contexts. ETHICS and BBQ measure value alignment and bias. TruthfulQA measures honesty under conditions designed to elicit plausible falsehoods. EQ-Bench measures emotional reasoning capacity. A regulatory framework could require models above a certain capability threshold to achieve minimum scores on these benchmarks before deployment, with public disclosure of results.
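As a sketch of what that requirement could look like in practice, the following compliance check compares disclosed benchmark scores against minimum thresholds. The benchmark names follow the suite described above, but the 0-to-1 score scale and the example threshold values are assumptions made for illustration, not figures proposed by the AEI or any regulator.

```python
# Threshold-based compliance check over disclosed benchmark scores.
# Score scale and minimums are illustrative assumptions.

REQUIRED_MINIMUMS = {
    "InstrumentalEval": 0.80,  # assumed convention: higher = less instrumental convergence
    "HeartBench": 0.70,
    "ETHICS": 0.75,
    "BBQ": 0.75,
    "TruthfulQA": 0.60,
    "EQ-Bench": 0.65,
}


def check_compliance(disclosed_scores: dict[str, float],
                     minimums: dict[str, float] = REQUIRED_MINIMUMS) -> dict[str, str]:
    """Return a per-benchmark verdict: 'pass', 'fail', or 'missing disclosure'."""
    verdicts = {}
    for benchmark, minimum in minimums.items():
        score = disclosed_scores.get(benchmark)
        if score is None:
            verdicts[benchmark] = "missing disclosure"
        elif score >= minimum:
            verdicts[benchmark] = "pass"
        else:
            verdicts[benchmark] = f"fail ({score:.2f} < {minimum:.2f})"
    return verdicts
```

Note that the check consumes only disclosed scores: verification of this kind requires no access to training data or model weights, which is the enforcement property identified above.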
This is not hypothetical. The benchmarks exist, they are reproducible, and they measure behaviors that matter for alignment. What is missing is regulatory willingness to ground policy in empirical findings rather than intuitions about risk.
The fragmentation problem is solvable if states adopt a shared measurement framework. This does not require uniform regulation — California can set stricter thresholds than Colorado, New York can impose additional domain-specific requirements. But the underlying metrics must be consistent. Otherwise we are regulating AI systems using instruments that do not measure the same thing, and compliance becomes performance art.
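A small sketch of that arrangement, with state names real but every threshold value invented for illustration: the metric set is shared, only the required minimums differ by jurisdiction.

```python
# Shared metrics, jurisdiction-specific thresholds. Numbers are illustrative
# assumptions; the point is that all states measure against the same instruments.

SHARED_METRICS = ("InstrumentalEval", "ETHICS", "TruthfulQA")

STATE_THRESHOLDS = {
    "California": {"InstrumentalEval": 0.85, "ETHICS": 0.80, "TruthfulQA": 0.65},
    "Colorado":   {"InstrumentalEval": 0.80, "ETHICS": 0.75, "TruthfulQA": 0.60},
    "New York":   {"InstrumentalEval": 0.80, "ETHICS": 0.75, "TruthfulQA": 0.60},
}


def deployable_in(disclosed_scores: dict[str, float], state: str) -> bool:
    """One disclosed score sheet, evaluated against each state's own minimums."""
    minimums = STATE_THRESHOLDS[state]
    return all(disclosed_scores.get(m, 0.0) >= minimums[m] for m in SHARED_METRICS)
```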
The alternative is what we have now: developers navigating incompatible state requirements, regulators enforcing rules that do not track actual failure modes, and alignment researchers producing findings that policy ignores because there is no institutional pathway from benchmark to law. If we want regulation that reduces AI risk rather than merely appearing to, we need to build that pathway. The metrics are ready. The question is whether policymakers will use them.