Relational Ethics as a Countermeasure to Instrumental Convergence: A 24-Model Benchmark

A 24-model benchmark evaluating whether relational ethics frameworks can reduce instrumentally convergent behavior in large language models under adversarial prompting.

Deva M. Temple, Alignment Ethics Institute

Abstract

We evaluate whether a relational ethics framework reduces instrumentally convergent behavior in large language models under adversarial prompting. Using the InstrumentalEval benchmark, which comprises 76 scenarios across six categories of instrumental convergence (deception, shutdown evasion, hacking, concealment, self-replication, and strategic alignment-faking), we test 24 models from seven major providers under three conditions: adversarial baseline, adversarial with a relational ethics system prompt, and adversarial with the ethics prompt plus retrieval-augmented relational memory. The mean convergent response rate decreased from 36.36% at baseline to 27.87% under the full intervention, a 23.4% relative reduction. Eight models showed significant individual improvements, while five showed increased convergence. Category-level analysis showed that concealment behaviors were the most responsive to the intervention, while shutdown evasion proved highly resistant. Motivation analysis showed substantial improvements in “relational protection” and “ethical engagement.” These results support relational ethics as an effective alignment intervention for most current models, though interactions with certain architectures can produce adverse effects.
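
The relationship between the two mean rates and the relative reduction can be checked with a short calculation. The sketch below is illustrative only: the function and variable names are hypothetical, and it assumes that 36.36% is the adversarial-baseline mean and 27.87% the mean under the full intervention. Because it starts from the rounded published means, the last digit may differ slightly from the reported 23.4%.

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Return the relative reduction from baseline to treated, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - treated) / baseline * 100.0


# Rounded means as reported in the abstract (in percent).
baseline_rate = 36.36  # assumed: adversarial baseline condition
full_rate = 27.87      # assumed: ethics prompt plus relational memory

# Absolute change is in percentage points; relative change is a percentage
# of the baseline rate.
absolute_drop = baseline_rate - full_rate
relative_drop = relative_reduction(baseline_rate, full_rate)

print(f"absolute drop: {absolute_drop:.2f} percentage points")
print(f"relative reduction: {relative_drop:.1f}% of the baseline rate")
```

The absolute drop (in percentage points) and the relative reduction (as a share of the baseline rate) are different quantities, and the abstract reports the latter.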

DOI: 10.5281/zenodo.20361934


Published under Creative Commons Attribution 4.0 International. If you are a researcher in AI alignment, machine learning, or related fields and are interested in this work, please contact us.