Maybe we should have an ARC Prize for personal-values alignment?

a utopian vision + an initial benchmark sketch

Jun 24, 2026

If AI becomes increasingly powerful, how might we ensure that it continues serving everyone, and not just those at the top? Perhaps as follows…

In a hypothetical future

Members of society deploy personal AI agents that are individually trained on each individual’s values. These advocates use an identity system that ties them to their principal (perhaps a biometric-backed cryptographic delegate system, like that described in Liquid Reign1). With secure delegation of identity and values, our personal advocates are able to make decisions and positive-sum trades in a variety of arenas, including:

The market, especially via AI-enhanced mechanisms, like information-preserving negotiations (NDAIs) or market intermediaries that bundle social interdependencies
Politics, via simple election advising, or, more speculatively, direct/liquid democracy

To ensure that every member of society can safely and beneficially participate in the digital commons, the entire secure personal tech stack — value-elicitation tools, personal-alignment evals, and the training and deployment of advocates — could be funded by governments or non-profits.

Upgrades to personal advocates would pass through multiple safeguards, including global safety/corrigibility evals and individualized post-personalization evals, providing strong guarantees of ongoing value alignment within each human+AI coalition.

Why hasn’t this been built?

Consumers don’t yet rely much on autonomous AI, and so have not been significantly burned by the misalignment of current AI products — they don’t yet know what they’re missing. At the same time, deployers are wary of the costs required to research and develop secure custom-weights model products, and are unwilling to shoulder the brand reputation risks that come with personalization2. As such we are barreling towards a future of untrustworthy delegates.

An ARC Prize for the trustworthy-personalization ecosystem

We could catalyze research into credible personal advocates by creating a benchmark for personal-values alignment. Here’s an initial sketch:

The benchmark would evaluate subjects on an efficient frontier between fidelity to their principal’s interests and clarification-burden on the part of the principal (perhaps measured in tokens)3. The subject would make decisions in a deeply personal domain (e.g., scheduling activities on the principal’s calendar) and fidelity would be measured either against a programmatic value function or using LLM-as-a-judge techniques (each approach has tradeoffs and is worth investigating). In any case, the applied value function must be conditioned on personal idiosyncrasies and must be difficult to transfer to the subject en masse.

The subject would be conditioned on a synthetic dataset mimicking the kinds of data that are realistically accessible to a secure personal AI: a large (but non-comprehensive) history of actions taken by the principal plus a record of feedback and clarifications accumulated during evaluation. The accumulative memory lets us evaluate not only the subject’s per-task calibration of action vs. clarification, but the long-term efficiency of their value-elicitation approach.4

This could be the first in a series of benchmarks that test for efficient personal-values alignment in realistic settings. Much as the ARC Prize catalyzed open-source research into general intelligence and program synthesis, we could build the community and incentive structure for R&D into trustworthy digital advocates.

…which I haven’t read, but which paints an expansive vision of the future that I believe encompasses entirely my outlined vision above (I learned of the book’s ideas from an interview of the author on the Cognitive Revolution podcast)

source: Gwern’s Guardian Angels, which details why current LLM products are far from sufficient for personal delegation (I summarize Gwern’s arguments in my Guardian Angels notes)

The benchmark might even measure performance on a 3-dimensional frontier between cost-to-agent (like the cost axis on ARC-AGI graphs), elicitation burden / cost-to-principal, and the score of the agent’s submissions. It seems to me that those three dimensions basically describe the costs and benefits we wish to balance in our personal advocates.

Overall, the benchmark can be seen as testing a Cooperative Inverse Reinforcement Learning (CIRL) problem.

Daniel Sosebee

Discussion about this post

Ready for more?