Jackpot! Alignment as a Maximal Lottery

Document record

Date

31 January 2025

Document type
Scope
Identifier
  • 2501.19266
Collection

arXiv

Organization

Cornell University




Cite this document

Marc Lanctot et al., "Jackpot! Alignment as a Maximal Lottery", arXiv - economics



Abstract

Reinforcement Learning from Human Feedback (RLHF), the standard for aligning Large Language Models (LLMs) with human values, is known to fail to satisfy properties that are intuitively desirable, such as respecting the preferences of the majority (Ge et al., 2024). To overcome these issues, we propose the use of a probabilistic Social Choice rule called maximal lotteries as a replacement for RLHF. We show that a family of alignment techniques, namely Nash Learning from Human Feedback (NLHF; Munos et al., 2023) and variants, approximate maximal lottery outcomes and thus inherit their beneficial properties. We confirm experimentally that our proposed methodology handles situations that arise when working with preferences more robustly than standard RLHF, including supporting the preferences of the majority, providing principled ways of handling non-transitivities in the preference data, and being robust to irrelevant alternatives. This results in systems that better incorporate human values and respect human intentions.
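
For context, the maximal lottery the abstract refers to can be characterized as an optimal mixed strategy of the symmetric zero-sum game whose payoff matrix holds the pairwise majority margins between alternatives. The sketch below is illustrative only and is not code from the paper: it assumes SciPy's linear-programming solver, a hypothetical maximal_lottery helper, and an example margin matrix M chosen to form a three-alternative Condorcet cycle.

```python
# A minimal sketch (assumption, not the paper's implementation) of computing a
# maximal lottery by linear programming. margins[i, j] is the fraction of
# comparisons preferring alternative i over j minus the fraction preferring j
# over i (skew-symmetric). An optimal strategy p of the symmetric zero-sum game
# with this payoff matrix is a maximal lottery: no alternative is preferred to
# the lottery by a majority.
import numpy as np
from scipy.optimize import linprog


def maximal_lottery(margins: np.ndarray) -> np.ndarray:
    """Return a probability vector p with (p^T M)_j >= 0 for every column j."""
    n = margins.shape[0]
    # Variables: p_1..p_n and the game value v; maximize v (linprog minimizes).
    c = np.zeros(n + 1)
    c[-1] = -1.0
    # For every pure column strategy j: v - sum_i p_i * margins[i, j] <= 0.
    A_ub = np.hstack([-margins.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probabilities sum to one; v is free in sign (it equals 0 at the optimum).
    A_eq = np.concatenate([np.ones(n), [0.0]]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:n]


# Condorcet cycle over three responses (a > b, b > c, c > a, each by a 2/3
# majority): the maximal lottery is uniform, whereas any deterministic rule
# must pick an alternative that a majority prefers to replace.
M = np.array([[0.0,  1/3, -1/3],
              [-1/3, 0.0,  1/3],
              [1/3, -1/3,  0.0]])
print(maximal_lottery(M))  # approximately [1/3, 1/3, 1/3]
```

Under these assumptions, the same program puts all probability on a strict Condorcet winner when one exists, and when several optimal strategies exist (e.g., exact ties) the solver returns just one of them.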

