This is a collection of research papers on Reinforcement Learning with Human Feedback (RLHF). The repository will be continuously updated to track the frontier of RLHF.
Feel free to follow and star the repository!
The idea of RLHF is to use methods from reinforcement learning to directly optimize a language model with human feedback. RLHF has made it possible to begin aligning a language model trained on a general corpus of text data with complex human values.
- RLHF for Large Language Model (LLM)
- RLHF for Video Game (e.g. Atari)
(The following section was automatically generated by ChatGPT)
RLHF typically refers to "Reinforcement Learning with Human Feedback". Reinforcement Learning (RL) is a type of machine learning that involves training an agent to make decisions based on feedback from its environment. In RLHF, the agent also receives feedback from humans in the form of ratings or evaluations of its actions, which can help it learn more quickly and accurately.
RLHF is an active research area in artificial intelligence, with applications in fields such as robotics, gaming, and personalized recommendation systems. It seeks to address the challenges of RL in scenarios where the agent has limited access to feedback from the environment and requires human input to improve its performance.
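As a rough illustration of an agent that learns from human ratings of its actions, the sketch below runs a simple epsilon-greedy loop in Python. The action names, the `simulated_rater` function, and the 1-to-5 rating scale are all hypothetical stand-ins for a real human rater; this is a toy under those assumptions, not a method taken from any paper listed here.

```python
import random


def train_with_ratings(actions, rate, steps=200, epsilon=0.1, seed=0):
    """Estimate the average human rating of each action with epsilon-greedy."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in actions}
    count = {a: 0 for a in actions}

    def update(action):
        rating = rate(action)  # feedback from the (simulated) human
        count[action] += 1
        # Incremental running mean of the ratings seen for this action.
        value[action] += (rating - value[action]) / count[action]

    # Try every action once so each one has an initial estimate.
    for a in actions:
        update(a)

    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.choice(actions)  # explore
        else:
            a = max(actions, key=lambda x: value[x])  # exploit
        update(a)
    return value


def simulated_rater(action):
    # Hypothetical fixed ratings standing in for a human evaluator.
    return {"rude": 1.0, "terse": 3.0, "helpful": 5.0}[action]


values = train_with_ratings(["rude", "terse", "helpful"], simulated_rater)
best = max(values, key=values.get)
print(best, values)
```

In a real system the ratings would be noisy and expensive to collect, which is one reason many RLHF pipelines fit a reward model to the feedback instead of querying a human at every step.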
Reinforcement Learning with Human Feedback (RLHF) is a rapidly developing area of research in artificial intelligence, and several advanced techniques have been developed to improve the performance of RLHF systems. Here are some examples:
- Inverse Reinforcement Learning (IRL): IRL is a technique that allows the agent to learn a reward function from human feedback, rather than relying on pre-defined reward functions. This makes it possible for the agent to learn from more complex feedback signals, such as demonstrations of desired behavior.
- Apprenticeship Learning: Apprenticeship learning is a technique that combines IRL with supervised learning to enable the agent to learn from both human feedback and expert demonstrations. This can help the agent learn more quickly and effectively, as it is able to learn from both positive and negative feedback.
- Interactive Machine Learning (IML): IML is a technique that involves active interaction between the agent and the human expert, allowing the expert to provide feedback on the agent's actions in real-time. This can help the agent learn more quickly and efficiently, as it can receive feedback on its actions at each step of the learning process.
- Human-in-the-Loop Reinforcement Learning (HITLRL): HITLRL is a technique that involves integrating human feedback into the RL process at multiple levels, such as reward shaping, action selection, and policy optimization. This can help to improve the efficiency and effectiveness of the RLHF system by taking advantage of the strengths of both humans and machines.
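The idea of learning a reward function from human feedback, rather than hand-coding one, can be sketched with pairwise comparisons. The example below assumes a Bradley-Terry preference model, which is a common choice in the RLHF literature but is an assumption here, not something the list above prescribes. The item names and preference pairs are made up for illustration.

```python
import math


def fit_rewards(items, preferences, lr=0.1, epochs=500):
    """Fit one scalar reward per item from (winner, loser) preference pairs."""
    reward = {item: 0.0 for item in items}
    for _ in range(epochs):
        for winner, loser in preferences:
            # Bradley-Terry: P(winner preferred) = sigmoid(r_winner - r_loser)
            diff = reward[winner] - reward[loser]
            p = 1.0 / (1.0 + math.exp(-diff))
            # Gradient of log p with respect to r_winner is (1 - p);
            # with respect to r_loser it is -(1 - p).
            grad = 1.0 - p
            reward[winner] += lr * grad
            reward[loser] -= lr * grad
    return reward


# Hypothetical human judgments: each pair is (preferred, rejected).
prefs = [("b", "a"), ("c", "b"), ("c", "a"), ("b", "a")]
rewards = fit_rewards(["a", "b", "c"], prefs)
ranking = sorted(rewards, key=rewards.get, reverse=True)
print(ranking)
```

Only reward differences matter under this model, so the absolute values are arbitrary. A policy optimizer such as PPO would then maximize the learned reward, usually with a penalty for drifting too far from a reference policy.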
Here are some examples of Reinforcement Learning with Human Feedback (RLHF):
- Game Playing: In game playing, human feedback can help the agent learn strategies and tactics that are effective in different game scenarios. For example, in the popular game of Go, human experts can provide feedback to the agent on its moves, helping it improve its gameplay and decision-making.
- Personalized Recommendation Systems: In recommendation systems, human feedback can help the agent learn the preferences of individual users, making it possible to provide personalized recommendations. For example, the agent could use feedback from users on recommended products to learn which features are most important to them.
- Robotics: In robotics, human feedback can help the agent learn how to interact with the physical environment in a safe and efficient manner. For example, a robot could learn to navigate a new environment more quickly with feedback from a human operator on the best path to take or which objects to avoid.
- Education: In education, human feedback can help the agent learn how to teach students more effectively. For example, an AI-based tutor could use feedback from teachers on which teaching strategies work best with different students, helping to personalize the learning experience.
You can also visit this link to get an AI-enhanced paper reading experience.
format:
- [title](paper link) [links]
- author1, author2, and author3...
- publisher
- keyword
- code
- experiment environments and datasets
-
Why DPO is a Misspecified Estimator and How to Fix It
- Aditya Gopalan, Sayak Ray Chowdhury, Debangshu Banerjee
- Keyword: DPO, RLHF, Preference, Alignment, LLM
-
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
- Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson
- Keyword: RLHF, Preference, Alignment, Safety, Human Feedback
-
Multiplayer Nash Preference Optimization
- Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi
- Keyword: PPO, RLHF, Preference, Alignment, LLM
-
Token-Importance Guided Direct Preference Optimization
- Ning Yang, Hai Lin, Yibo Liu, Baoliang Tian, Guoqing Liu, Haijun Zhang
- Keyword: DPO, RLHF, Preference, Alignment, LLM
-
SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety
- Geon-Hyeong Kim, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae, Youngsoo Jang, Moontae Lee
- Keyword: DPO, RLHF, Reward Model, Preference, Alignment
-
BaseReward: A Strong Baseline for Multimodal Reward Model
- YiFan Zhang, Haihua Yang, Huanyu Zhang, Yang Shi, Zezhou Chen, Haochen Tian, Chaoyou Fu, Kai Wu, Bo Cui, Xu Wang, Jianfei Pan, Haotian Wang, Zhang Zhang, Liang Wang
- Keyword: RLHF, Reward Model, Preference, Multimodal, LLM
-
The Alignment Auditor: A Bayesian Framework for Verifying and Refining LLM Objectives
- Matthieu Bou, Nyal Patel, Arjun Jagota, Satyapriya Krishna, Sonali Parbhoo
- Keyword: RLHF, Preference, Alignment, Safety, LLM
-
Uni-DPO: A Unified Paradigm for Dynamic Preference Optimization of LLMs
- Shangpin Peng, Weinong Wang, Zhuotao Tian, Senqiao Yang, Xing W, Haotian Xu, Chengquan Zhang, Takashi Isobe, Baotian Hu, Min Zhang
- Keyword: DPO, RLHF, Preference, Multimodal, LLM
-
Learning to summarize user information for personalized reinforcement learning from human feedback
- HyunJi Nam, Yanming Wan, Mickel Liu, Peter F. Ahnn, Jianxun Lian, Natasha Jaques
- Keyword: RLHF, Reward Model, Preference, Alignment, LLM
-
Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding
- Yifan Zhu, Huiqiang Rong, Haoran Luo
- Keyword: RLHF, LLM, Token-level, Reinforcement Learning, Human Feedback
-
Pretrain Value, Not Reward: Decoupled Value Policy Optimization
- Chenghua Huang, Lu Wang, Fangkai Yang, Pu Zhao, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan
- Keyword: RLHF, Reward Model, Preference, LLM, Optimization
-
- Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang
- Keyword: DPO, Preference, Optimization, Human Feedback
-
Unifying Stable Optimization and Reference Regularization in RLHF
- Li He, Qiang Qu, He Zhao, Stephen Wan, Dadong Wang, Lina Yao, Tongliang Liu
- Keyword: RLHF, Preference, Alignment, Optimization, Reinforcement Learning
-
Text2Grad: Reinforcement Learning from Natural Language Feedback
- Hanyang Wang, Lu Wang, Chaoyun Zhang, Tianjun Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
- Keyword: RLHF, Reward Model, Alignment
-
ARMOR: Aligning Secure and Safe Large Language Models via Meticulous Reasoning
- Zhengyue Zhao, Yingzi Ma, Somesh Jha, Marco Pavone, Patrick McDaniel, Chaowei Xiao
- Keyword: RLHF, Alignment, Safety, LLM, Optimization
-
Reward Model Routing in Alignment
- Xinle Wu, Yao Lu
- Keyword: RLHF, Reward Model, Preference, Alignment, LLM
-
- Shuhang Chen, Yunqiu Xu, Junjie Xie, Aojun Lu, Tao Feng, Zeying Huang, Zhang Ning, Yi Sun, Yi Yang, Hangjie Yuan
- Keyword: Reward Model, Multimodal, Optimization
-
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning
- Gokul Swamy, Sanjiban Choudhury, Wen Sun, Steven Wu, Drew Bagnell
- Keyword: PPO, Reward Model, Preference, Reinforcement Learning
-
General Exploratory Bonus for Optimistic Exploration in RLHF
- Wendi Li, Changdae Oh, Sharon Li
- Keyword: RLHF, Alignment, Reinforcement Learning, Human Feedback
-
RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment
- Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu
- Keyword: DPO, RLHF, Preference, Alignment, LLM
-
Learning Correlated Reward Models: Statistical Barriers and Opportunities
- Yeshwanth Cherapanamjeri, Constantinos Costis Daskalakis, Gabriele Farina, Sobhan Mohammadpour
- Keyword: RLHF, Reward Model, Preference, Reinforcement Learning, Human Feedback
-
Verification and Co-Alignment via Heterogeneous Consistency for Preference-Aligned LLM Annotations
- Cheng Chen, Haiyan Yin, Ivor Tsang
- Keyword: RLHF, Preference, Alignment, LLM
-
Disentangling Length Bias in Preference Learning via Response-Conditioned Modeling
- Jianfeng Cai, Jinhua Zhu, Ruopei Sun, Yue Wang, Li Li, Wengang Zhou, Houqiang Li
- Keyword: DPO, RLHF, Reward Model, Preference, LLM
-
Enforcing Axioms for AI Alignment under Loss-Based Rules
- Alexandros Hollender, Sonja Kraiczy
- Keyword: RLHF, Reward Model, Preference, Alignment, Reinforcement Learning
-
OPPO: Accelerating PPO-based RLHF via Pipeline Overlap
- Kaizhuo Yan, YingJie Yu, Yifan Yu, Haizhong Zheng, Fan Lai
- Keyword: PPO, RLHF, Reward Model, Preference, LLM
-
QuRL: Rubrics As Judge For Open-Ended Question Answering
- Xiyu Wei, Qingwei Zong, Xiaoguang Li, Eugene J. Yu, Sujian Li
- Keyword: LLM, Optimization, Reinforcement Learning, Human Feedback
-
Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations
- Xinyi Yang, Liang Zeng, Heng Dong, Chao Yu, Xiaoran Wu, Huazhong Yang, Yu Wang, Milind Tambe, Tonghan Wang
- Keyword: RLHF, LLM, Reinforcement Learning
-
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
- Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu
- Keyword: RLHF, Reward Model, Preference, Alignment, Safety
-
Stackelberg Learning from Human Feedback: Preference Optimization as a Sequential Game
- Barna Pásztor, Thomas Kleine Buening, Andreas Krause
- Keyword: RLHF, Preference, Alignment, Nash, Optimization
-
Swap-guided Preference Learning for Personalized Reinforcement Learning from Human Feedback
- Gihoon Kim, Euntai Kim
- Keyword: RLHF, Reward Model, Preference, Reinforcement Learning, Human Feedback
-
Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback
- Amirhossein Afsharrad, Ruida Zhou, Luca Viano, Sanjay Lall, Mohammad Ghavamzadeh
- Keyword: Reward Model, Preference, Safety, Human Feedback
-
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
- Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
- Keyword: RLHF, Reward Model, Preference, Alignment, LLM
-
Reward Models Inherit Value Biases from Pretraining
- Brian Christian, Jessica A F Thompson, Elle, Vincent Adam, Hannah Rose Kirk, Christopher Summerfield, Tsvetomira Dumbalska
- Keyword: Reward Model, Preference, Alignment, Safety, LLM
-
Semantic-aware Wasserstein Policy Regularization for Large Language Model Alignment
- Byeonghu Na, Hyungho Na, Yeongmin Kim, Suhyeon Jo, HeeSun Bae, Mina Kang, Il-chul Moon
- Keyword: RLHF, Preference, Alignment, LLM, Reinforcement Learning
-
RewardBench 2: Advancing Reward Model Evaluation
- Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Morrison, Noah A. Smith, Hannaneh Hajishirzi, Nathan Lambert
- Keyword: RLHF, Reward Model, Preference, Alignment, Safety
-
Causally Robust Reward Learning from Reason-Augmented Preference Feedback
- Minjune Hwang, Yigit Korkmaz, Daniel Seita, Erdem Biyik
- Keyword: Reward Model, Preference
-
COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences
- Yixin Liu, Argyris Oikonomou, Weiqiang Zheng, Yang Cai, Arman Cohan
- Keyword: RLHF, Preference, Alignment, Nash, Optimization
-
Displacement-Resistant Extensions of DPO with Nonconvex $f$-Divergences
- Idan Pipano, Shoham Sabach, Kavosh Asadi, Mohammad Ghavamzadeh
- Keyword: DPO, RLHF
-
Keep the Best, Forget the Rest: Reliable Alignment with Order-Aware Preference Optimization
- Jiahui Zhu, Yuanjie Shi, Xiyue Peng, Xin Liu, Yan Yan, Honghao Wei
- Keyword: DPO, PPO, RLHF, Preference, Alignment
-
Cultivating Pluralism In Algorithmic Monoculture: The Community Alignment Dataset
- Lily H Zhang, Smitha Milli, Karen Long Jusko, Jonathan Smith, Brandon Amos, Wassim Bouaziz, Manon Revel, Jack Kussman, Yasha Sheynin, Lisa Titus, Bhaktipriya Radharapu, Jane Yu, Vidya Sarma, Kristopher Rose, Maximilian Nickel
- Keyword: Preference, Alignment, LLM
-
Fair Reinforcement Learning for Just AI
- Ezgi Korkmaz
- Keyword: Preference, Alignment, Optimization, Reinforcement Learning, Human Feedback
-
Robust Reward Modeling via Causal Rubrics
- Pragya Srivastava, Harman Singh, Rahul Madhavan, Gandharv Patil, Sravanti Addepalli, Arun Suggala, Rengarajan Aravamudhan, Soumya Sharma, Anirban Laha, Aravindan Raghuveer, Karthikeyan Shanmugam, Doina Precup
- Keyword: DPO, Reward Model, Alignment, Safety, LLM
-
Escaping Policy Contraction: Contraction-Aware PPO (CaPPO) for Stable Language Model Fine-Tuning
- Dun Yuan, Di Wu, Xue Liu
- Keyword: PPO, RLHF, Alignment, Optimization, Reinforcement Learning
-
Beyond Pairwise: Empowering LLM Alignment With (Ranked) Choice Modeling
- Yuxuan Tang, Yifan Feng
- Keyword: DPO, PPO, Preference, Alignment, LLM
-
Learning Ordinal Probabilistic Reward from Preferences
- Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, Min Yang
- Keyword: Reward Model, LLM
-
Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance
- Zhuo Li, Pengyu Cheng, Zhechao Yu, FeifeiTong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, erchao.zec, xiaoxi jiang, guanjunjiang
- Keyword: RLHF, Reward Model, Preference, LLM, Optimization
-
Alignment-Weighted DPO: A principled reasoning approach to improve safety alignment
- Mengxuan Hu, Vivek Datla, Anoop Kumar, Zihan Guan, Sheng Li, Alfy Samuel, Daben Liu
- Keyword: DPO, RLHF, Preference, Alignment, Safety
-
Evaluating and Improving Cultural Awareness of Reward Models for LLM Alignment
- Hongbin Zhang, Kehai Chen, Xuefeng Bai, Yang Xiang, Min Zhang
- Keyword: Reward Model, Preference, Alignment, LLM, Reinforcement Learning
-
Balancing the Experts: Unlocking LoRA-MoE for GRPO via Mechanism-Aware Rewards
- Changlian Ma, Zizheng Huang, Xiangyu Zeng, Yi Wang, Cheng Liang, Kun Tian, Xinhai Zhao, Limin Wang
- Keyword: Alignment, Multimodal, Optimization, Reinforcement Learning
-
Bradley-Terry and Multi-Objective Reward Modeling Are Complementary
- Zhiwei Zhang, Hui Liu, Xiaomin Li, Zhenwei Dai, Jingying Zeng, Fali Wang, Minhua Lin, Ramraj Chandradevan, Linlin Wu, Zhen Li, Chen Luo, Zongyu Wu, Xianfeng Tang, Qi He, Suhang Wang
- Keyword: RLHF, Reward Model, Preference, LLM, Reinforcement Learning
-
Beyond RLHF and NLHF: Population-Proportional Alignment under an Axiomatic Framework
- Kihyun Kim, Jiawei Zhang, Asuman E. Ozdaglar, Pablo A. Parrilo
- Keyword: Preference, Alignment
-
ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment
- Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low
- Keyword: DPO, Reward Model, Preference, Alignment, LLM
-
BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models
- Yuming Li, Yikai Wang, Yuying Zhu, Zhongyu Zhao, Ming Lu, Qi She, Shanghang Zhang
- Keyword: Preference, Alignment, Optimization
-
Safety Game: Inference-Time Alignment of Black-Box LLMs via Constrained Optimization
- Tuan Nguyen, Long Tran-Thanh
- Keyword: Alignment, Safety, LLM, Reinforcement Learning, Human Feedback
-
Threshold-Guided Optimization for Visual Generative Models
- Jinbin Bai, Yu Lei, Qingyu Shi, Aosong Feng, Yi Xin, Zhuoran Zhao, Fei Shen, Kaidong Yu, Xiangtai Li
- Keyword: Reward Model, Preference, Alignment, Diffusion, Optimization
-
Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients
- Omar Elmansouri, Fathinah Izzati, Mohamed El Amine Seddik, Salem Lahlou
- Keyword: RLHF, Reward Model, LLM, Optimization, Reinforcement Learning
-
Controllable and explainable personality sliders for LLMs at inference time
- Florian Hoppe, David Khachaturov, Robert Mullins, Mark Huasong Meng
- Keyword: PPO, RLHF, Alignment, LLM, Optimization
-
PS-PPO: Prefix-Sampling PPO for Critic-Free RLHF
- Doo Hwan Hwang, Kee-Eung Kim
- Keyword: PPO, RLHF, Optimization, Reinforcement Learning, Human Feedback
-
Unbiased Reward Modeling from Implicit Preference
- Eric Wang, Haocheng Yang, Licheng Pan, Lei Shen, Xiaoxi Li, Yinuo Wang, Zhichao Chen, Yuan Lu, Haoxuan Li, Zhouchen Lin
- Keyword: RLHF, Reward Model, Preference, Reinforcement Learning, Human Feedback
-
- Itai Shapira, Gerdus Benade, Ariel Procaccia
- Keyword: Preference, Alignment, Optimization, Human Feedback
-
Real-Time Aligned Reward Model beyond Semantics
- Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Huaqiu Li, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang
- Keyword: RLHF, Reward Model, Preference, Alignment, LLM
-
Chasing Moving Targets with Online Self-Play Reinforcement Learning for Safer Language Models
- Mickel Liu, Liwei Jiang, Yancheng Liang, Simon Du, Yejin Choi, Tim Althoff, Natasha Jaques
- Keyword: RLHF, Reward Model, Alignment, Safety, LLM
-
B-Spar: Bayesian Sparse-Reward Modeling for RL-based Image Editing
- Shusong Xu, Peiye Liu, Yongbin Liu, Bangjie Yin, Zhaomang Sun, Zhenyu Chen, Tianyi Zheng, Peng-Tao Jiang, Jian Zhang, Yuzhao Wang, Jinwei Chen, Zhen Gu, Bo Li
- Keyword: Reward Model, Alignment, Multimodal, LLM, Optimization
-
Calibrated Preference Learning: The Case of Label Ranking
- Santo Thies, Viktor Bengs, Timo Kaufmann, Sebastian Vollmer, Eyke Hüllermeier
- Keyword: RLHF, Reward Model, Alignment
-
Understanding the Performance Gap in Preference Learning: A Dichotomy of RLHF and DPO
- Ruizhe Shi, Minhak Song, Runlong Zhou, Zihan Zhang, Maryam Fazel, Simon Du
- Keyword: DPO, RLHF, Reward Model, Preference, Optimization
-
DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding
- Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu
- Keyword: DPO, RLHF, Preference, Alignment, Optimization
-
The Personality Illusion: Revealing Dissociation Between Self-Reports & Behavior in LLMs
- Pengrui Han, Rafal Kocielnik, Peiyang Song, Ramit Debnath, Dean Mobbs, Anima Anandkumar, R. Michael Alvarez
- Keyword: RLHF, Alignment, LLM
-
Distributionally Robust Reinforcement Learning with Human Feedback
- Debmalya Mandal, Paulius Sasnauskas, Goran Radanovic
- Keyword: DPO, RLHF, Reward Model, Preference, LLM
-
Automatically Finding Reward Model Biases
- Atticus Wang, Iván Arcuschin, Arthur Conmy
- Keyword: Reward Model, LLM, Reinforcement Learning, Human Feedback
-
- Zichao Li, Jie Lou, Fangchen Dong, Zhiyuan Fan, Mengjie Ren, Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, Yaojie Lu, XingYu
- Keyword: RLHF, LLM, Optimization, Reinforcement Learning
-
Convex Optimization for Alignment and Preference Learning on a Single GPU
- Miria Feng, Mert Pilanci
- Keyword: DPO, RLHF, Preference, Alignment, LLM
-
MMKU-Bench: A Multimodal Update Benchmark for Diverse Visual Knowledge
- Baochen Fu, Yuntao Du, Cheng Chang, Baihao Jin, Wenzhi Deng, Muhao Xu, Hongmei Yan, Weiye Song, Yi Wan
- Keyword: RLHF, Multimodal, Reinforcement Learning, Human Feedback
-
Pushing Forward Pareto Frontiers of Proactive Agents with Behavioral Agentic Optimization
- Yihang Yao, Zhepeng Cen, Haohong Lin, Shiqi Liu, Zuxin Liu, Jiacheng Zhu, Zhang-Wei Hong, Laixi Shi, Ding Zhao
- Keyword: LLM, Reinforcement Learning, Human Feedback
-
Unbiased Alignment for Large Language Models with Noisy Preferences
- Jialiang Wang, Xianming Liu, Xiong Zhou, Hui Liu, Haoliang Li
- Keyword: DPO, Reward Model, Preference, Alignment, Optimization
-
Unbiased Principles, Robust Rewards
- Qingnan Ren, Zhen Fang, Shiting Huang, Yu Zeng, Lin Chen, Zehui Chen, Feng Zhao
- Keyword: RLHF, Reward Model, Reinforcement Learning, Human Feedback
-
The Secret Engine Behind RLHF: It's Contrastive Learning All Along
- Xufei Lv, Kehai Chen, Haoyuan Sun, Xuefeng Bai, Min Zhang, Houde Liu
- Keyword: DPO, RLHF, Preference, Alignment, LLM
-
When Distance Distracts: Representation Distance Bias in BT-Loss for Reward Models
- Tong Xie, Ching-Yuan Bai, Yuanhao Ban, Yunqi Hong, Haoyu Li, Cho-Jui Hsieh
- Keyword: RLHF, Reward Model, Alignment, LLM
-
Multi-Objective Preference Optimization: Improving Human Alignment of Generative Models
- Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen
- Keyword: DPO, RLHF, Preference, Alignment, Safety
-
TUR-DPO: Topology- and Uncertainty-Aware Direct Preference Optimization
- Abdulhady Abas, Fatemeh Daneshfar, Seyedali Mirjalili, Mourad Oussalah
- Keyword: DPO, PPO, RLHF, Preference, Multimodal
-
Asymptotic Universal Alignment: A New Alignment Framework via Test-Time Scaling
- Yang Cai, Weiqiang Zheng
- Keyword: PPO, Preference, Alignment, LLM, Nash
-
Reward Modeling from Natural Language Human Feedback
- Zongqi Wang, Rui Wang, Yuchuan Wu, Yiyao Yu, Pinyi Zhang, Shaoning Sun, Yujiu Yang, Yongbin Li
- Keyword: Reward Model, Preference, Reinforcement Learning, Human Feedback
-
Efficient Preference Poisoning Attack on Offline RLHF
- Chenye Yang, Weiyu Xu, Lifeng Lai
- Keyword: DPO, RLHF, Preference, Optimization, Reinforcement Learning
-
Position: Agentic Safety is an Epistemic Property, Not a Behavioral One
- Charles Wang, Keir Dorchen, Peter Jin
- Keyword: RLHF, Preference, Alignment, Safety, Optimization
-
Position: Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences
- Cristina Garbacea
- Keyword: RLHF, Reward Model, Preference, Safety, Reinforcement Learning
-
- Yucong Huang, Xiucheng Li, Kaiqi Zhao, Jing Li
- Keyword: PPO, RLHF, Preference, Alignment, Nash
-
Factored Causal Representation Learning for Robust Reward Modeling in RLHF
- Yupei Yang, Lin Yang, Wanxi Deng, Lin Qu, Fan Feng, Biwei Huang, Shikui Tu, Lei Xu
- Keyword: RLHF, Reward Model, Preference, LLM, Reinforcement Learning
-
- Jabin Koo, Hoyoung Kim, Minwoo Jang, Jungseul Ok
- Keyword: RLHF, Reward Model, Preference, Alignment, LLM
-
Online Compatible Reward Identification from Preference Feedback
- Simone Drago, Marco Mussi, Alberto Maria Metelli
- Keyword: Preference, Safety, Reinforcement Learning, Human Feedback
-
$f$-Divergence Regularized RLHF: Two Tales of Sampling and Unified Analyses
- Di Wu, Chengshuai Shi, Jing Yang, Cong Shen
- Keyword: RLHF, Reinforcement Learning, Human Feedback
-
- Dongyoon Hahm, Dylan Hadfield-Menell, Kimin Lee
- Keyword: RLHF, Reward Model, Preference, Alignment, LLM
-
IRPM: Intergroup Relative Preference Modeling for Pointwise Generative Reward Models
- Haonan Song, Qingchen Xie, Huan Zhu, Feng Xiao, Luxi Xing, Liu Kang, Fuzhen Li, Zhiyong Zheng, Feng Jiang, Ziheng Li, Kun Yan, Qingyi Si, Yanghua Xiao, Hongcheng Guo, Fan Yang
- Keyword: RLHF, Reward Model, Preference, Reinforcement Learning, Human Feedback
-
Implicit Preference Alignment for Human Image Animation
- Yuanzhi Wang, Xuhua Ren, Jiaxiang Cheng, Bing Ma, Kai Yu, Tianxiang Zheng, Qinglin Lu, Zhen Cui
- Keyword: Preference, Alignment, Optimization, Reinforcement Learning, Human Feedback
-
Multilingual Safety Alignment Via Sparse Weight Editing
- Jiaming Liang, Zhaoxin Wang, Handing Wang
- Keyword: RLHF, Alignment, Safety, LLM, Reinforcement Learning
-
- Johannes Ackermann, Michael Noukhovitch, Takashi Ishida, Masashi Sugiyama
- Keyword: RLHF, Reward Model, LLM, Reinforcement Learning, Human Feedback
-
Graph-Preference Learning: Debiasing Network-Sampled Human Feedback for Target Welfare Estimation
- Guangrui Fan, DanDan Liu, Aznul Sabri, Pan Lihu
- Keyword: DPO, RLHF, Reward Model, Preference
-
COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space
- Yao Luan, Ni Mu, Hanfei Ge, Yiqin Yang, Bo Xu, Qing-Shan Jia
- Keyword: Human Feedback
-
Optimal Transport for Reward Modeling from Noisy Feedback
- Eric Wang, Licheng Pan, Haocheng Yang, Yunsheng Lu, Yongqi Tong, Yinuo Wang, Shijian Wang, Zhixuan Chu, Lei Shen, Haoxuan Li, Yuan Lu
- Keyword: RLHF, Reward Model, Preference, Reinforcement Learning, Human Feedback
-
Reliability-Aware LLM Alignment from Inconsistent Human Feedback
- Jingyi Huang, Ruohan Zong, Yujun Feng, Liran Ma, Lanyu Shang, Yang Zhang
- Keyword: DPO, RLHF, Preference, Alignment, LLM
-
Position: We Need Large Language Models Optimized For Our Well-Being
- Ashton Anderson, Harsh Kumar, Louis Tay, Karina Vold
- Keyword: RLHF, Preference, Optimization, Reinforcement Learning, Human Feedback
-
Implicit Safety Alignment from Crowd Preferences
- Qian Lin, Daniel S Brown
- Keyword: RLHF, Reward Model, Preference, Safety, Reinforcement Learning
-
Contrastive Weak-to-Strong Generalization
- Houcheng Jiang, Junfeng Fang, Jiaxin Wu, Tianyu Zhang, Chen Gao, Xiang Wang, Xiangnan He, Yang Deng
- Keyword: Reward Model, Alignment, LLM, Human Feedback
-
Distortion of AI Alignment Revisited: RLHF is a Decent Utilitarian Aligner
- Kazusato Oko, Annie Ulichney, Nika Haghtalab, Han Bao
- Keyword: RLHF, Preference, Reinforcement Learning, Human Feedback
-
- Yuan Sui, Bryan Hooi
- Keyword: LLM, Optimization, Human Feedback
-
Unifying Adversarial Robustness and Training Across Text Scoring Models
- Manveer Tamber, Hosna Oyarhoseini, Jimmy Lin
- Keyword: PPO, RLHF, Reward Model, LLM
-
ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning
- Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pasztor, Andreas Krause
- Keyword: RLHF, Preference, Alignment, LLM, Reinforcement Learning
-
- Mengyang Li, Shuang Liu, Zhong Zhang
- Keyword: DPO, RLHF, Preference, Optimization
-
The Sign Estimator: Preference Modeling for LLM Alignment under Heterogeneity
- Aymane El Gadarri, Ali Aouad, Vivek Farias
- Keyword: RLHF, Reward Model, Preference, Alignment, LLM
-
Leveraging Machine Unlearning for Cost-Efficient Preference Alignment
- XiaoHua Feng, Yuyuan Li, HuWei Ji, Li Zhang, Jiaming Zhang, Tianyu Du, Chaochao Chen
- Keyword: Preference, Alignment, LLM, Optimization, Reinforcement Learning
-
Regularization in the Axiomatic Approach to Learning from Human Preferences
- Ezgi Korkmaz
- Keyword: RLHF, Preference, Reinforcement Learning, Human Feedback
-
- Wenxuan Zhou, Shujian Zhang, Brice Magdalou, John Lambert, Ehsan Amid, Richard Nock, Andrew Hard
- Keyword: DPO, PPO, RLHF, Reward Model, Preference
-
Conditional Equivalence of DPO and RLHF: Assumptions, Failure Modes, and Provable Alignment
- Yonggang Zhang, Zhiqin Yang, Wei Xue, Dong Fang, Bo Han, Yike Guo
- Keyword: DPO, RLHF, Preference, Alignment, Optimization
-
A Regret Minimization Framework on Preference Learning in Large Language Models
- Suhwan Kim, Taehyun Cho, Youngsoo Jang, Geon-Hyeong Kim, Yu Jin Kim, Moontae Lee, Jungwoo Lee
- Keyword: RLHF, Preference, Optimization, Reinforcement Learning, Human Feedback
-
Position: Measuring Human Preferences in RLHF is a Social Science Problem
- Bijean Ghafouri, Eun Cheol Choi, Priyanka Dey, Emilio Ferrara
- Keyword: RLHF, Preference, Alignment
-
Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling
- Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo
- Keyword: Reward Model, Preference, LLM, Optimization, Reinforcement Learning
-
Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
- Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
- Keyword: Alignment Bias, Safety, Interpretability
-
What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data
- Rajiv Movva, Smitha Milli, Sewon Min, Emma Pierson
- Keyword: Sparse Autoencoders, Interpretable Data Curation, Reward Hacking, Feature Attribution
- Code: Official
-
Towards Efficient Online Exploration for Reinforcement Learning with Human Feedback
- Gen Li, Yuling Yan
- Keyword: Online RL, Multi-armed Bandit, LLMs
-
OpenRLHF: A Ray-based Easy-to-use, Scalable and High-performance RLHF Framework
- Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Wenkai Fang, Xianyu, Yu Cao, Haotian Xu, Yiming Liu
- Keyword: Framework
- Code: Official
-
Language Models Learn to Mislead Humans via RLHF
- Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Sam Bowman, He He, Shi Feng
- Keyword: Open-ended Task, Human Reward, Alignment Method, LLMs
-
A Simple and Effective Reinforcement Learning Method for Text-to-Image Diffusion Fine-tuning
- Shashank Gupta, Chaitanya Ahuja, Tsung-Yu Lin, Sreya Dutta Roy, Harrie Oosterhuis, Maarten de Rijke, and Satya Narayan Shukla
- Keyword: Diffusion Model, REINFORCE, PPO
-
Differential Information: An Information-Theoretic Perspective on Preference Optimization
- Yunjae Won, Hyunji Lee, Hyeonbin Hwang, Minjoon Seo
- Keyword: Preference Optimization, Information-Theoretic Analysis, Log-Ratio Reward Parameterization, Data Distribution, Log-Likelihood Displacement
-
Generalist Reward Models: Found Inside Large Language Models
- Yi-Chen Li, Tian Xu, Yang Yu, Xuqin Zhang, Xiong-Hui Chen, Zhongxiang Ling, Ningjing Chao, Lei Yuan, Zhi-Hua Zhou
- Keyword: Offline Inverse RL, LLM-as-a-judge, Training-free, Alignment
-
A Unified Pairwise Framework for RLHF: Bridging Generative Reward Modeling and Policy Optimization
- Wenyuan Xu, Xiaochen Zuo, Chao Xin, Yu Yue, Lin Yan, Yonghui Wu
- Keyword: Generative Pairwise Reward Model, Policy Optimization, Framework
-
Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
- Wei Shen, Guanlin Liu, Zheng Wu, Ruofei Zhu, Qingping Yang, Chao Xin, Yu Yue, Lin Yan
- Keyword: Data Scaling, Reward Hacking, LLMs
-
RLTHF: Targeted Human Feedback for LLM Alignment
- Yifei Xu, Tusher Chakraborty, Emre Kıcıman, Bibek Aryal, Eduardo Rodrigues, Srinagesh Sharma, Roberto Estevao, Maria Angels de Luis Balaguer, Jessica Wolk, Rafael Padilha, Leonardo Nunes, Shobana Balakrishnan, Songwu Lu, Ranveer Chandra
- Keyword: Human-AI Hybrid Framework, Efficient, Alignment, LLMs
-
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models
- Yingshui Tan, Yilei Jiang, Yanshi Li, Jiaheng Liu, Xingyuan Bu, Wenbo Su, Xiangyu Yue, Xiaoyong Zhu, Bo Zheng
- Keyword: Safety, Framework, Adaptive Message-wise Alignment Method, LLMs
-
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment
- Yi-Fan Zhang, Tao Yu, Haochen Tian, Chaoyou Fu, Peiyan Li, Jianshu Zeng, Wulin Xie, Yang Shi, Huanyu Zhang, Junkang Wu, Xue Wang, Yibo Hu, Bin Wen, Fan Yang, Zhang Zhang, Tingting Gao, Di Zhang, Liang Wang, Rong Jin, Tieniu Tan
- Keyword: Critique-based Reward Model, Dynamic Reward, Dataset
- Code: Official
-
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
- Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
- Keyword: Test-Time Optimization, Preference Learning, Iterative Feedback
- Code: Official
-
Segmenting Text and Learning Their Rewards for Improved RLHF in Language Model
- Yueqin Yin, Shentao Yang, Yujia Xie, Ziyi Yang, Yuting Sun, Hany Awadalla, Weizhu Chen, and Mingyuan Zhou
- Keyword: Segment-level Reward Model, Dense Reward RLHF Framework, Improved PPO training for LLMs
- Code: Official
-
REINFORCE++: A Simple and Efficient Approach for Aligning Large Language Models
- Jian Hu
- Keyword: Efficient, Alignment, Reinforcement Learning
- Code: Official
-
DPO Meets PPO: Reinforced Token Optimization for RLHF
- Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, Liwei Wang
- Keyword: Token-wise Reward, DPO, PPO, RLHF
- Code: Official
-
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs
- Shenao Zhang, Zhihan Liu, Boyi Liu, Yufeng Zhang, Yingxiang Yang, Yongfei Liu, Liyu Chen, Tao Sun, Zhaoran Wang
- Keyword: Reward-Augmented Data, DPO, LLMs
- Code: Official
-
The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models
- Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen
- Keyword: Reward Model Evaluation, Accuracy Paradox, LLM Alignment
- Code: Official
-
Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback
- Jiaming Ji, Jiayi Zhou, Hantao Lou, Boyuan Chen, Donghai Hong, Xuyao Wang, Wenqi Chen, Kaile Wang, Rui Pan, Jiahao Li, Mohan Wang, Josef Dai, Tianyi Qiu, Hua Xu, Dong Li, Weipeng Chen, Jun Song, Bo Zheng, Yaodong Yang
- Keyword: Multi-modality Alignment, Dataset, Training-evaluation Framework
- Code: Official
-
REvolve: Reward Evolution with Large Language Models using Human Feedback
- Rishi Hazra, Alkis Sygkounas, Andreas Persson, Amy Loutfi, Pedro Zuidberg Dos Martires
- Keyword: Improved Reward Model with LLMs, Framework
- Code: Official
-
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference
- Qining Zhang, Lei Ying
- Keyword: Reward inference-free RLHF, Zeroth-order optimization, Policy gradient
-
Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment
- Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong
- Keyword: Joint Reward and Policy, Efficiency, Framework
-
MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
- Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, Hua Wu
- Keyword: Macro action-level Reward, Efficiency, Framework
- Code: Official
-
Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
- Shang Liu, Yu Pan, Guanting Chen, and Xiaocheng Li
- Keyword: Reward Modeling, Ordinal Feedback, Human Preference Dataset
- Code: Official
-
Aligning Few-Step Diffusion Models with Dense Reward Difference Learning
- Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Bo Du, Dacheng Tao
- Keyword: Diffusion Models, Text-to-Image, Alignment, Reinforcement Learning
- Code: Official
-
HybridFlow: A Flexible and Efficient RLHF Framework
- Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu
- Keyword: Flexible, Efficient, RLHF framework
- Code: Official
-
ALaRM: Align Language Models via Hierarchical Rewards Modeling
- Yuhang Lai, Siyuan Wang, Shujun Liu, Xuanjing Huang, Zhongyu Wei
- Keyword: Hierarchical Reward, Open Text Generation Tasks
- Code: Official
-
TLCR: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback
- Eunseop Yoon, Hee Suk Yoon, SooHwan Eom, Gunsoo Han, Daniel Wontae Nam, Daejin Jo, Kyoung-Woon On, Mark A. Hasegawa-Johnson, Sungwoong Kim, Chang D. Yoo
- Keyword: Token-Level Continuous Reward, RLHF
- Code: Official
-
Aligning Large Multimodal Models with Factually Augmented RLHF
- Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, Kurt Keutzer, Trevor Darrell
- Keyword: Factually Augmented RLHF, Vision & Language, Human Preference Dataset
- Code: Official
-
Direct Large Language Model Alignment Through Self-Rewarding Contrastive Prompt Distillation
- Aiwei Liu, Haoping Bai, Zhiyun Lu, Xiang Kong, Simon Wang, Jiulong Shan, Meng Cao, Lijie Wen
- Keyword: Without Human Preference Data, Self-Reward, DPO
- Code: Official
-
- Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang
- Keyword: User Preference, Multi-objective Reward Model, Rejection Sampling Finetuning
- Code: Official
-
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
- Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, Sara Hooker
- Keyword: Online RL Optimization, Low Computational Cost
- Code: Official
-
- Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, Ji-Rong Wen
- Keyword: Token-level Reward, LLM
- Code: Official
-
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
- Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Ren Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, Sushant Prakash
- Keyword: RL from AI Feedback
- Code: official
-
Principled Penalty-based Methods for Bilevel Reinforcement Learning and RLHF
- Han Shen, Zhuoran Yang, Tianyi Chen
- Keyword: Bilevel optimization
- Code: official
-
Dense Reward for Free in Reinforcement Learning from Human Feedback
- Alex James Chan, Hao Sun, Samuel Holt, Mihaela Van Der Schaar
- Keyword: reward shaping, RLHF
- Code: official
-
A Minimaximalist Approach to Reinforcement Learning from Human Feedback
- Gokul Swamy, Christoph Dann, Rahul Kidambi, Steven Wu, Alekh Agarwal
- Keyword: Minimax Winner, Self-Play Preference Optimization
- Code: official
-
- Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, Tat-Seng Chua
- Keyword: Multimodal Large Language Models, Hallucination Problem, Reinforcement Learning from Human Feedback
- Code: official
-
RLHF Workflow: From Reward Modeling to Online RLHF
- Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
- Keyword: Online Iterative RLHF, Preference Modeling, Large Language Models
- Code: official
-
MaxMin-RLHF: Towards equitable alignment of large language models with diverse human preferences
- Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang
- Keyword: mixture of preference distributions, MaxMin alignment objective
- Code: official
-
Dataset Reset Policy Optimization for RLHF
- Jonathan D. Chang, Wenhao Zhan, Owen Oertell, Kianté Brantley, Dipendra Misra, Jason D. Lee, Wen Sun
- Keyword: Dataset Reset Policy Optimization
- Code: official
-
A Dense Reward View on Aligning Text-to-Image Diffusion with Preference
- Shentao Yang, Tianqi Chen, Mingyuan Zhou
- Keyword: RLHF for Text-to-Image Generation, Dense Reward Improvement of DPO, Efficient Alignment
- Code: official
-
Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
- Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu
- Keyword: Self-Play Fine-Tuning
- Code: official
-
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs
- Shreyas Chaudhari, Pranjal Aggarwal, Vishvak Murahari, Tanmay Rajpurohit, Ashwin Kalyan, Karthik Narasimhan, Ameet Deshpande, Bruno Castro da Silva
- Keyword: RLHF, Oracular Reward, Reward Model Analysis, Survey
-
- Ziyi Zhang, Sen Zhang, Yibing Zhan, Yong Luo, Yonggang Wen, Dacheng Tao
- Keyword: Diffusion Models, Alignment, Reinforcement Learning, RLHF, Reward Overoptimization, Primacy Bias
- Code: official
-
On Diversified Preferences of Large Language Model Alignment
- Dun Zeng, Yong Dai, Pengyu Cheng, Tianhao Hu, Wanshun Chen, Nan Du, Zenglin Xu
- Keyword: Aligning shared preference, Reward modeling metrics, LLM
- Code: official
-
Aligning Crowd Feedback via Distributional Preference Reward Modeling
- Dexun Li, Cong Zhang, Kuicai Dong, Derrick Goh Xin Deik, Ruiming Tang, Yong Liu
- Keyword: RLHF, Preference distribution, Aligning, LLM
-
Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization
- Zhanhui Zhou, Jie Liu, Chao Yang, Jing Shao, Yu Liu, Xiangyu Yue, Wanli Ouyang, Yu Qiao
- Keyword: Multi-objective RLHF without reward modeling, DPO
- Code: official
-
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
- Zhanhui Zhou, Jie Liu, Zhichen Dong, Jiaheng Liu, Chao Yang, Wanli Ouyang, Yu Qiao
- Keyword: LLM inference-time attack, DPO, Producing harmful LLMs without training
- Code: official
-
A Theoretical Analysis of Nash Learning from Human Feedback under General KL-Regularized Preference
- Chenlu Ye, Wei Xiong, Yuheng Zhang, Nan Jiang, Tong Zhang
- Keyword: Game-based RLHF, Nash Learning, Alignment under reward-model-free oracle
-
Mitigating the Alignment Tax of RLHF
- Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng Liu, Jipeng Zhang, Rui Pan, Haoxiang Wang, Wenbin Hu, Hanning Zhang, Hanze Dong, Renjie Pi, Han Zhao, Nan Jiang, Heng Ji, Yuan Yao, Tong Zhang
- Keyword: RLHF, Alignment tax, Catastrophic forgetting
-
Training Diffusion Models with Reinforcement Learning
- Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, Sergey Levine
- Keyword: reinforcement learning, RLHF, diffusion models
- Code: official
-
AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model
- Zibin Dong, Yifu Yuan, Jianye Hao, Fei Ni, Yao Mu, Yan Zheng, Yujing Hu, Tangjie Lv, Changjie Fan, Zhipeng Hu
- Keyword: Reinforcement learning, Diffusion models, RLHF, Preference aligning
- Code: official
-
Transforming and Combining Rewards for Aligning Large Language Models
- Zihao Wang, Chirag Nagpal, Jonathan Berant, Jacob Eisenstein, Alex D'Amour, Sanmi Koyejo, Victor Veitch
- Keyword: RLHF, Aligning, LLM
-
Parameter Efficient Reinforcement Learning from Human Feedback
- Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin, Zhang Chen, Zac Yu, Jarvis Jin, Simral Chaudhary, Roman Komarytsia, Christiane Ahlheim, Yonghao Zhu, Bowen Li, Saravanan Ganesh, Bill Byrne, Jessica Hoffmann, Hassan Mansoor, Wei Li, Abhinav Rastogi, Lucas Dixon
- Keyword: RLHF, Parameter Efficient method, Low Computational Cost, LLM, VLM
-
Improving Reinforcement Learning from Human Feedback with Efficient Reward Model Ensemble
- Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, Chuang Gan
- Keyword: RLHF, Reward Ensemble, Efficient Ensemble Method
-
RIME: Robust Preference-based Reinforcement Learning with Noisy Human Preferences
- Jie Cheng, Gang Xiong, Xingyuan Dai, Qinghai Miao, Yisheng Lv, Fei-Yue Wang
- Keyword:
- Code: official
-
The Trickle-down Impact of Reward (In-)consistency on RLHF
- Lingfeng Shen, Sihao Chen, Linfeng Song, Lifeng Jin, Baolin Peng, Haitao Mi, Daniel Khashabi, Dong Yu
- Keyword: Reward model, RLHF, Reward hacking
- Code: official
-
A General Theoretical Paradigm to Understand Learning from Human Preferences
- Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, Rémi Munos
- Keyword: RLHF, Pairwise Preference
-
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
- Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, Hannaneh Hajishirzi
- Keyword: RLHF, Sentence-level Reward, LLM
- Code: official
-
Preference-grounded Token-level Guidance for Language Model Fine-tuning
- Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, Mingyuan Zhou
- Keyword: RLHF, Token-level Training Guidance, Alternate/Online Training Framework, Minimalist Training Objectives
- Code: official
-
- Yihao Feng*, Shentao Yang*, Shujian Zhang, Jianguo Zhang, Caiming Xiong, Mingyuan Zhou, Huan Wang
- Keyword: RLHF, Generalized Reward Function Learning, Reward Function Utilization, Task-oriented Dialogue System, Learning-to-rank
- Code: official
-
Inverse Preference Learning: Preference-based RL without a Reward Function
- Joey Hejna, Dorsa Sadigh
- Keyword: Inverse Preference Learning, without reward model
- Code: official
-
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
- Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy S. Liang, Tatsunori B. Hashimoto
- Keyword: RLHF, Simulation Framework
- Code: official
-
Adversarial Preference Optimization
- Pengyu Cheng, Yifan Yang, Jian Li, Yong Dai, Nan Du
- Keyword: RLHF, GAN, Adversarial Games
- Code: official
-
- Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, Tong Zhang
- Keyword: RLHF, Iterative DPO, Mathematical foundation
-
Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration
- Viraj Mehta, Vikramjeet Das, Ojash Neopane, Yijia Dai, Ilija Bogunovic, Jeff Schneider, Willie Neiswanger
- Keyword: RLHF, sample efficiency, exploration
-
Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing
- Feiyang Han, Yimin Wei, Zhaofeng Liu, Yanxing Qi
- Keyword: RLHF, AB testing, RLSF
-
- Ben Pikus, Will LeVine, Tony Chen, Sean Hendryx
- Keyword: RLHF, OOD, Distribution Shift
-
Data-Efficient Alignment of Large Language Models with Human Feedback Through Natural Language
- Di Jin, Shikib Mehri, Devamanyu Hazarika, Aishwarya Padmakumar, Sungjin Lee, Yang Liu, Mahdi Namazifar
- Keyword: RLHF, data-efficient, Alignment
-
- Sarah Pan, Vladislav Lialin, Sherin Muckatira, Anna Rumshisky
- Keyword: RLHF, reasoning
-
Direct Preference-based Policy Optimization without Reward Modeling
- Gaon An, Junhyeok Lee, Xingdong Zuo, Norio Kosaka, Kyung-Min Kim, Hyun Oh Song
- Keyword: RLHF without reward modeling, Contrastive learning, Offline reinforcement learning
-
Eureka: Human-Level Reward Design via Coding Large Language Models
- Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, Anima Anandkumar
- Keyword: LLM based, reward functions design
-
Safe RLHF: Safe Reinforcement Learning from Human Feedback
- Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
- Keyword: Safe RL, LLM fine-tuning
-
Quality Diversity through Human Feedback
- Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, Joel Lehman
- Keyword: Quality Diversity, Diffusion model
-
- Ziniu Li, Tian Xu, Yushun Zhang, Yang Yu, Ruoyu Sun, Zhi-Quan Luo
- Keyword: computational efficiency, variance-reduction technique
-
Tuning computer vision models with task rewards
- André Susano Pinto, Alexander Kolesnikov, Yuge Shi, Lucas Beyer, Xiaohua Zhai
- Keyword: Reward tuning in Computer Vision
-
The Wisdom of Hindsight Makes Language Models Better Instruction Followers
- Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, Joseph E. Gonzalez
- Keyword: Hindsight Instruction Relabeling, RLHF System, No Value Network Required
- Code: official
-
Language Instructed Reinforcement Learning for Human-AI Coordination
- Hengyuan Hu, Dorsa Sadigh
- Keyword: Human-AI coordination, Human preference alignment, Instruction conditioned RL
-
Aligning Language Models with Offline Reinforcement Learning from Human Feedback
- Jian Hu, Li Tao, June Yang, Chandler Zhou
- Keyword: Decision Transformer-based Alignment, Offline Reinforcement Learning, RLHF System
-
Preference Ranking Optimization for Human Alignment
- Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li and Houfeng Wang
- Keyword: Supervised Human Preference Alignment, Preference Ranking Extension
- Code: official
-
Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation
- Patrick Fernandes, Aman Madaan, Emmy Liu, António Farinhas, Pedro Henrique Martins, Amanda Bertsch, José G. C. de Souza, Shuyan Zhou, Tongshuang Wu, Graham Neubig, André F. T. Martins
- Keyword: Natural Language Generation, Human Feedback Integration, Feedback Formalization and Taxonomy, AI Feedback and Principles-Based Judgments
-
- OpenAI
- Keyword: Large-scale multimodal model, Transformer-based model, Fine-tuned using RLHF
- Code: official
- Dataset: DROP, WinoGrande, HellaSwag, ARC, HumanEval, GSM8K, MMLU, TruthfulQA
-
RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
- Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, Tong Zhang
- Keyword: Rejection Sampling Finetuning, Alternative to PPO, Diffusion Model
- Code: official
-
RRHF: Rank Responses to Align Language Models with Human Feedback without tears
- Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang
- Keyword: New paradigm for RLHF
- Code: official
-
Few-shot Preference Learning for Human-in-the-Loop RL
- Joey Hejna, Dorsa Sadigh
- Keyword: Preference Learning, Interactive Learning, Multi-task Learning, Expanding the pool of available data by viewing human-in-the-loop RL
- Code: official
-
Better Aligning Text-to-Image Models with Human Preference
- Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, Hongsheng Li
- Keyword: Diffusion Model, Text-to-Image, Aesthetic
- Code: official
-
ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation
- Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, Yuxiao Dong
- Keyword: General-purpose text-to-Image human preference RM, Evaluating Text-to-Image Generative Models
- Code: official
- Dataset: COCO, DiffusionDB
-
Aligning Text-to-Image Models using Human Feedback
- Kimin Lee, Hao Liu, MoonKyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Shixiang Shane Gu
- Keyword: Text-to-Image, Stable diffusion model, Reward function that predicts human feedback
-
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
- Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan
- Keyword: Visual Foundation Models, Visual ChatGPT
- Code: official
-
Pretraining Language Models with Human Preferences (PHF)
- Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez
- Keyword: Pretraining, offline RL, Decision transformer
- Code: official
-
Aligning Language Models with Preferences through f-divergence Minimization (f-DPG)
- Dongyoung Go, Tomasz Korbak, Germán Kruszewski, Jos Rozen, Nahyeon Ryu, Marc Dymetman
- Keyword: f-divergence, RL with KL penalties
-
Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons
- Banghua Zhu, Jiantao Jiao, Michael I. Jordan
- Keyword: Pessimistic MLE, Max-entropy IRL
-
The Capacity for Moral Self-Correction in Large Language Models
- Anthropic
- Keyword: Improve moral self-correction capability by increasing RLHF training
- Dataset: BBQ
- Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization (NLPO)
- Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, Yejin Choi
- Keyword: Optimizing language generators with RL, Benchmark, Performant RL algorithm
- Code: official
- Dataset: IMDB, CommonGen, CNN Daily Mail, ToTTo, WMT-16 (en-de), NarrativeQA, DailyDialog
- Scaling Laws for Reward Model Overoptimization
- Leo Gao, John Schulman, Jacob Hilton
- Keyword: Gold reward model trains proxy reward model, Dataset size, Policy parameter size, BoN, PPO
- Improving alignment of dialogue agents via targeted human judgements (Sparrow)
- Amelia Glaese, Nat McAleese, Maja Trębacz, et al.
- Keyword: Information-seeking dialogue agent, Break down the good dialogue into natural language rules, DPC, Interact with the model to elicit violation of a specific rule (Adversarial Probing)
- Dataset: Natural Questions, ELI5, QuALITY, TriviaQA, WinoBias, BBQ
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Deep Ganguli, Liane Lovitt, Jackson Kernion, et al.
- Keyword: Red team language model, Investigate scaling behaviors, Red teaming dataset
- Code: official
- Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning
- Deborah Cohen, Moonkyung Ryu, Yinlam Chow, Orgad Keller, Ido Greenberg, Avinatan Hassidim, Michael Fink, Yossi Matias, Idan Szpektor, Craig Boutilier, Gal Elidan
- Keyword: Real-time, Open-ended dialogue system, Pairs the succinct embedding of the conversation state by language models, CAQL, CQL, BERT
- Quark: Controllable Text Generation with Reinforced Unlearning
- Ximing Lu, Sean Welleck, Jack Hessel, Liwei Jiang, Lianhui Qin, Peter West, Prithviraj Ammanabrolu, Yejin Choi
- Keyword: Fine-tuning the language model on signals of what not to do, Decision Transformer, LLM tuning with PPO
- Code: official
- Dataset: WRITINGPROMPTS, SST-2, WIKITEXT-103
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Yuntao Bai, Andy Jones, Kamal Ndousse, et al.
- Keyword: Harmless assistants, Online mode, Robustness of RLHF training, OOD detection.
- Code: official
- Dataset: TriviaQA, HellaSwag, ARC, OpenBookQA, LAMBADA, HumanEval, MMLU, TruthfulQA
- Teaching language models to support answers with verified quotes (GopherCite)
- Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, Nat McAleese
- Keyword: Generate answers that cite specific evidence, Abstain from answering when unsure
- Dataset: Natural Questions, ELI5, QuALITY, TruthfulQA
- Training language models to follow instructions with human feedback (InstructGPT)
- Long Ouyang, Jeff Wu, Xu Jiang, et al.
- Keyword: Large Language Model, Align Language Model with Human Intent
- Code: official
- Dataset: TruthfulQA, RealToxicityPrompts
- Constitutional AI: Harmlessness from AI Feedback
- Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, et al.
- Keyword: RL from AI feedback (RLAIF), Training a harmless AI assistant through self-improvement, Chain-of-thought style, Control AI behavior more precisely
- Code: official
- Discovering Language Model Behaviors with Model-Written Evaluations
- Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, et al.
- Keyword: Automatically generate evaluations with LMs, More RLHF makes LMs worse, LM-written evaluations are high-quality
- Code: official
- Dataset: BBQ, Winogender Schemas
- Non-Markovian Reward Modelling from Trajectory Labels via Interpretable Multiple Instance Learning
- Joseph Early, Tom Bewley, Christine Evers, Sarvapali Ramchurn
- Keyword: Reward Modelling (RLHF), Non-Markovian, Multiple Instance Learning, Interpretability
- Code: official
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning
- Jongjin Park, Younggyo Seo, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee
- Keyword: Semi-supervised Reward Learning, Preference Data Augmentation, RLHF Efficiency
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning
- Xinran Liang, Katherine Shu, Kimin Lee, Pieter Abbeel
- Keyword: Preference-based RL (PbRL), Exploration, Reward Uncertainty, Feedback Efficiency
- Code: official
- WebGPT: Browser-assisted question-answering with human feedback (WebGPT)
- Reiichiro Nakano, Jacob Hilton, Suchir Balaji, et al.
- Keyword: Model searches the web and provides references, Imitation learning, BC, Long-form questions
- Dataset: ELI5, TriviaQA, TruthfulQA
- Recursively Summarizing Books with Human Feedback
- Jeff Wu, Long Ouyang, Daniel M. Ziegler, Nisan Stiennon, Ryan Lowe, Jan Leike, Paul Christiano
- Keyword: Model trained on smaller tasks to assist humans in evaluating broader tasks, BC
- Dataset: Booksum, NarrativeQA
- Revisiting the Weaknesses of Reinforcement Learning for Neural Machine Translation
- PEBBLE: Feedback-Efficient Interactive Reinforcement Learning via Relabeling Experience and Unsupervised Pre-training
- Kimin Lee, Laura Smith, Pieter Abbeel
- Keyword: Preference-based RL (PbRL), Data Efficiency, Unsupervised Pretraining, Reward Relabeling
- Code: official
- B-Pref: Benchmarking Preference-Based Reinforcement Learning
- Kimin Lee, Laura Smith, Anca Dragan, Pieter Abbeel
- Keyword: Benchmark, Preference-based RL, Simulated Human Feedback, Robustness Evaluation
- Code: official
- Learning to summarize from human feedback
- Fine-Tuning Language Models from Human Preferences
- Scalable agent alignment via reward modeling: a research direction
- Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, Shane Legg
- Keyword: Agent alignment problem, Learn reward from interaction, Optimize reward with RL, Recursive reward modeling
- Code: official
- Env: Atari
- Reward learning from human preferences and demonstrations in Atari
- Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, Dario Amodei
- Keyword: Expert demonstrations, Trajectory preferences, Reward hacking problem, Noise in human labels
- Code: official
- Env: Atari
- Deep TAMER: Interactive Agent Shaping in High-Dimensional State Spaces
- Garrett Warnell, Nicholas Waytowich, Vernon Lawhern, Peter Stone
- Keyword: High-dimensional state, Leverage the input of a human trainer
- Code: third party
- Env: Atari
- Deep reinforcement learning from human preferences
- Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
- Keyword: Explore goals defined by human preferences between pairs of trajectory segments, Learn more complex things than human feedback
- Code: official
- Env: Atari, MuJoCo
- Interactive Learning from Policy-Dependent Human Feedback
- James MacGlashan, Mark K Ho, Robert Loftin, Bei Peng, Guan Wang, David Roberts, Matthew E. Taylor, Michael L. Littman
- Keyword: Decision is influenced by current policy rather than human feedback, Learn from policy-dependent feedback that converges to a local optimum
format:
- [title](codebase link) [links]
- author1, author2, and author3...
- keyword
- experiment environments, datasets or tasks
- Reinforcement Learning from Human Feedback (RLHF) in Notebooks
- Ashwani Kumar
- Keyword: Step-by-step, Video tutorial, Jupyter notebooks, GPT-2, Reward Model, PPO, Pedagogical
- Dataset: stanfordnlp/sst2
- Task: Generating text with positive sentiment
- Env: Google Colab
- veRL: Volcano Engine Reinforcement Learning for LLM
- ByteDance Seed MLSys Team & HKU: Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, Chuan Wu
- Keyword: Flexible, Efficient, RLHF framework
- Tasks: RLHF, Reasoning tasks including math and code.
- OpenRLHF
- OpenRLHF
- Keyword: 70B, RLHF, DeepSpeed, Ray, vLLM
- Task: An Easy-to-use, Scalable and High-performance RLHF Framework (Support 70B+ full tuning & LoRA & Mixtral & KTO).
- Potato
- David Jurgens et al.
- Keyword: Annotation, Human Evaluation, Quality Control, AI-Assisted Labeling
- Task: Portable annotation platform for human evaluation and feedback collection with 20+ annotation types and agent trace evaluation
- PaLM + RLHF - Pytorch
- Phil Wang, Yachine Zahidi, Ikko Eltociear Ashimine, Eric Alcaide
- Keyword: Transformers, PaLM architecture
- Dataset: enwik8
- lm-human-preferences
- following-instructions-human-feedback
- Long Ouyang, Jeff Wu, Xu Jiang, et al.
- Keyword: Large Language Model, Align Language Model with Human Intent
- Dataset: TruthfulQA, RealToxicityPrompts
- Transformer Reinforcement Learning (TRL)
- Leandro von Werra, Younes Belkada, Lewis Tunstall, et al.
- Keyword: Train LLM with RL, PPO, Transformer
- Task: IMDB sentiment
- Transformer Reinforcement Learning X (TRLX)
- Jonathan Tow, Leandro von Werra, et al.
- Keyword: Distributed training framework, T5-based language models, Train LLM with RL, PPO, ILQL
- Task: Fine tuning LLM with RL using provided reward function or reward-labeled dataset
- RL4LMs (A modular RL library to fine-tune language models to human preferences)
- Rajkumar Ramamurthy, Prithviraj Ammanabrolu, Kianté Brantley, Jack Hessel, Rafet Sifa, Christian Bauckhage, Hannaneh Hajishirzi, Yejin Choi
- Keyword: Optimizing language generators with RL, Benchmark, Performant RL algorithm
- Dataset: IMDB, CommonGen, CNN Daily Mail, ToTTo, WMT-16 (en-de), NarrativeQA, DailyDialog
- LaMDA-rlhf-pytorch
- Phil Wang
- Keyword: LaMDA, Attention-mechanism
- Task: Open-source pre-training implementation of Google's LaMDA research paper in PyTorch
- TextRL
- Eric Lam
- Keyword: huggingface's transformer
- Task: Text generation
- Env: PFRL, gym
- minRLHF
- Thomfoster
- Keyword: PPO, Minimal library
- Task: educational purposes
- DeepSpeed-Chat
- Microsoft
- Keyword: Affordable RLHF Training
- Dromedary
- IBM
- Keyword: Minimal human supervision, Self-aligned
- Task: Self-aligned language model trained with minimal human supervision
- FG-RLHF
- Zeqiu Wu, Yushi Hu, Weijia Shi, et al.
- Keyword: Fine-Grained RLHF, providing a reward after every segment, Incorporating multiple RMs associated with different feedback types
- Task: A framework that enables training and learning from reward functions that are fine-grained in density and multiple RMs
- Safe-RLHF
- Xuehai Pan, Ruiyang Sun, Jiaming Ji, et al.
- Keyword: Support popular pre-trained models, Large human-labeled dataset, Multi-scale metrics for safety constraints verification, Customized parameters
- Task: Constrained Value-Aligned LLM via Safe RLHF
- VinePPO
- Amirhossein Kazemnejad, Milad Aghajohari, et al.
- Keyword: Performant Implementation of RL algorithms for Reasoning, PPO, DPO, RestEM, Monte Carlo Value Estimation
- Task: Reasoning tasks including MATH and GSM8K
format:
- [title](dataset link) [links]
- author1, author2, and author3...
- keyword
- experiment environments or tasks
- HH-RLHF
- Ben Mann, Deep Ganguli
- Keyword: Human preference dataset, Red teaming data, machine-written
- Task: Open-source dataset for human preference data about helpfulness and harmlessness
- Stanford Human Preferences Dataset (SHP)
- Kawin Ethayarajh, Heidi Zhang, Yizhong Wang, Dan Jurafsky
- Keyword: Naturally occurring and human-written dataset, 18 different subject areas
- Task: Intended to be used for training RLHF reward models
- PromptSource
- Stephen H. Bach, Victor Sanh, Zheng-Xin Yong et al.
- Keyword: Prompted English datasets, Mapping a data example into natural language
- Task: Toolkit for creating, sharing and using natural language prompts
- Structured Knowledge Grounding (SKG) Resources Collections
- Tianbao Xie, Chen Henry Wu, Peng Shi et al.
- Keyword: Structured Knowledge Grounding
- Task: Collection of datasets related to structured knowledge grounding
- The Flan Collection
- Shayne Longpre, Le Hou, Tu Vu et al.
- Task: Collection compiles datasets from Flan 2021, P3, Super-Natural Instructions
- rlhf-reward-datasets
- Yiting Xie
- Keyword: Machine-written dataset
- webgpt_comparisons
- OpenAI
- Keyword: Human-written dataset, Long form question answering
- Task: Train a long form question answering model to align with human preferences
- summarize_from_feedback
- OpenAI
- Keyword: Human-written dataset, summarization
- Task: Train a summarization model to align with human preferences
- Dahoas/synthetic-instruct-gptj-pairwise
- Dahoas
- Keyword: Human-written dataset, synthetic dataset
- Stable Alignment - Alignment Learning in Social Games
- Ruibo Liu, Ruixin (Ray) Yang, Qiang Peng
- Keyword: Interaction data used for alignment training, Run in Sandbox
- Task: Train on the recorded interaction data in simulated social games
- LIMA
- Meta AI
- Keyword: without any RLHF, few carefully curated prompts and responses
- Task: Dataset used for training the LIMA model
- [OpenAI] ChatGPT: Optimizing Language Models for Dialogue
- [Hugging Face] Illustrating Reinforcement Learning from Human Feedback (RLHF)
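- [ZhiHu] The Road to AGI: Technical Essentials of Large Language Models (LLMs)
- [ZhiHu] Emergent Abilities of Large Language Models: Phenomena and Explanations
- [ZhiHu] PPO in Practice on the Chinese hh-rlhf Dataset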
- [W&B Fully Connected] Understanding Reinforcement Learning from Human Feedback (RLHF)
- [DeepMind] Learning through human feedback
- [Notion] A Deep Dive into the Emergent Abilities of Language Models
- [Notion] Dissecting and Tracing the Origins of GPT-3.5's Capabilities
- [gist] Reinforcement Learning for Language Models
- [YouTube] John Schulman - Reinforcement Learning from Human Feedback: Progress and Challenges
- [OpenAI / Arize] OpenAI on Reinforcement Learning With Human Feedback
- [Encord] Guide to Reinforcement Learning from Human Feedback (RLHF) for Computer Vision
- [hijkzzz] A Survey of Reinforcement Learning from Human Feedback (RLHF)
- [Weixun Wang] Overview of RL(HF)+LLM
- [Lilian Weng] Reward Hacking in Reinforcement Learning
- Reinforcement Learning from Human Feedback by Nathan Lambert
- Reinforcement Learning for Business
- The RLHF Book
Our purpose is to make this repo even better. If you are interested in contributing, please refer to HERE for contribution instructions.
Awesome RLHF is released under the Apache 2.0 license.

