【Coursera GenAI with LLM】 Week 3 Reinforcement Learning from Human Feedback Class Notes


Helpful? Honest? Harmless? Make sure the AI responds in those three ways.

If it does not, we can use RLHF to reduce the toxicity of the LLM.

Reinforcement learning is a type of machine learning in which an agent learns to make decisions related to a specific goal by taking actions in an environment, with the objective of maximizing some notion of a cumulative reward. RLHF can also help build personalized LLMs.
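A minimal sketch of how this RL vocabulary maps onto LLM fine-tuning; all names (`policy_llm`, `reward_model`, `generate`, `score`, `rl_update`) are illustrative placeholders, not APIs from the course:

```python
# Minimal sketch of the RL framing for LLMs (all names are illustrative).
# Agent  = the LLM (its weights are the policy)
# Action = choosing the next token / completion
# State  = the prompt plus the tokens generated so far
# Reward = score assigned to the generated completion

def rlhf_step(policy_llm, reward_model, prompts, rl_update):
    for prompt in prompts:
        completion = policy_llm.generate(prompt)           # agent acts in its environment
        reward = reward_model.score(prompt, completion)    # scalar feedback for the action
        rl_update(policy_llm, prompt, completion, reward)  # update weights to maximize reward
```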

RLHF cycle: iterate until the reward score is high enough:

  1. Select an instruct model and define your model alignment criterion (e.g., helpfulness)
  2. Obtain human feedback: a labeler workforce ranks the completions for each prompt
  3. Convert the rankings into pairwise training data for the reward model
  4. Train the reward model to predict the preferred completion from {y_j, y_k} for prompt x
  5. Use the reward model as a binary classifier to automatically provide a reward value for each prompt-completion pair (see the sketch after this list)
    the lower the reward score, the worse the completion
    softmax(logits) = probabilities
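A minimal sketch of step 4's pairwise objective and step 5's reward scoring, assuming PyTorch and a two-class sequence-classification head; the function names and tensor shapes are assumptions, not code from the course:

```python
import torch
import torch.nn.functional as F

# Step 4: pairwise reward-model loss. Given scalar scores for the human-preferred
# completion y_j and the rejected completion y_k, maximize log sigmoid(r_j - r_k).
def reward_pairwise_loss(score_preferred, score_rejected):
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Step 5: use the reward model as a binary classifier. Softmax over the two class
# logits gives probabilities; the logit of the "good" class serves as the reward value.
def reward_value(class_logits):                  # shape: (batch, 2)
    probs = torch.softmax(class_logits, dim=-1)  # softmax(logits) = probabilities
    reward = class_logits[:, 0]                  # higher reward = better completion
    return reward, probs
```

As noted above, the lower this reward value, the worse the completion scores against the alignment criterion.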

RL Algorithm

  • The RL algorithm updates the weights of the LLM based on the reward assigned to the completions generated by the current version of the LLM
  • ex. Q-Learning, PPO (Proximal Policy Optimization, the most popular method)
  • PPO optimizes the LLM to be more aligned with human preferences (its clipped objective is sketched below)
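A minimal sketch of PPO's clipped surrogate objective, using the standard formulation rather than code from the course; `advantages` and the old/new log-probabilities are assumed to be computed elsewhere, and `clip_eps` is a typical default:

```python
import torch

# PPO clipped surrogate loss: keep the updated policy close to the old one by
# clipping the probability ratio pi_new / pi_old to [1 - eps, 1 + eps].
def ppo_clip_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    ratio = torch.exp(new_logprobs - old_logprobs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()    # minimize the negative objective
```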

Reward hacking: the model achieves a high reward score, but its output does not actually align with the criterion; the quality is not improved

  • To avoid this, we can use the initial instruct model as a frozen reference model: during training, we pass the prompt dataset to both the reference model and the RL-updated LLM

  • Then we calculate the KL Divergence Shift Penalty (KL divergence is a statistical measure of how different two probability distributions are) between the two models' output distributions

  • The penalty is added to the reward calculation, then the update goes through PPO (often with PEFT so only adapter weights are trained), and back to the reward model (see the sketch after this list)
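A minimal sketch of the KL shift penalty, assuming per-token vocabulary logits from both models are available; `kl_coef` and the function name are illustrative choices, not values from the course:

```python
import torch
import torch.nn.functional as F

# Penalize the RL-updated LLM for drifting away from the frozen reference
# (initial instruct) model, so it cannot "hack" the reward by producing
# degenerate text that merely scores well.
def penalized_reward(reward, policy_logits, ref_logits, kl_coef=0.2):
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    # KL(policy || reference), summed over the vocabulary, averaged over tokens
    kl = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).mean()
    return reward - kl_coef * kl
```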

Constitutional AI

  • First proposed in 2022 by researchers at Anthropic
  • A method for training models using a set of rules and principles that govern the model's behavior

Red Teaming: deliberately prompt the model to generate harmful responses; those harmful responses are then removed or revised according to the principles before further training
