Reinforcement Learning (RL) from Human Feedback (RLHF)
 
{{#seo:
|title=PRIMO.ai
|titlemode=append
|keywords=artificial, intelligence, machine, learning, models, algorithms, data, singularity, moonshot, Tensorflow, Facebook, Meta, Google, Nvidia, Microsoft, Azure, Amazon, AWS
|description=Helpful resources for your journey with artificial intelligence; videos, articles, techniques, courses, profiles, and tools
}}
[https://www.youtube.com/results?search_query=ai+Reinforcement+Human+Feedback+RLHF YouTube]
[https://www.quora.com/search?q=ai%20Reinforcement%20Human%20Feedback%20RLHF ... Quora]
[https://www.google.com/search?q=ai+Reinforcement+Human+Feedback+RLHF ...Google search]
[https://news.google.com/search?q=ai+Reinforcement+Human+Feedback+RLHF ...Google News]
[https://www.bing.com/news/search?q=ai+Reinforcement+Human+Feedback+RLHF&qft=interval%3d%228%22 ...Bing News]
  
* [[Reinforcement Learning (RL)]]
* [[Human-in-the-Loop (HITL) Learning]]
* [[Agents]] ... [[Robotic Process Automation (RPA)|Robotic Process Automation]] ... [[Assistants]] ... [[Personal Companions]] ... [[Personal Productivity|Productivity]] ... [[Email]] ... [[Negotiation]] ... [[LangChain]]
* [[What is Artificial Intelligence (AI)? | Artificial Intelligence (AI)]] ... [[Generative AI]] ... [[Machine Learning (ML)]] ... [[Deep Learning]] ... [[Neural Network]] ... [[Reinforcement Learning (RL)|Reinforcement]] ... [[Learning Techniques]]
* [[Conversational AI]] ... [[ChatGPT]] | [[OpenAI]] ... [[Bing/Copilot]] | [[Microsoft]] ... [[Gemini]] | [[Google]] ... [[Claude]] | [[Anthropic]] ... [[Perplexity]] ... [[You]] ... [[phind]] ... [[Grok]] | [https://x.ai/ xAI] ... [[Groq]] ... [[Ernie]] | [[Baidu]]
* [[Policy]] ... [[Policy vs Plan]] ... [[Constitutional AI]] ... [[Trust Region Policy Optimization (TRPO)]] ... [[Policy Gradient (PG)]] ... [[Proximal Policy Optimization (PPO)]] (see the PPO sketch after this list)
* [https://www.surgehq.ai/blog/introduction-to-reinforcement-learning-with-human-feedback-rlhf-series-part-1 Introduction to Reinforcement Learning with Human Feedback | Edwin Chen - Surge]
* [https://aisupremacy.substack.com/p/what-is-reinforcement-learning-with What is Reinforcement Learning with Human Feedback (RLHF)? | Michael Spencer]
* [https://www.lesswrong.com/posts/d6DvuCKH5bSoT62DB/long-list-of-problems-with-reinforcement-learning-from-human-1 Compendium of problems with RLHF | Raphael S - LessWrong]
* [https://medium.com/@sthanikamsanthosh1994/reinforcement-learning-from-human-feedback-rlhf-532e014fb4ae Reinforcement Learning from Human Feedback (RLHF) - ChatGPT | Sthanikam Santhosh - Medium]
* [https://www.deepmind.com/blog/learning-through-human-feedback Learning through human feedback |] [[Google]] DeepMind
* [https://pub.towardsai.net/paper-review-summarization-using-reinforcement-learning-from-human-feedback-e000a66404ff Paper Review: Summarization using Reinforcement Learning From Human Feedback | Towards AI] ... AI Alignment, Reinforcement Learning from Human Feedback, [[Proximal Policy Optimization (PPO)]]
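The policy bullet above ends with [[Proximal Policy Optimization (PPO)]], the optimizer most RLHF pipelines (e.g., InstructGPT) use for the fine-tuning phase. As a rough orientation, here is a minimal, hedged sketch of PPO's clipped surrogate loss in PyTorch; the function name <code>ppo_clip_loss</code> and the tensor shapes are illustrative, not taken from any of the linked sources.

<syntaxhighlight lang="python">
import torch

def ppo_clip_loss(logp_new: torch.Tensor,   # (batch,) log-probs under the current policy
                  logp_old: torch.Tensor,   # (batch,) log-probs when the samples were collected
                  advantages: torch.Tensor, # (batch,) advantage estimates
                  clip_eps: float = 0.2) -> torch.Tensor:
    # Probability ratio pi_new / pi_old, computed in log space for stability.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping bounds how far one update can move the policy (the "proximal" part).
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic bound of the two and negate, since optimizers minimize.
    return -torch.min(unclipped, clipped).mean()
</syntaxhighlight>

The clipping is what keeps each step close to the policy that generated the samples, which matters in RLHF because large policy jumps quickly exploit flaws in the learned reward model.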
<hr>
[https://arxiv.org/abs/1706.03741 Deep reinforcement learning from human preferences | P. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei]
<hr>
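In the Christiano et al. paper, the reward model is trained on pairwise human preferences: the probability that one segment is preferred over another is modeled from the difference of their predicted scalar rewards. Below is a minimal sketch of that pairwise (Bradley-Terry style) loss in PyTorch; the small MLP, the feature dimension, and the batch are toy placeholders, not the paper's architecture.

<syntaxhighlight lang="python">
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in: maps a fixed-size segment encoding to a scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one scalar reward per input

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch of comparisons: encodings of the preferred and the rejected segment.
preferred = torch.randn(32, 16)
rejected = torch.randn(32, 16)

# P(preferred beats rejected) = sigmoid(r_pref - r_rej); minimize the negative log-likelihood.
loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
print(f"pairwise preference loss: {loss.item():.3f}")
</syntaxhighlight>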
 
<img src="https://preview.redd.it/fp5mh1sdayca1.png?width=2324&format=png&auto=webp&v=enabled&s=30fce8e48088730461253f0b94ac1f01673475b0" width="800">

[https://gist.github.com/JoaoLages/c6f2dfd13d2484aa8bb0b2d567fbf093 Reinforcement Learning from Human Feedback (RLHF) - a simplified explanation | Joao Lages]
  
  
 
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/rlhf/rlhf.png" width="800">
 
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/rlhf/rlhf.png" width="800">
 +
 +
 +
[https://huggingface.co/blog/rlhf Illustrating Reinforcement Learning from Human Feedback (RLHF) | N. Lambert, L. Castricato, L. von Werra, and A. Havrilla -] [[Hugging Face]
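The Hugging Face diagram above shows the fine-tuning phase combining the reward model's score with a KL penalty that keeps the tuned policy close to the frozen initial (SFT) model. A hedged sketch of that shaped reward follows; the function name, tensor shapes, and default <code>beta</code> are illustrative choices, not values from the blog post.

<syntaxhighlight lang="python">
import torch

def shaped_reward(rm_score: torch.Tensor,    # (batch,) reward-model score per response
                  logp_policy: torch.Tensor, # (batch, seq) log pi_RL(token | context)
                  logp_ref: torch.Tensor,    # (batch, seq) log pi_ref(token | context), frozen SFT model
                  beta: float = 0.02) -> torch.Tensor:
    # The per-token log-ratio is a simple estimator of divergence from the reference model.
    kl_per_token = logp_policy - logp_ref
    # Reward high scores, but penalize drifting away from the SFT distribution.
    return rm_score - beta * kl_per_token.sum(dim=-1)
</syntaxhighlight>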
  
 
{|<!-- T -->
<b>Reinforcement Learning from Human Feedback: From Zero to ChatGPT</b><br>In this talk, we will cover the basics of Reinforcement Learning from Human Feedback (RLHF) and how this technology is being used to enable state-of-the-art ML tools like ChatGPT. Most of the talk will be an overview of the interconnected ML models and will cover the basics of Natural Language Processing and Reinforcement Learning (RL) that one needs to understand how RLHF is used on large language models. It will conclude with open questions in RLHF.

* [https://twitter.com/thomassimonini Thomas Twitter]
  
Nathan Lambert is a Research Scientist at [[HuggingFace]]. He received his PhD from the University of California, Berkeley, working at the intersection of machine learning and robotics. He was advised by Professor Kristofer Pister in the Berkeley Autonomous Microsystems Lab and by Roberto Calandra at Meta AI Research. He was lucky to intern at [[Meta|Facebook]] AI and DeepMind during his PhD. Nathan was awarded the UC Berkeley EECS Demetri Angelakos Memorial Achievement Award for Altruism for his efforts to improve community norms.
 
|}
|<!-- M -->
 
<youtube>wA8rjKueB3Q</youtube>
<b>How ChatGPT works - From Transformers to Reinforcement Learning with Human Feedback (RLHF)</b><br>ChatGPT was recently released by [[OpenAI]], and it is fundamentally a next-token/word prediction model: given the prompt, predict the next token/word(s). Trained on a massive internet corpus, it is powerful enough to perform many tasks zero-shot, such as summarization, code completion, and question answering.
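Since the talk's starting point is that ChatGPT-style models are next-token predictors, here is a minimal sketch of that loop, assuming the Hugging Face <code>transformers</code> library and the small public <code>gpt2</code> checkpoint (any causal language model would do):

<syntaxhighlight lang="python">
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Reinforcement learning from human feedback is"
inputs = tokenizer(prompt, return_tensors="pt")

# One forward pass scores every vocabulary entry as the candidate next token...
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]
probs = torch.softmax(logits, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)  # sample rather than argmax
print("next token:", tokenizer.decode(next_id))

# ...and generation is just that step applied autoregressively.
out = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))
</syntaxhighlight>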
  
 
Amidst the hype of ChatGPT, it can be easy to assume that the model can reason and think for itself. Here, we try to demystify how the model works, starting with a basic introduction to Transformers and then showing how the model's output can be improved using Reinforcement Learning with Human Feedback (RLHF).

Slides and code here

Transformer Introduction here

References:
 
* [https://arxiv.org/pdf/1706.03762.pdf Original Transformer Paper (Attention is all you need)]
* [https://arxiv.org/pdf/2005.14165.pdf GPT Paper]
* [https://arxiv.org/pdf/1911.00536.pdf DialoGPT Paper (conversational AI by] [[Microsoft]])
* [https://arxiv.org/pdf/2203.02155.pdf InstructGPT Paper (with RLHF)]
 
* [https://jalammar.github.io/illustrated-transformer/ Illustrated Transformer]
* [https://jalammar.github.io/illustrated-gpt2/ Illustrated GPT-2]
 
* 0:00 Introduction
* 3:09 [[Embedding]] Space
* 15:35 Overall Transformer Architecture
* 36:06 Transformer (Details)
* 49:28 GPT Architecture
* 56:38 GPT Training and Loss Function
* 1:05:25 Live Demo of GPT Next Token Generation and Attention Visualisation
* 1:16:55 Conversational AI
* 1:19:00 Reinforcement Learning from Human Feedback (RLHF)
* 1:45:15 Discussion
 
AI and ML enthusiast. Likes to think about the essences behind AI breakthroughs and explain them in a simple and relatable way. Also an avid game creator.
* [https://delvingintotech.wordpress.com/ Online AI blog]
* [https://www.linkedin.com/in/chong-min-tan-94652288/ LinkedIn]
* [https://www.twitch.tv/johncm99 Twitch]
* [https://twitter.com/johntanchongmin Twitter]
* [https://simmer.io/@chongmin Try out my games here]
 
|}
|}<!-- B -->
<youtube>bSvTVREwSNw</youtube>
