RLHF: Reinforcement Learning using Human Feedback
for Optimization of ChatGPT
Journal:
GRENZE International Journal of Engineering and Technology
Authors:
Pranav K. Dalvi, Kirti Y. Digholkar
Volume:
10
Issue:
2
Grenze ID:
01.GIJET.10.2.92
Pages:
3362-3370
Abstract
Reinforcement learning (RL) is a subfield of machine learning in which an agent interacts
with an environment, receives rewards for its actions, and learns to maximize long-term reward.
GPT models such as ChatGPT are powerful, but they are trained primarily with unsupervised
learning on text data rather than with RL. RL can be applied to chatbots by defining the
actions, the environment (for example, simulated conversations), and a reward based on response
quality. For ChatGPT, RL could serve as one component of a broader training pipeline that
fine-tunes and optimizes its responses. We compare several RL algorithms suitable for ChatGPT
across a range of performance metrics and find that the model can be optimized to generate
better outputs. From this comparison, we identify an algorithm that improves the quality of
ChatGPT's responses.
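
The framing of actions, environment, and reward described in the abstract can be illustrated with a minimal sketch. It is not the algorithm studied in the paper: it assumes a toy setting in which the "agent" chooses among three fixed candidate responses, a hard-coded reward table stands in for human feedback on response quality, and the policy is updated with a plain REINFORCE rule. The names `RESPONSES`, `REWARDS`, and `train` are hypothetical.

```python
import math
import random

# Hypothetical actions: candidate responses the toy chatbot can choose from.
RESPONSES = [
    "I don't know.",
    "Here is a short answer.",
    "Here is a detailed, sourced answer.",
]
# Assumed reward per response, standing in for human quality ratings.
REWARDS = [0.0, 0.5, 1.0]


def softmax(logits):
    """Turn raw preference scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=3000, lr=0.1, seed=0):
    """Run REINFORCE with a running-average baseline on the toy task."""
    rng = random.Random(seed)
    logits = [0.0] * len(RESPONSES)  # policy parameters, one per response
    baseline = 0.0                   # running estimate of the average reward
    for _ in range(steps):
        probs = softmax(logits)
        # The agent acts: sample a response from the current policy.
        action = rng.choices(range(len(RESPONSES)), weights=probs)[0]
        # The environment answers with a reward for that response.
        reward = REWARDS[action]
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)
        # Gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
    return softmax(logits)


probs = train()
best = RESPONSES[probs.index(max(probs))]
print(best, [round(p, 3) for p in probs])
```

In an actual RLHF pipeline the fixed reward table would be replaced by a learned reward model, and the three-way choice by token-level generation from a language model; the sketch only shows how the policy shifts probability toward responses that earn higher reward.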