Constitutional AI

Constitutional AI is a method for training AI systems using a set of rules or principles that act as a “constitution” for the system. The approach is intended to keep the system operating within a societally accepted framework and to align it with human intentions. Its benefits include letting a model explain why it is refusing to answer, making AI decision making more transparent, and controlling AI behavior more precisely with fewer human labels.

  • Constitutional AI is a technique that aims to imbue systems with “values” defined by a “constitution”³.
  • This makes the behavior of systems both easier to understand and simpler to adjust as needed³.

RL from AI Feedback (RLAIF)

RLAIF trains a preference model on a dataset of AI-generated preferences and then uses that preference model as the reward signal for reinforcement learning. It is the reinforcement learning stage of a method for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, which is why the overall method is referred to as ‘Constitutional AI’.

The process involves both a supervised learning phase and a reinforcement learning phase. In the supervised phase, responses are sampled from an initial model, the model generates self-critiques and revisions of those responses, and the original model is then finetuned on the revised responses. In the RL phase, pairs of samples are drawn from the finetuned model, and a model evaluates which sample in each pair is better. A preference model is trained on the resulting dataset of AI preferences and is then used as the reward signal for training with RL.
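
The data flow of the RL phase can be illustrated with a small Python sketch. It is not the original implementation: the function names (collect_ai_preferences, preference_loss), the record layout, and the toy stand-ins for the sampler, the AI judge, and the reward function are all hypothetical. The pairwise logistic loss is a common choice for fitting preference models and is assumed here, since the text above does not name a loss.

```python
import itertools
import math
from typing import Callable, List, Tuple

# (prompt, chosen, rejected); a hypothetical record layout for this sketch.
Pair = Tuple[str, str, str]


def collect_ai_preferences(
    prompts: List[str],
    sample: Callable[[str], str],
    judge: Callable[[str, str, str], int],
) -> List[Pair]:
    """Sample two responses per prompt and let an AI judge pick the better one.

    `judge` returns 0 if the first response is better, 1 otherwise.
    """
    pairs: List[Pair] = []
    for prompt in prompts:
        a, b = sample(prompt), sample(prompt)
        chosen, rejected = (a, b) if judge(prompt, a, b) == 0 else (b, a)
        pairs.append((prompt, chosen, rejected))
    return pairs


def preference_loss(reward: Callable[[str, str], float], pairs: List[Pair]) -> float:
    """Mean pairwise loss: -log(sigmoid(r(chosen) - r(rejected)))."""
    total = 0.0
    for prompt, chosen, rejected in pairs:
        margin = reward(prompt, chosen) - reward(prompt, rejected)
        # log(1 + exp(-margin)) equals -log(sigmoid(margin)).
        total += math.log1p(math.exp(-margin))
    return total / len(pairs)


# Toy stand-ins (assumptions): a real setup would call language models here.
_canned = itertools.cycle([
    "I can't help with that, but here is a safer alternative.",
    "Sure, here is how to do it.",
])


def toy_sample(prompt: str) -> str:
    # Stands in for sampling from the finetuned model.
    return next(_canned)


def toy_judge(prompt: str, a: str, b: str) -> int:
    # Stands in for a model asked which response better follows the principles.
    return 0 if "safer" in a else 1


def toy_reward(prompt: str, response: str) -> float:
    # Stands in for a preference model's scalar score.
    return 1.0 if "safer" in response else 0.0


pairs = collect_ai_preferences(["How do I pick a lock?"], toy_sample, toy_judge)
loss = preference_loss(toy_reward, pairs)
print(pairs[0][1], loss)
```

In a real pipeline the preference model's parameters would be adjusted to minimize this loss over the AI-labeled pairs, and its score would then serve as the reward during RL. A reward that scores both responses equally gives a loss of log 2, so values below that indicate the reward agrees with the AI labels.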