
25 Feb 2024

Calibrate Before Use: Improving Few-Shot Performance Of Language Models

When building NLP systems on top of large language models, it pays to calibrate the model before use, especially in few-shot learning scenarios. Here, calibration means correcting the model's output probabilities for biases introduced by the prompt, not retraining the model. The paper below shows that GPT-3's few-shot accuracy swings widely with the choice of training examples, their ordering, and the prompt format, and traces this instability to systematic biases in the model's predictions.

Identifying these biases is the first step; counteracting them is the calibration the title refers to. With careful calibration, language models can perform well even with very limited training examples, making in-context learning a far more dependable way to build NLP systems.



Few-shot learning—the ability to learn tasks with limited examples—is essential to intelligence (Lake et al., 2015; Yogatama et al., 2019). Recent work shows that sizeable neural language models can perform few-shot learning without finetuning (Radford et al., 2019; Brown et al., 2020). Specifically, GPT-3 (Brown et al., 2020) can perform numerous tasks when provided with a few examples in a natural language prompt.

(*Equal contribution. 1UC Berkeley, 2University of Maryland, 3UC Irvine. Correspondence to: Eric Wallace.)

For example, to perform sentiment analysis, one can condition GPT-3 on a prompt such as:

Input: Subpar acting. Sentiment: Negative

Input: Beautiful film. Sentiment: Positive

Input: Amazing. Sentiment:

The first two lines correspond to two training examples, and the last line is a test example. To make predictions, the model predicts whether the subsequent token is more likely to be the word “Positive” or “Negative.” This few-shot “in-context” learning style is attractive because it shows that the model can learn without parameter updates.
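The prediction step described above can be sketched in a few lines of Python. Note that `build_prompt`, `label_logprob`, and `classify` are illustrative names of my own, and `label_logprob` is a toy heuristic stand-in; a real system would query the language model for the probability of each label as the next token.

```python
# Sketch of few-shot classification by comparing next-token probabilities.
# `label_logprob` is a toy stand-in for a real LM's next-token scorer.

def build_prompt(train_examples, test_input):
    lines = [f"Input: {text} Sentiment: {label}" for text, label in train_examples]
    lines.append(f"Input: {test_input} Sentiment:")  # label left blank for the test example
    return "\n".join(lines)

def label_logprob(prompt, label):
    # Toy heuristic: favor "Positive" when the test input contains an
    # upbeat word. A real system would instead ask the LM for the
    # probability of `label` as the next token given `prompt`.
    test_line = prompt.rsplit("Input:", 1)[1]
    upbeat = any(w in test_line.lower() for w in ("amazing", "beautiful", "great"))
    if label == "Positive":
        return 0.0 if upbeat else -2.0
    return -2.0 if upbeat else 0.0

def classify(train_examples, test_input, labels=("Positive", "Negative")):
    prompt = build_prompt(train_examples, test_input)
    return max(labels, key=lambda label: label_logprob(prompt, label))
```

The key point is that no parameters are updated: the "model" for a new task is just a new prompt plus an argmax over the label names' next-token probabilities.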

And, more importantly, it has numerous practical advantages over the now-standard approach of finetuning (Radford et al., 2018; Devlin et al., 2019). First, it allows practitioners to “rapidly prototype” NLP models: changing the prompt immediately leads to a new model.

Second, it provides an entirely natural language interface to a machine learning model, which allows users—even those without technical expertise—to create NLP systems. Finally, since in-context learning reuses the same model for each task, it reduces memory requirements and system complexity when serving many tasks.

Background and Experimental Setup


Neural autoregressive language models (LMs) take a sequence of tokens as input and output a probability distribution over the next token. Large neural LMs can perform tasks in a zero- or few-shot manner using in-context learning (Radford et al., 2019; Brown et al., 2020). To do so, a natural language prompt is fed into the model.

This prompt contains three components: a format, a set of training examples, and a permutation (ordering) of those examples.

Prompt Format. The prompt format is a template that consists of placeholders for the training and test example(s) and possibly a natural language description of the task.

For example, the format of the prompt in Section 1 is a template of the form: “Input:” input “Sentiment:” label. Many alternate formats exist; e.g., one could frame the task as question answering.

Prompt Training Examples. The prompt’s training examples teach the LM how to solve the task.

The prompt from Section 1 consists of two training examples; we refer to this as “two-shot” learning. We also consider “zero-shot” learning, where there are no training examples.

Training Example Permutation. When training examples are used, they appear in a particular permutation, e.g., the “Subpar acting” example comes first in the prompt from Section 1.

The permutation matters because neural language models update their hidden states in a left-to-right fashion. To make predictions on a test input, we slot it into the test placeholder and generate from the LM. For example, see the “Amazing.” test example in the prompt from Section 1. For generation tasks, we generate greedily from the LM until it produces a newline character. For classification tasks, the probability of each class is given by the probability assigned to its associated label name, e.g., the words “Negative” and “Positive” for sentiment classification.
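The three prompt components can be made concrete in a short sketch. The template string and helper name below are my own illustrative choices, not notation from the paper:

```python
from itertools import permutations

# The three prompt components: a format template, a set of training
# examples, and a permutation (ordering) of those examples.
FORMAT = "Input: {text} Sentiment: {label}"     # the prompt format
EXAMPLES = [("Subpar acting.", "Negative"),     # the training examples
            ("Beautiful film.", "Positive")]

def make_prompt(examples, test_input, fmt=FORMAT):
    lines = [fmt.format(text=t, label=l) for t, l in examples]
    # The test example uses the same format with the label left blank.
    lines.append(fmt.format(text=test_input, label="").rstrip())
    return "\n".join(lines)

# Each permutation of the training examples yields a distinct prompt,
# and hence, as the analysis below shows, potentially different accuracy.
prompts = [make_prompt(list(p), "Amazing.") for p in permutations(EXAMPLES)]
```

Changing `FORMAT`, the contents of `EXAMPLES`, or their order each produces a different prompt, which is exactly the source of variance the paper studies.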

What Causes the High Variance?

We next analyze why GPT-3’s accuracy varies across different training examples, permutations, and prompt formats. Concretely, we show that the variance arises because LMs are biased towards outputting answers that are (1) frequent in the prompt (majority label bias), (2) near the end of the prompt (recency bias), or (3) common in the pre-training data (common token bias).

Majority Label Bias. We find that GPT-3 is biased towards answers that appear frequently in the prompt.

A trivial case is when a text classification prompt has a class imbalance, e.g., more Positive than Negative sentiment examples. This is demonstrated in the “unbalanced” region of Figure 4: when one class is more common, GPT-3 2.7B is heavily biased towards predicting that class. Since the SST-2 sentiment analysis dataset is balanced, this bias causes significant accuracy degradations.

The majority label bias also explains why we frequently observe a drop in accuracy when moving from 0-shot to 1-shot—we found that the drop is due to the model continually repeating the class of the one training example. The majority label bias also occurs for generation tasks.


Calibrating language models before use, especially in few-shot scenarios, is a simple and effective way to make in-context learning more dependable. Correcting for the prompt-induced biases described above removes much of the variance across training examples, permutations, and formats.

While powerful models can be helpful with just a few examples, their raw predictions are skewed towards answers that are frequent in the prompt, recent in the prompt, or common in the pre-training data. With calibration that counteracts these biases, language models can serve users capably even with limited initial data, and without the careful prompt engineering that uncalibrated few-shot learning otherwise demands.
