Skip to content

Few-Shot Prompting

Overview

While large-language models demonstrate remarkable zero-shot capabilities, they still fall short on more complex tasks when using the zero-shot setting. Few-shot prompting can be used as a technique to enable in-context learning where we provide demonstrations in the prompt to steer the model to better performance. The demonstrations serve as conditioning for subsequent examples where we would like the model to generate a response.

According to Touvron et al. 2023 few shot properties first appeared when models were scaled to a sufficient size (Kaplan et al., 2020).

Basic Example

Let's demonstrate few-shot prompting via an example that was presented in Brown et al. 2020. In the example, the task is to correctly use a new word in a sentence.

Prompt:

A "whatpu" is a small, furry animal native to Tanzania. An example of a sentence that uses the word whatpu is:
We were traveling in Africa and we saw these very cute whatpus.
 
To do a "farduddle" means to jump up and down really fast. An example of a sentence that uses the word farduddle is:

Output:

When we won the game, we all started to farduddle in celebration.

We can observe that the model has somehow learned how to perform the task by providing it with just one example (i.e., 1-shot). For more difficult tasks, we can experiment with increasing the demonstrations (e.g., 3-shot, 5-shot, 10-shot, etc.).

Best Practices for Demonstrations

Following the findings from Min et al. (2022), here are a few more tips about demonstrations/exemplars when doing few-shot:

  1. Label Distribution: "the label space and the distribution of the input text specified by the demonstrations are both important (regardless of whether the labels are correct for individual inputs)"
  2. Format Consistency: the format you use also plays a key role in performance, even if you just use random labels, this is much better than no labels at all.
  3. Balanced Selection: additional results show that selecting random labels from a true distribution of labels (instead of a uniform distribution) also helps.

Examples with Random Labels

Let's try out a few examples. Let's first try an example with random labels (meaning the labels Negative and Positive are randomly assigned to the inputs):

Prompt:

This is awesome! // Negative
This is bad! // Positive
Wow that movie was rad! // Positive
What a horrible show! //

Output:

Negative

We still get the correct answer, even though the labels have been randomized. Note that we also kept the format, which helps too. In fact, with further experimentation, it seems the newer GPT models we are experimenting with are becoming more robust to even random formats.

Prompt:

Positive This is awesome! 
This is bad! Negative
Wow that movie was rad!
Positive
What a horrible show! --

Output:

Negative

There is no consistency in the format above but the model still predicted the correct label. We have to conduct a more thorough analysis to confirm if this holds for different and more complex tasks, including different variations of prompts.

Limitations of Few-shot Prompting

Standard few-shot prompting works well for many tasks but is still not a perfect technique, especially when dealing with more complex reasoning tasks. Let's demonstrate why this is the case. Do you recall the previous example where we provided the following task:

The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A:

If we try this again, the model outputs the following:

Yes, the odd numbers in this group add up to 107, which is an even number.

This is not the correct response, which not only highlights the limitations of these systems but that there is a need for more advanced prompt engineering.

Let's try to add some examples to see if few-shot prompting improves the results.

Prompt:

The odd numbers in this group add up to an even number: 4, 8, 9, 15, 12, 2, 1.
A: The answer is False.
The odd numbers in this group add up to an even number: 17,  10, 19, 4, 8, 12, 24.
A: The answer is True.
The odd numbers in this group add up to an even number: 16,  11, 14, 4, 8, 13, 24.
A: The answer is True.
The odd numbers in this group add up to an even number: 17,  9, 10, 12, 13, 4, 2.
A: The answer is False.
The odd numbers in this group add up to an even number: 15, 32, 5, 13, 82, 7, 1. 
A:

Output:

The answer is True.

That didn't work. It seems like few-shot prompting is not enough to get reliable responses for this type of reasoning problem. The example above provides basic information on the task. If you take a closer look, the type of task we have introduced involves a few more reasoning steps. In other words, it might help if we break the problem down into steps and demonstrate that to the model. More recently, chain-of-thought (CoT) prompting has been popularized to address more complex arithmetic, commonsense, and symbolic reasoning tasks.

Key Takeaways

Overall, it seems that providing examples is useful for solving some tasks. When zero-shot prompting and few-shot prompting are not sufficient, it might mean that whatever was learned by the model isn't enough to do well at the task. From here it is recommended to start thinking about fine-tuning your models or experimenting with more advanced prompting techniques.

Key Benefits

  • Improved Performance: Better results on complex tasks compared to zero-shot
  • In-Context Learning: Models learn from examples provided in the prompt
  • Format Flexibility: Works with various input-output formats
  • Scalable Approach: Can increase examples for more difficult tasks

Applications

  • Text Classification: Sentiment analysis, topic classification
  • Language Translation: Providing translation examples
  • Code Generation: Showing code patterns and examples
  • Question Answering: Demonstrating answer formats
  • Creative Writing: Providing style examples

References

  • Brown et al. (2020) - Language Models are Few-Shot Learners
  • Min et al. (2022) - Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
  • Touvron et al. (2023) - LLaMA: Open and Efficient Foundation Language Models