In Context Learning (ICL)

What is In Context Learning (ICL)?

In-context learning (ICL) is a specific method of prompt engineering where demonstrations of the task are provided to the model as part of the prompt (in natural language). With ICL, you can use off-the-shelf large language models (LLMs) to solve novel tasks without the need for fine-tuning. ICL can also be combined with fine-tuning for more powerful LLMs.

The main types of machine learning (supervised ML, unsupervised ML, semi-supervised ML, and reinforcement learning) can only learn from the data they are trained on. That is, they can only solve tasks that they are trained to solve. Sufficiently large LLMs have shown a new type of machine learning, in-context learning: the ability to learn to solve new tasks from “training” examples provided in the prompt. In contrast to the aforementioned types of ML, the newly learnt skill is forgotten directly after the LLM sends its response, because model weights are not updated.

With in-context learning (ICL), an LLM learns a new task from a small set of examples presented within the context (the prompt) at inference time. LLMs trained on sufficient data exhibit ICL, even though they are trained only with the objective of next-token prediction. Much of the interest in LLMs stems from this ability to prompt with examples, as it enables applications on novel tasks without the need to fine-tune the LLM.

How to Engineer Prompts for In-Context Learning

Imagine a recipe generation service where you enter the ingredients you have available and ask the service to generate a recipe for you. One way to implement this service is to prefix the prompt with example recipes before adding your list of available ingredients. For this, you may have thousands of recipes indexed in a VectorDB. When a query arrives, you use the ingredients to look up the most relevant recipes in the VectorDB, paste them in at the start of the prompt, then write the list of available ingredients, and finally ask your LLM to generate a recipe. This is an example of retrieval-augmented generation for LLMs.
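The flow described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the in-memory RECIPES list and the ingredient-overlap score stand in for a real VectorDB and its similarity search, and the names ingredient_overlap, retrieve_recipes, and build_recipe_prompt are hypothetical helpers, not part of any library.

```python
def ingredient_overlap(query, recipe_ingredients):
    """Jaccard similarity between two ingredient collections (stand-in for vector search)."""
    q, r = set(query), set(recipe_ingredients)
    union = q | r
    return len(q & r) / len(union) if union else 0.0


def retrieve_recipes(query, recipes, k=2):
    """Return the k recipes whose ingredients best overlap the query."""
    ranked = sorted(
        recipes,
        key=lambda rec: ingredient_overlap(query, rec["ingredients"]),
        reverse=True,
    )
    return ranked[:k]


def build_recipe_prompt(query, recipes, k=2):
    """Put retrieved example recipes first, then the user's ingredients last."""
    parts = []
    for rec in retrieve_recipes(query, recipes, k):
        ingredients = ", ".join(rec["ingredients"])
        parts.append(f"Ingredients: {ingredients}\nRecipe: {rec['text']}")
    # The open-ended "Recipe:" line is what the LLM is asked to complete.
    parts.append(f"Ingredients: {', '.join(query)}\nRecipe:")
    return "\n\n".join(parts)


# Hypothetical toy data; a real service would index thousands of recipes.
RECIPES = [
    {"ingredients": ["egg", "flour", "milk"], "text": "Whisk and fry thin pancakes."},
    {"ingredients": ["tomato", "pasta", "garlic"], "text": "Boil pasta, toss with garlic tomato sauce."},
    {"ingredients": ["egg", "cheese"], "text": "Beat eggs, fold in cheese, cook an omelette."},
]

prompt = build_recipe_prompt(["egg", "milk", "cheese"], RECIPES, k=2)
print(prompt)
```

The resulting string would then be sent to whichever LLM the service uses; that call is left out here because it depends on the provider.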

The above service follows one piece of advice on prompt engineering - add the most relevant context at the beginning or the end of a prompt to improve the performance of LLMs. Researchers have shown that adding relevant context (i.e., the question of what recipe to generate) in the middle of the prompt leads to worse performance.
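One way to act on this advice when a prompt carries several retrieved items is to reorder them so the strongest ones sit at the edges of the prompt. The sketch below is an illustrative heuristic, not a method prescribed by the research mentioned above; place_at_edges is a hypothetical helper that assumes its input is already sorted from most to least relevant.

```python
def place_at_edges(items_by_relevance):
    """Reorder items (most relevant first) so the strongest sit at the
    start and end of the list and the weakest end up in the middle."""
    front, back = [], []
    for i, item in enumerate(items_by_relevance):
        # Alternate: ranks 1, 3, 5, ... fill the front; ranks 2, 4, ... fill the back.
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is most relevant, "E" least
print(place_at_edges(ranked))  # ['A', 'C', 'E', 'D', 'B']
```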

How many example recipes can you add to your prompt? That depends on the context window size of your model. GPT-4 with a 32K context window can accept roughly 50 pages of input text (about 32,000 tokens), and inference slows down as input prompts grow.
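A simple way to respect the context window is to keep adding examples until a token budget is used up. The sketch below assumes a crude rule of thumb of about four characters per token; a real service would count tokens with the model's own tokenizer. The names estimate_tokens and fit_examples, and the reserved_for_answer parameter, are hypothetical.

```python
def estimate_tokens(text):
    """Very rough token estimate: about 4 characters per token (an assumption)."""
    return max(1, len(text) // 4)


def fit_examples(examples, question, context_window, reserved_for_answer):
    """Greedily keep examples, in order, while the prompt stays within budget."""
    budget = context_window - reserved_for_answer - estimate_tokens(question)
    kept = []
    for example in examples:
        cost = estimate_tokens(example)
        if cost > budget:
            break  # the next example would overflow the context window
        kept.append(example)
        budget -= cost
    return kept


examples = ["x" * 400, "y" * 400, "z" * 400]  # about 100 tokens each
question = "q" * 40                           # about 10 tokens
kept = fit_examples(examples, question, context_window=300, reserved_for_answer=80)
print(len(kept))  # 2
```

Reserving part of the window for the answer matters because the model's output shares the same context window as the prompt.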

Zero-Shot vs One Shot vs Few Shot Learning

Reynolds and McDonell describe few-shot prompts as prompts in which a small number of examples of solved tasks are provided as part of the input to the trained LLM. The goal is to make predictions for new classes based on just a few examples.

Zero-Shot Inference

The following example provides zero examples as context.

Prompt: Who won the 2014 World Cup?

GPT-4 Response: Germany won the 2014 FIFA World Cup. 
The tournament was held in Brazil, and Germany defeated 
Argentina 1-0 in the final, with the match-winning goal 
scored by Mario Götze during extra time.

The zero-shot completion includes additional information about the match that I didn't really want.
Can we use in-context learning to teach the model the format of the response I am looking for?

One-Shot Learning/Inference

The following example adds an instruction and a single solved example (one shot) to the context:

Prompt:  Answer the question using the format 
shown in the context.
Who won the World Cup in 2014?
Germany won the 2014 FIFA World Cup. They defeated 
Argentina in the final. It was a low scoring game.
Who won the world cup in 2018?

GPT-4 Response: France won the 2018 FIFA World Cup. 
They defeated Croatia in the final. It was a higher 
scoring game compared to the 2014 final.

Few-Shot Learning/Inference

The following example adds a few-shot prompt to the context. Giving more examples makes the responses of your LLM more consistent.

Prompt: Answer the question using the format 
shown in the context.
Who won the World Cup in 2014?
Germany won the 2014 FIFA World Cup. They defeated 
Argentina in the final. It was a low scoring game.
Who won the World Cup in 2018?
France won the 2018 FIFA World Cup. They defeated 
Croatia in the final. It was a high scoring game.
Who won the world cup in 2022?

GPT-4 Response: Argentina won the 2022 FIFA World Cup. 
They defeated France in the final. It was a high scoring game.
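The one-shot and few-shot prompts above share the same shape: an instruction, zero or more solved question/answer pairs, and the new question. A minimal sketch of a prompt builder follows; build_few_shot_prompt is a hypothetical helper, and sending the prompt to a model is left out because that depends on the provider.

```python
INSTRUCTION = "Answer the question using the format shown in the context."


def build_few_shot_prompt(examples, question, instruction=INSTRUCTION):
    """Assemble instruction, solved (question, answer) pairs, and the new question."""
    lines = [instruction]
    for example_question, example_answer in examples:
        lines.append(example_question)
        lines.append(example_answer)
    lines.append(question)  # left unanswered for the LLM to complete
    return "\n".join(lines)


examples = [
    ("Who won the World Cup in 2014?",
     "Germany won the 2014 FIFA World Cup. They defeated Argentina in the final. "
     "It was a low scoring game."),
    ("Who won the World Cup in 2018?",
     "France won the 2018 FIFA World Cup. They defeated Croatia in the final. "
     "It was a high scoring game."),
]

one_shot = build_few_shot_prompt(examples[:1], "Who won the World Cup in 2018?")
few_shot = build_few_shot_prompt(examples, "Who won the World Cup in 2022?")
print(few_shot)
```

Passing an empty list of examples gives an instruction-only prompt, which is close to the zero-shot case except that the zero-shot example in this article has no instruction.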

Model Size and Context Window Size for ICL

In-context learning benefits from larger context windows, as they leave room for more examples in the prompt. Similarly, ICL can benefit from LLMs with more parameters. For example, Kosinski showed that GPT-4, with a large 32K context window, could solve 95% of a set of 40 classic false-belief tasks widely used to test Theory-of-Mind (ToM) in humans. In contrast, GPT-3, a smaller model (up to 1000 times smaller than GPT-4) with a context window of 2K, could solve only 40% of the false-belief tasks. Similarly, Levenstein et al found that LLaMA 30b, an LLM with only tens of billions of parameters and a smaller context window, showed no ability to solve ToM problems.

Is In-Context Learning Real?

Yes. Raventós et al study ICL for linear regression, where each task corresponds to a different latent regression vector. They show that as pre-training task diversity increases beyond a threshold, transformer models outperform Bayesian estimators on unseen tasks. The implication is that ICL is an emergent phenomenon: their transformer model moves beyond memorization of the pre-training tasks when there is sufficient diversity and scale in the pre-training data. With ICL, transformers can solve new tasks not seen during pre-training.
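To make the linear regression setting concrete, here is one common way to write it down. This is a sketch under assumed Gaussian distributions and may differ in detail from the paper; the symbols d (input dimension), K (number of in-context examples), and sigma (noise level) are introduced here for illustration.

```latex
\begin{align*}
w &\sim \mathcal{N}(0, I_d), \qquad x_i \sim \mathcal{N}(0, I_d), \qquad
y_i = w^\top x_i + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \\
\text{prompt} &= (x_1, y_1, \dots, x_K, y_K, x_{K+1}), \\
\hat{w}_{\text{ridge}} &= \left(X^\top X + \sigma^2 I_d\right)^{-1} X^\top y, \qquad
\hat{y}_{K+1} = \hat{w}_{\text{ridge}}^\top x_{K+1}.
\end{align*}
```

Under these assumptions, the ridge estimate is the posterior mean of w given the K examples, so it is a natural reference point for how well a model predicts on a task it never saw during pre-training.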

Why does ICL work?

Informally, Charles Fyre declares that prompting is mostly subtractive: we delete potential words with each input. In this view, ICL is more about defining the task than about learning it. However, other researchers disagree and believe that LLMs can learn in a single shot.

Dileep et al speculate that schema learning and rebinding are the key mechanisms: they believe that ICL can be explained as schema learning together with simultaneous inference of the slot-filling and the latent template from the prompt.

As Dileep explains, “fast rebinding is a special case of expectation maximization, where the updates are local to surprising tokens found during inference. Most of the content of the latent circuit remains ‘anchored’ to prior training, while some ‘slots’ are filled on the fly with rebinding”. In contrast, Xie et al speculate that implicit Bayesian inference is the main mechanism at work, although Dileep et al show that this is not enough to solve tasks involving novel word usage, such as the “dax test”.
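The implicit Bayesian inference view can be summarized, in simplified form, as the model marginalizing over a latent concept that it infers from the prompt. The notation below (theta for the latent concept) is an assumption made for this sketch rather than a quotation from the paper.

```latex
\begin{equation*}
p(\text{output} \mid \text{prompt})
  = \int_{\theta} p(\text{output} \mid \theta, \text{prompt}) \,
    p(\theta \mid \text{prompt}) \, d\theta
\end{equation*}
```

In this reading, the examples in the prompt do not change the model; they sharpen the posterior over which concept is in play, and the output is then generated as if that concept were known.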

Resources

Our research paper, "The Hopsworks Feature Store for Machine Learning", is the first feature store paper to appear at a top-tier database or systems conference (SIGMOD 2024). This article series describes, in lay terms, concepts and results from this study.