Phi-2

Overview

In this guide, we provide an overview of Phi-2, a 2.7 billion parameter language model, how to prompt Phi-2, and its capabilities. This guide also includes tips, applications, limitations, important references, and additional reading materials related to Phi-2 LLM.

Phi-2 Introduction

Phi-2 is the latest small language model (SLM) released by Microsoft Research. It follows the earlier Phi-1 and Phi-1.5 models.

Model Evolution

Phi-1

  • Parameters: 1.3 billion parameters
  • Training Data: "Textbook quality" data from the web (6B tokens) + synthetically generated textbooks and exercises with GPT-3.5 (1B tokens)
  • Performance: Excels at Python code generation tasks
  • Reference: (Gunasekar et al. 2023)

Phi-1.5

  • Focus: Common sense reasoning and language understanding capabilities
  • Capabilities: Complex reasoning tasks such as grade-school mathematics and basic coding tasks
  • Performance: Comparable to models 5 times larger

Phi-2

  • Parameters: 2.7 billion parameters
  • Improvements: Enhanced reasoning and language understanding capabilities
  • Performance: Outperforms models up to 25x larger
  • License: MIT License (usable in commercial settings)

Phi-2 Insights & Evaluation

Research Focus

LLM researchers are keen to explore whether small language models exhibit emergent capabilities similar to those of their larger counterparts, and whether there are training techniques that can help achieve this.

Training Details

  • Data: "Textbook-quality" data (1.4 trillion tokens with multiple passes)
  • Synthetic Datasets: Help teach common sense reasoning and general knowledge
  • Augmentation: Educational and high-quality web content
  • Training Time: 14 days on 96 A100 GPUs
  • Fine-tuning: No additional RLHF or instruction tuning applied

Knowledge Transfer

Knowledge from Phi-1.5 is transferred to Phi-2, which helps with:

  • Faster model convergence
  • A performance boost across several benchmarks

Performance Comparison

The figure below demonstrates the performance comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) on:

  • Common sense reasoning
  • Math reasoning
  • Code generation
  • Other language understanding benchmarks

Note: All tasks are evaluated using 0-shot, except BBH (3-shot CoT) and MMLU (5-shot).

Figure: Phi-2 LLM Performance & Benchmarks

Safety Performance

Safety Comparison

Although the model was not aligned with techniques like RLHF, it is reported to be safer in terms of toxicity and bias than the aligned open-source Llama-2-7B. The authors attribute this to their data curation efforts.

Figure: Phi-2 Safety Performance

Benchmark Performance

As shown in the figure below, Phi-2 outperforms:

  • Mistral 7B
  • Llama-2 (13B)
  • Llama-2-70B on multi-step reasoning tasks

It also matches or outperforms Google's Gemini Nano 2, despite its smaller size.

Figure: Phi-2 Performance Comparison

Capability Examples

Below, we provide several prompt examples that demonstrate the capabilities of the Phi-2 model on various tasks.

Physics Problem Solving

The example below demonstrates the ability of Phi-2 to solve physics word problems:

Figure: Phi-2 Physics Problem Solving

Error Identification

The example below demonstrates the ability of Phi-2 to identify mistakes in a student's calculation for a physics word problem:

Figure: Phi-2 Verifying and Correcting

Phi-2 Usage

According to the model page, Phi-2 can be prompted using three formats:

  1. QA format
  2. Chat format
  3. Code format

Below we demonstrate how to effectively use these prompt templates in different scenarios.
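Before walking through each format, here is a minimal sketch (not from the original guide) of loading Phi-2 with the Hugging Face transformers library, along with a small generate() helper used in the illustrative snippets that follow. The loading arguments are assumptions and may need adjusting for your hardware.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

def generate(prompt: str, max_new_tokens: int = 200) -> str:
    # Tokenize the raw prompt and generate a completion.
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Return only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)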

QA Format

Use case: Scenarios where you ask the model a question and want a concise answer in return.

Prompt Template:

Instruct: {{prompt}}
Output:

Example

Prompt:

Instruct: What is the difference between data and information?
Output:

Output:

Great question! Data refers to raw facts, statistics, or numbers that have not been analyzed or processed into any meaningful context. On the other hand, information is data that has been organized, analyzed, and interpreted in a way that it can be used by humans to make informed decisions or take actions. In other words, data is raw material, while information is processed and useful knowledge.
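For illustration, the same QA prompt can be run with the generate() helper sketched earlier (the helper is an assumption for this guide, not part of the official model API):

# Sketch: filling the Instruct/Output template and generating an answer.
qa_prompt = "Instruct: What is the difference between data and information?\nOutput:"
print(generate(qa_prompt).strip())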

Chat Format

Phi-2 uses a standard chat format with alternating Human and AI turns. Here's a simple example:

Prompt:

Human: Hello, who are you?
AI: Greetings! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of black holes?
AI:

Output:

Black holes form when a massive star runs out of fuel and collapses under its own gravitational pull. The resulting object is so dense that it distorts space-time around it, causing anything that comes too close to be pulled in by the immense gravity. This includes light, which cannot...
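As a rough sketch using the generate() helper from above, the chat prompt can be assembled turn by turn and the completion trimmed at the next "Human:" marker, since the base model will otherwise keep inventing further turns; the trimming logic is an illustrative assumption:

# Sketch: Human/AI chat-format prompting with Phi-2.
chat_prompt = (
    "Human: Hello, who are you?\n"
    "AI: Greetings! I am an AI research assistant. How can I help you today?\n"
    "Human: Can you tell me about the creation of black holes?\n"
    "AI:"
)
completion = generate(chat_prompt, max_new_tokens=256)
# The base model is not chat-tuned, so cut off any extra turns it generates.
answer = completion.split("Human:")[0].strip()
print(answer)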

Code Format

Below is a code generation prompt template that provides only the function signature to the model:

Prompt:

def multiply(a,b):

Important Note: The model was trained on a limited set of Python code examples and libraries, so its code generation capabilities and language support are restricted.
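A quick sketch of the code format with the same generate() helper; the exact completion will vary, and the trimming step is an illustrative assumption:

# Sketch: code-format prompting; the model completes the function body.
code_prompt = "def multiply(a,b):"
completion = generate(code_prompt, max_new_tokens=80)
# Keep only the first function; the model may continue with unrelated code.
body = completion.split("\ndef ")[0]
print(code_prompt + body)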

Phi-2 Limitations

Below is a summary of limitations of Phi-2, as reported by the authors:

Content Generation

  • Inaccuracy: Similar to other models, Phi-2 may generate inaccurate code and statements
  • Instruction Following: Phi-2 is not instruction-tuned, so it may struggle to follow detailed instructions
  • Verbosity: Tends to generate verbose responses, sometimes producing irrelevant extra text

Language and Comprehension

  • Language Support: Training consists of standard English; may struggle with slang and fail to comprehend instructions from other languages
  • Training Data: Primarily textbooks, which may contribute to verbose responses

Safety and Bias

  • Societal Biases: May produce societal biases and toxic content
  • Safety: While improved over previous versions, still has limitations

Note: The authors suggest that verbosity is probably due to the nature of the training dataset, which is primarily textbooks.

Figure Sources: Microsoft Research

Key Takeaways

  1. Small but Powerful: 2.7B parameters outperforming models up to 25x larger
  2. Textbook Quality Training: 1.4T tokens of high-quality educational content
  3. Strong Reasoning: Excels at common sense reasoning and mathematics
  4. Safety Improvements: Better toxicity and bias performance than comparable models
  5. Multiple Formats: Supports QA, chat, and code generation
  6. Commercial Ready: MIT License for commercial use
  7. Limitations: Limited code generation, instruction following challenges, potential verbosity

References