Gemma
Overview
Google DeepMind releases Gemma, a series of open language models inspired by the same research and technology used to create Gemini. The Gemma model release includes:
- 2B model: Trained on 2T tokens
- 7B model: Trained on 6T tokens
- Base and instruction-tuned checkpoints
- Context length: 8192 tokens
- Performance: Generally outperforms Llama 2 7B and Mistral 7B on several benchmarks
Architecture
The Gemma model architecture is based on the transformer decoder with improvements including:
- Multi-query attention (used by the 2B model; see the sketch after this list)
- Multi-head attention (used by the 7B model)
- RoPE embeddings
- GeGLU activations
- Normalizer location: RMSNorm applied to both the input and the output of each transformer sub-layer
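To make the attention distinction concrete, here is a minimal PyTorch sketch of multi-query attention, where a single key/value head is shared across all query heads (the mechanism used by the 2B model). This is illustrative only, not Gemma's actual implementation; the function name, shapes, and weights below are our own assumptions.

```python
import torch
import torch.nn.functional as F

def multi_query_attention(x, w_q, w_k, w_v, num_heads):
    """Illustrative multi-query attention: all query heads share one K/V head."""
    b, s, d = x.shape
    head_dim = d // num_heads
    # Queries keep num_heads heads; keys/values are projected to a single head.
    q = (x @ w_q).view(b, s, num_heads, head_dim).transpose(1, 2)  # (b, h, s, hd)
    k = (x @ w_k).view(b, s, 1, head_dim).transpose(1, 2)          # (b, 1, s, hd)
    v = (x @ w_v).view(b, s, 1, head_dim).transpose(1, 2)          # (b, 1, s, hd)
    # The single K/V head is broadcast across all query heads.
    scores = (q @ k.transpose(-2, -1)) / head_dim ** 0.5           # (b, h, s, s)
    attn = F.softmax(scores, dim=-1)
    out = (attn @ v).transpose(1, 2).reshape(b, s, d)              # (b, s, d)
    return out

# Toy usage: d_model=64, 8 query heads sharing one 8-dim K/V head.
x = torch.randn(2, 16, 64)
out = multi_query_attention(
    x,
    w_q=torch.randn(64, 64),
    w_k=torch.randn(64, 8),
    w_v=torch.randn(64, 8),
    num_heads=8,
)
print(out.shape)  # torch.Size([2, 16, 64])
```

In multi-head attention, by contrast, each query head has its own key/value projection; sharing a single K/V head mainly reduces memory traffic during decoding.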
Training Data
- Gemma 2B: Trained on 2T tokens
- Gemma 7B: Trained on 6T tokens
- Content: Web documents, mathematics, and code
- Language: English only (no explicit multilingual training)
- Multimodal: No explicit multimodal capabilities
Technical Specifications
- Vocabulary size: 256K tokens
- Tokenization: Subset of SentencePiece tokenizer from Gemini
- Features: Preserves whitespace, splits digits, byte-level encodings for unknown tokens
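As a quick way to inspect these tokenizer properties, the sketch below loads the tokenizer from the Hugging Face Hub. It assumes the google/gemma-7b-it checkpoint and that you have accepted the model's access terms; the exact token strings it prints may differ from what the comments suggest.

```python
from transformers import AutoTokenizer

# Assumes access to the google/gemma-7b-it checkpoint on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

print(tokenizer.vocab_size)                   # vocabulary size on the order of 256K
print(tokenizer.tokenize("def add(a, b):"))   # whitespace inside code is preserved
print(tokenizer.tokenize("In 2024"))          # digits are split into individual tokens
```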
Instruction Tuning
The instruction-tuned models are tuned using:
- Supervised fine-tuning on a mix of text-only synthetic and human-generated prompt-response pairs
- Reinforcement learning from human feedback (RLHF) with:
  - A reward model trained on labeled preference data
  - A policy based on a set of high-quality prompts
Note: All datasets used are English only.
Control Tokens
The instruction-tuned models use specific formatting control tokens to indicate roles and turns in a conversation (see the Control Tokens Reference table in the Prompt Format section below).
Results
Performance Capabilities
The Gemma 7B model demonstrates strong performance on math, science, and code-related tasks. The reported scores correspond to the average scores on academic benchmark evaluations grouped by capability.

Benchmark Comparison
Gemma 7B outperforms Llama 2 7B and Mistral 7B on various academic benchmarks with notable performance on:
- HumanEval
- GSM8K
- MATH
- AGIEval
It also shows improved performance on:
- Reasoning
- Dialogue
- Mathematics
- Code
Safety Evaluation
The Gemma 7B instruction-tuned models also outperform the Mistral-7B v0.2 Instruct model on safety and instruction following as evaluated by humans.

Gemma is also evaluated on several academic safety benchmarks and compared with Mistral. The technical report mentions the use of:
- Debiasing techniques
- Red-teaming to help mitigate common risks associated with LLMs
Resources: You can find more information on how to responsibly develop with Gemma in the model card and Responsible Generative AI toolkit.
Prompt Format
Base Models
The Gemma base models don't use any specific prompt format but can be prompted to perform tasks through zero-shot/few-shot prompting.
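For example, a hypothetical few-shot prompt for the base model could look like the following, where the model is expected to continue the pattern (no control tokens needed):
Translate English to French:
sea otter => loutre de mer
cheese => fromage
plush giraffe =>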
Instruction Model Format
The Gemma Instruct model uses the following format:
<start_of_turn>user
Generate a Python function that multiplies two numbers <end_of_turn>
<start_of_turn>model
Control Tokens Reference
| Context | Relevant Token |
|---|---|
| User turn | user |
| Model turn | model |
| Start of conversation turn | <start_of_turn> |
| End of conversation turn | <end_of_turn> |
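Putting these control tokens together, here is a minimal Python sketch of a helper that assembles a Gemma-style prompt from a list of turns. The function name and message format are our own, not part of any official API; in practice you can also rely on a library's chat template instead of hand-building the string.

```python
def build_gemma_prompt(messages):
    """Assemble a Gemma instruct prompt from [{'role': 'user' | 'model', 'content': ...}] turns."""
    prompt = ""
    for message in messages:
        prompt += f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
    # Finish with an open model turn so the model knows it should respond next.
    prompt += "<start_of_turn>model\n"
    return prompt

print(build_gemma_prompt([
    {"role": "user", "content": "What is a good place for travel in the US?"},
    {"role": "model", "content": "California."},
    {"role": "user", "content": "What can I do in California?"},
]))
```

The printed string matches the multi-turn example shown in the next section.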
Multi-turn Example
You can also use the special control tokens in the context of a multi-turn user prompt:
<start_of_turn>user
What is a good place for travel in the US?<end_of_turn>
<start_of_turn>model
California.<end_of_turn>
<start_of_turn>user
What can I do in California?<end_of_turn>
<start_of_turn>model
How to Prompt Gemma 7B
Prompting Gemma 7B effectively requires using the prompt template properly. The following examples demonstrate how to use the Gemma 7B Instruct prompt template for various tasks.
Zero-shot Prompting
As with any model, you can leverage Gemma's zero-shot capabilities:
<start_of_turn>user
Explain why the sky is blue<end_of_turn>
<start_of_turn>model
Zero-shot Prompting with System Prompt
Adding a system role or system prompt helps to steer LLMs better. While there is no explicit system role in Gemma, you can add additional instructions:
<start_of_turn>user
Answer the following question in a concise and informative manner:
Explain why the sky is blue<end_of_turn>
<start_of_turn>model
Note: We added "Answer the following question in a concise and informative manner:" as an additional instruction to steer the model better.
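If you run the instruction-tuned checkpoint through the Hugging Face transformers library, its chat template applies this format for you. The sketch below is illustrative and assumes the google/gemma-7b-it checkpoint, access to the weights, and a GPU with enough memory; since there is no explicit system role, the steering instruction is folded into the user turn.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-7b-it"  # assumes you have access to this checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# No explicit system role: prepend the steering instruction to the user turn.
messages = [
    {
        "role": "user",
        "content": "Answer the following question in a concise and informative manner:\n"
                   "Explain why the sky is blue",
    }
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```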
Role Playing
We can test Gemma for role-playing capabilities, useful for personalizing model responses and building complex domain-specific LLM agents.
Prompt:
<start_of_turn>user
You are a helpful 2nd-grade teacher. Help a 2nd grader to answer questions in a short and clear manner.
Explain why the sky is blue<end_of_turn>
<start_of_turn>model
Model Response:
Sure, here is an explanation for why the sky is blue:
The sky appears blue because of a phenomenon called Rayleigh Scattering.
Here's the science behind it:
- Sunlight is made up of different colors of light, like red, green, blue, and violet.
- When sunlight hits the Earth's atmosphere, the different colors of light scatter in different directions.
- Blue light scatters more easily than other colors because it has shorter wavelengths.
- The scattered blue light travels in all directions, including towards your eyes.
- That's why we see the sky as blue.
It's important to note that this explanation is simplified and does not account for all scientific details. However, it gives you a good understanding of the basic science behind why the sky appears blue.
Note: We haven't checked the response for accuracy or suitability for a 2nd grader. You can compare this with the chain-of-thought prompt response that follows.
Reasoning
You can elicit Gemma's reasoning capabilities using zero-shot chain-of-thought prompting:
Prompt:
<start_of_turn>user
Think and write your step-by-step reasoning before responding.
Explain why the sky is blue.<end_of_turn>
<start_of_turn>model
Model Response (truncated for brevity):
Step 1: Scattering of Light The sky appears blue due to the phenomenon of scattering of light. When sunlight hits the Earth's atmosphere, it interacts with the particles of air, primarily nitrogen and oxygen molecules.
Step 2: Rayleigh Scattering Rayleigh scattering is the specific scattering mechanism that occurs when light interacts with particles much smaller than the wavelength of light. In this process, shorter wavelengths (blue light) are scattered more effectively than longer wavelengths (red light).
Step 3: Blue Scattered Light Travels in All Directions ...
Conclusion The sky appears blue due to the scattering of light, specifically Rayleigh scattering. Shorter wavelengths of light are scattered more effectively, and this scattered light travels in all directions, with the scattered light that travels opposite to the sun's rays being most visible to our eyes.
Notes:
- Gemma can output markdown format automatically
- We've lightly edited the text formatting for demonstration and simplicity, but the content is exactly what the model responded with
- We haven't assessed the response for accuracy or potential hallucination
Resources and Integrations
Here are several resources and integrations that were part of the Gemma release:
- Colab and Kaggle notebooks
- Hugging Face models
- MaxText
- NVIDIA NeMo
- TensorRT-LLM
- NVIDIA AI Playground (Gemma 7B available)
Commercial Usage
According to the official blog release, the Terms of Use permit responsible commercial usage and distribution for all organizations, regardless of size.
Key Takeaways
- Open Models: Open-weight language models based on Gemini research and technology
- Strong Performance: Outperforms comparable models on academic benchmarks
- Safety Focus: Comprehensive safety evaluation and debiasing techniques
- Flexible Prompting: Support for zero-shot, few-shot, and role-playing
- Reasoning Capabilities: Chain-of-thought prompting for complex reasoning
- Commercial Ready: Permits responsible commercial usage
- Multiple Integrations: Available across major AI platforms
References
- Gemma: Introducing new state-of-the-art open models
- Gemma: Open Models Based on Gemini Research and Technology
- Responsible Generative AI Toolkit
- Fast Transformer Decoding: One Write-Head is All You Need
- Roformer: Enhanced transformer with rotary position embedding
- GLU variants improve transformer
- Root mean square layer normalization
