
Gemini 1.5 Pro

Overview

Google introduces Gemini 1.5 Pro, a compute-efficient multimodal mixture-of-experts model. The model focuses on capabilities such as recalling and reasoning over long-form content, and can reason over documents potentially containing millions of tokens, including hours of video and audio.

Key Capabilities

  • Long-form content processing: Millions of tokens, hours of video and audio
  • State-of-the-art performance in long-document QA, long-video QA, and long-context ASR
  • Matches or outperforms Gemini 1.0 Ultra across standard benchmarks
  • Near-perfect retrieval (>99%) up to at least 10 million tokens
  • 1 million token context window available in Google AI Studio

Context Window Comparison

  • 200K tokens: previously the largest context window of any commercially available LLM
  • 1 million tokens: New experimental capability in Google AI Studio
  • 10 million tokens: Maximum demonstrated capability

Architecture

Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer based model built on Gemini 1.0's multimodal capabilities.

Key Features

  • MoE Architecture: Total parameters can grow while keeping activated parameters constant
  • Efficient Training: Significantly less training compute required
  • Efficient Serving: More efficient to serve than previous models
  • Long-context Understanding: Architecture changes enable processing up to 10 million tokens
  • Multimodal Pre-training: Trained on different modalities with instruction tuning
  • Human Preference Tuning: Further refined based on human preference data
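The first point above can be made concrete with a toy top-k routing sketch (not Gemini's actual architecture, whose details are unpublished): a router scores all experts but activates only a few per token, so adding experts grows total parameters while per-token compute stays constant.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class SparseMoELayer:
    """Toy sparse mixture-of-experts layer. A router scores every expert,
    but only the top-k experts run per token, so total parameters grow
    with num_experts while activated parameters stay fixed."""

    def __init__(self, dim, num_experts, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one weight vector per expert (dim -> num_experts scores).
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert: a dim x dim weight matrix (stand-in for an FFN).
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        # Score all experts, but keep only the top-k for this token.
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.router]
        top = sorted(range(len(scores)), key=lambda i: scores[i],
                     reverse=True)[: self.top_k]
        gates = softmax([scores[i] for i in top])
        # Output is the gate-weighted sum of the selected experts only.
        out = [0.0] * len(x)
        for gate, idx in zip(gates, top):
            expert = self.experts[idx]
            for j in range(len(x)):
                out[j] += gate * sum(expert[j][k] * x[k]
                                     for k in range(len(x)))
        return out, top
```

Doubling `num_experts` here doubles stored weights, but each token still touches only `top_k` experts, which is the efficiency property the bullet list describes.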

Results

Retrieval Performance

Gemini 1.5 Pro achieves near-perfect "needle" recall (>99%) up to 1 million tokens across all modalities (text, video, and audio).
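Needle-in-a-haystack evaluations of this kind follow a simple recipe: insert a distinctive fact at varying depths of a long context and check whether the model surfaces it. A minimal harness might look like this, where `model` is a placeholder for any prompt-to-answer callable:

```python
def needle_in_haystack_eval(model, haystack_tokens, needle, positions, question):
    """Insert a 'needle' sentence at several depths of a long context and
    record whether the model's answer retrieves it. `model` is any
    callable mapping a prompt string to an answer string (a stand-in
    for a real long-context model API)."""
    results = {}
    for pos in positions:
        # Splice the needle into the filler text at the given depth.
        context = haystack_tokens[:pos] + [needle] + haystack_tokens[pos:]
        prompt = " ".join(context) + "\n\n" + question
        answer = model(prompt)
        # Crude scoring: did the needle text appear in the answer?
        results[pos] = needle in answer
    return results
```

Real evaluations score answers more carefully (and sweep context length as well as depth), but the structure is the same.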

Context Window Capabilities

Gemini 1.5 Pro maintains recall performance when the context is extended to:

  • ~22 hours of audio recordings
  • Ten 1,440-page books
  • Entire codebases
  • 3 hours of video at 1 fps
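These capacities can be sanity-checked with rough per-modality token rates. The rates below are assumptions for illustration (in the ballpark of published Gemini API guidance), not figures from the report:

```python
# Assumed per-modality token rates (approximate, for illustration only).
TOKENS_PER_VIDEO_FRAME = 258   # video sampled at 1 frame per second
TOKENS_PER_AUDIO_SECOND = 32
TOKENS_PER_WORD = 1.3          # rough average for English text

def video_tokens(hours, fps=1):
    return int(hours * 3600 * fps * TOKENS_PER_VIDEO_FRAME)

def audio_tokens(hours):
    return int(hours * 3600 * TOKENS_PER_AUDIO_SECOND)

def text_tokens(words):
    return int(words * TOKENS_PER_WORD)

# 3 hours of video at 1 fps fits comfortably under 10M tokens:
print(video_tokens(3))               # ~2.8M tokens
# ~22 hours of audio recordings:
print(audio_tokens(22))              # ~2.5M tokens
# Ten 1,440-page books at ~500 words/page:
print(text_tokens(10 * 1440 * 500))  # ~9.4M tokens
```

Under these assumed rates, each item in the list above lands within the 10M-token regime the report demonstrates.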

Figure: Gemini 1.5 Pro retrieval results

Benchmark Performance

Gemini 1.5 Pro surpasses Gemini 1.0 Pro on the majority of benchmarks, with notable gains in:

  • Math
  • Science
  • Reasoning
  • Multilinguality
  • Video Understanding
  • Code

Figure: Gemini 1.5 Pro benchmark results

Note: Gemini 1.5 Pro also outperforms Gemini 1.0 Ultra on half of the benchmarks despite using significantly less training compute.

Capabilities

Long Document Analysis

Basic Question Answering

To demonstrate Gemini 1.5 Pro's ability to process and analyze documents, we start with a basic question answering task. Since the Gemini 1.5 Pro model in Google AI Studio supports up to 1 million tokens, we can upload entire PDFs.

Example: Upload a PDF and ask "What is the paper about?"

Figure: Gemini 1.5 Pro document analysis

The model's response is accurate and concise, providing an acceptable summary of the Galactica paper.

Chat Format Interaction

You can also use the chat format to interact with an uploaded PDF. This is useful if you have many questions about the provided document(s).

Figure: Gemini 1.5 Pro chat interaction

Cross-Document Analysis

To leverage the long context window, let's now upload two PDFs and ask a question that spans both PDFs.

Figure: Gemini 1.5 Pro cross-document analysis

Key Insights:

  • Information extracted from the first paper (a survey on LLMs) comes from a table
  • The "Architecture" information looks correct
  • The "Performance" section doesn't belong in the answer, as it is not found in the first paper

Best Practices:

  • Put the prompt at the top
  • Label papers with tags (Paper 1, Paper 2)
  • Be specific about which paper to reference
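A hypothetical helper applying these practices, with the question placed first and each paper wrapped in explicit tags (the tag format is an illustrative choice, not a prescribed one):

```python
def build_multidoc_prompt(question, papers):
    """Assemble a cross-document prompt: the question at the top, then
    each document wrapped in explicit [Paper N] tags so the question
    can reference a specific source. `papers` is a list of document
    texts (e.g., extracted PDF contents)."""
    parts = [question, ""]
    for i, text in enumerate(papers, start=1):
        parts.append(f"[Paper {i}]")
        parts.append(text)
        parts.append(f"[End of Paper {i}]")
        parts.append("")
    return "\n".join(parts)

prompt = build_multidoc_prompt(
    "Using the table in Paper 1, describe the model analyzed in Paper 2.",
    ["<text of the LLM survey>", "<text of the second paper>"],
)
```

With the labels in place, instructions like "the table in Paper 1" become unambiguous for the model.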

Video Understanding

Gemini 1.5 Pro is trained with multimodal capabilities from the ground up and demonstrates strong video understanding capabilities.

Figure: Video understanding example

Lecture Summarization

We tested with Andrej Karpathy's recent lecture on LLMs.

Task 1: "What is the lecture about?"

  • Response accurately summarizes the lecture

Task 2: "Provide a concise outline (one-page length) of the lecture"

Response (edited for brevity):

Large Language Models: A New Operating System?

Introduction

  • LLMs are trained on massive amounts of text data and can generate human-quality text
  • They are becoming increasingly powerful and are being used for a wide range of tasks

How LLMs Work

  • LLMs are neural networks that predict the next word in a sequence
  • They learn by iteratively adjusting their parameters to make better predictions
  • While we can measure their performance, we don't fully understand how the billions of parameters collaborate

Training LLMs

  • Requires massive amounts of data and computing power
  • Can be thought of as "compressing the internet" into a single file
  • For example, training LLaMA 2 70B required 6,000 GPUs for 12 days and cost ~$2 million

LLM Security

  • LLM security is a new and rapidly evolving field
  • Key security risks include jailbreaking, prompt injection, and data poisoning

Specific Detail Extraction

Example: "What are the FLOPs reported for Llama 2 in the lecture?"

Response: "The lecture reports that training Llama 2 70B required approximately 1 trillion FLOPs."

Note: This is not accurate; the correct answer is ~1e24 FLOPs. The technical report documents many instances where these long-context models fail when asked specific questions about a video.
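The ~1e24 figure is easy to check with the standard back-of-envelope approximation of ~6 FLOPs per parameter per training token (the exact token count for Llama 2 is ~2T, per its paper):

```python
def training_flops(params, tokens):
    # Standard approximation: ~6 FLOPs per parameter per training token
    # (covering the forward and backward passes).
    return 6 * params * tokens

# Llama 2 70B: ~70B parameters trained on ~2T tokens.
flops = training_flops(70e9, 2e12)
print(f"{flops:.1e}")  # 8.4e+23, i.e. on the order of 1e24 FLOPs
```

So "~1 trillion FLOPs" is off by roughly twelve orders of magnitude, which is why the model's answer fails the check.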

Table Information Extraction

The model can extract table information from videos, though with some inconsistencies:

  • Table columns are generally correct
  • Row labels may have errors (e.g., "Concept Resolution" should be "Coref Resolution")
  • Similar inconsistencies observed across different extraction tasks

Timestamp and Scene Retrieval

Example 1: "At what timestamp does the LLM OS section start?" Response: "The LLM OS section starts at 42:17." ✓ Correct

Example 2: "Can you explain the chart (on the right-hand side) on the self-improvement slide?"

Response: The model provides a detailed explanation of the AlphaGo Zero performance chart, making good use of the visual information provided.

Figure: AlphaGo Zero performance chart

Code Reasoning

With its long-context reasoning, Gemini 1.5 Pro can answer questions about entire codebases. Using Google AI Studio, you can upload an entire codebase and prompt it with different questions or code-related tasks.

Example: The technical report shows the model given the entire JAX codebase (~746K tokens) and asked to identify the location of a core automatic differentiation method.

Figure: Gemini 1.5 Pro on the JAX codebase
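Preparing a codebase for such a prompt amounts to flattening its files into one labeled context. A hypothetical sketch (the file-extension filter, header format, and character budget are all illustrative choices, not anything the report prescribes):

```python
import os

def codebase_to_prompt(root, question, exts=(".py",), max_chars=3_000_000):
    """Flatten a repository into one long prompt, labeling each file with
    its relative path so the model can cite locations in its answer.
    max_chars is a crude context budget (roughly 4 characters per token
    for English text and code)."""
    chunks = [question, ""]
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            block = f"### File: {os.path.relpath(path, root)}\n{text}\n"
            # Stop adding files once the budget is exhausted.
            if total + len(block) > max_chars:
                return "\n".join(chunks)
            chunks.append(block)
            total += len(block)
    return "\n".join(chunks)
```

The resulting string can then be pasted (or uploaded) into Google AI Studio along with a question like the one the report uses about automatic differentiation.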

English to Kalamang Translation

Given a grammar manual for Kalamang (500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences), a language spoken by fewer than 200 people worldwide, Gemini 1.5 Pro translates English to Kalamang at a level comparable to a person learning from the same content.

This showcases the in-context learning abilities of Gemini 1.5 Pro enabled through long context.

Figure: Gemini 1.5 Pro multilinguality
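Structurally, this setup is prompt assembly: all reference materials in context, followed by the sentence to translate. The sketch below is a hypothetical illustration of that structure (the section labels and formatting are invented, not the report's actual prompt):

```python
def translation_prompt(grammar_notes, dictionary, parallel_pairs, source_sentence):
    """Assemble an in-context-learning prompt from reference materials:
    grammar notes, a word list, and parallel example sentences, ending
    with the sentence to translate. All inputs are placeholders."""
    lines = ["Translate English to Kalamang using only the materials below.", ""]
    lines += ["## Grammar notes", grammar_notes, ""]
    lines += ["## Dictionary", ""]
    for english, kalamang in dictionary.items():
        lines.append(f"{english} -> {kalamang}")
    lines += ["", "## Parallel sentences", ""]
    for english, kalamang in parallel_pairs:
        lines.append(f"English: {english}")
        lines.append(f"Kalamang: {kalamang}")
    # Leave the final line open for the model to complete.
    lines += ["", f"English: {source_sentence}", "Kalamang:"]
    return "\n".join(lines)
```

With a 500-page manual in context, the entire prompt still fits well inside a 1M-token window, which is what makes this kind of from-scratch in-context learning possible.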

Key Takeaways

  1. Revolutionary Context Window: 1M-10M token capacity unlocks new use cases
  2. Multimodal Excellence: Strong performance across text, video, audio, and code
  3. Efficient Architecture: MoE design provides better performance with less compute
  4. Long-form Understanding: Can process entire books, codebases, and hours of media
  5. Cross-document Reasoning: Ability to analyze relationships between multiple sources
  6. Video Intelligence: Sophisticated understanding of visual content and temporal information
