
Gemini 1.5 Pro

Overview

Google introduces Gemini 1.5 Pro, a compute-efficient multimodal mixture-of-experts model. The model focuses on capabilities such as recalling and reasoning over long-form content, and can reason over documents potentially containing millions of tokens, including hours of video and audio.

Key Capabilities

  • Long-form content processing: Millions of tokens, hours of video and audio
  • State-of-the-art performance in long-document QA, long-video QA, and long-context ASR
  • Matches or outperforms Gemini 1.0 Ultra across standard benchmarks
  • Near-perfect retrieval (>99%) up to at least 10 million tokens
  • 1 million token context window available in Google AI Studio

Context Window Comparison

  • 200K tokens: previously the largest context window of any commercially available LLM
  • 1 million tokens: New experimental capability in Google AI Studio
  • 10 million tokens: Maximum demonstrated capability

Architecture

Gemini 1.5 Pro is a sparse mixture-of-experts (MoE) Transformer based model built on Gemini 1.0's multimodal capabilities.

Key Features

  • MoE Architecture: Total parameters can grow while keeping activated parameters constant
  • Efficient Training: Significantly less training compute required
  • Efficient Serving: More efficient to serve than previous models
  • Long-context Understanding: Architecture changes enable processing up to 10 million tokens
  • Multimodal Pre-training: Trained on different modalities with instruction tuning
  • Human Preference Tuning: Further refined based on human preference data
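The first point above can be made concrete with a toy top-k routing sketch (not Gemini's actual architecture, whose details are unpublished): a router scores all experts but activates only a few per token, so adding experts grows total parameters while per-token compute stays constant.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

class SparseMoELayer:
    """Toy sparse mixture-of-experts layer. A router scores every expert,
    but only the top-k experts run per token, so total parameters grow
    with num_experts while activated parameters stay fixed."""

    def __init__(self, dim, num_experts, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one weight vector per expert (dim -> num_experts scores).
        self.router = [[rng.gauss(0, 0.1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Each expert: a dim x dim weight matrix (stand-in for an FFN).
        self.experts = [[[rng.gauss(0, 0.1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        # Score all experts, but keep only the top-k for this token.
        scores = [sum(w * xi for w, xi in zip(row, x)) for row in self.router]
        top = sorted(range(len(scores)), key=lambda i: scores[i],
                     reverse=True)[: self.top_k]
        gates = softmax([scores[i] for i in top])
        # Output is the gate-weighted sum of the selected experts only.
        out = [0.0] * len(x)
        for gate, idx in zip(gates, top):
            expert = self.experts[idx]
            for j in range(len(x)):
                out[j] += gate * sum(expert[j][k] * x[k]
                                     for k in range(len(x)))
        return out, top
```

Doubling `num_experts` here doubles stored weights, but each token still touches only `top_k` experts, which is the efficiency property the bullet list describes.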

Results

Retrieval Performance

Gemini 1.5 Pro achieves near-perfect "needle" recall (>99%) up to 1 million tokens across all modalities (text, video, and audio).
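Needle-in-a-haystack evaluations of this kind follow a simple recipe: insert a distinctive fact at varying depths of a long context and check whether the model surfaces it. A minimal harness might look like this, where `model` is a placeholder for any prompt-to-answer callable:

```python
def needle_in_haystack_eval(model, haystack_tokens, needle, positions, question):
    """Insert a 'needle' sentence at several depths of a long context and
    record whether the model's answer retrieves it. `model` is any
    callable mapping a prompt string to an answer string (a stand-in
    for a real long-context model API)."""
    results = {}
    for pos in positions:
        # Splice the needle into the filler text at the given depth.
        context = haystack_tokens[:pos] + [needle] + haystack_tokens[pos:]
        prompt = " ".join(context) + "\n\n" + question
        answer = model(prompt)
        # Crude scoring: did the needle text appear in the answer?
        results[pos] = needle in answer
    return results
```

Real evaluations score answers more carefully (and sweep context length as well as depth), but the structure is the same.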

Context Window Capabilities

Gemini 1.5 Pro maintains recall performance when the context is extended to:

  • ~22 hours of audio recordings
  • Ten 1,440-page books
  • Entire codebases
  • 3 hours of video at 1 fps
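These capacities can be sanity-checked with rough per-modality token rates. The rates below are assumptions for illustration (in the ballpark of published Gemini API guidance), not figures from the report:

```python
# Assumed per-modality token rates (approximate, for illustration only).
TOKENS_PER_VIDEO_FRAME = 258   # video sampled at 1 frame per second
TOKENS_PER_AUDIO_SECOND = 32
TOKENS_PER_WORD = 1.3          # rough average for English text

def video_tokens(hours, fps=1):
    return int(hours * 3600 * fps * TOKENS_PER_VIDEO_FRAME)

def audio_tokens(hours):
    return int(hours * 3600 * TOKENS_PER_AUDIO_SECOND)

def text_tokens(words):
    return int(words * TOKENS_PER_WORD)

# 3 hours of video at 1 fps fits comfortably under 10M tokens:
print(video_tokens(3))               # ~2.8M tokens
# ~22 hours of audio recordings:
print(audio_tokens(22))              # ~2.5M tokens
# Ten 1,440-page books at ~500 words/page:
print(text_tokens(10 * 1440 * 500))  # ~9.4M tokens
```

Under these assumed rates, each item in the list above lands within the 10M-token regime the report demonstrates.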

Figure: Gemini 1.5 Pro retrieval results

Benchmark Performance

Gemini 1.5 Pro surpasses Gemini 1.0 Pro on the majority of benchmarks, with notable gains in:

  • Math
  • Science
  • Reasoning
  • Multilinguality
  • Video Understanding
  • Code

Figure: Gemini 1.5 Pro benchmark results

Note: Gemini 1.5 Pro also outperforms Gemini 1.0 Ultra on half of the benchmarks despite using significantly less training compute.

Capabilities

Long Document Analysis

Basic Question Answering

To demonstrate Gemini 1.5 Pro's ability to process and analyze documents, we start with a basic question answering task. Since the Gemini 1.5 Pro model in Google AI Studio supports up to 1 million tokens, we can upload entire PDFs.

Example: Upload a PDF and ask "What is the paper about?"

Figure: Gemini 1.5 Pro document analysis

The model's response is accurate and concise, providing an acceptable summary of the Galactica paper.

Chat Format Interaction

You can also use the chat format to interact with an uploaded PDF. This is useful if you have many questions about the provided document(s).

Figure: Gemini 1.5 Pro chat interaction

Cross-Document Analysis

To leverage the long context window, let's now upload two PDFs and ask a question that spans both PDFs.

Figure: Gemini 1.5 Pro cross-document analysis

Key Insights:

  • Information extracted from the first paper (a survey on LLMs) comes from a table
  • The "Architecture" information looks correct
  • The "Performance" section doesn't belong in the answer, as it is not found in the first paper

Best Practices:

  • Put the prompt at the top
  • Label papers with tags (Paper 1, Paper 2)
  • Be specific about which paper to reference
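A hypothetical helper applying these practices, with the question placed first and each paper wrapped in explicit tags (the tag format is an illustrative choice, not a prescribed one):

```python
def build_multidoc_prompt(question, papers):
    """Assemble a cross-document prompt: the question at the top, then
    each document wrapped in explicit [Paper N] tags so the question
    can reference a specific source. `papers` is a list of document
    texts (e.g., extracted PDF contents)."""
    parts = [question, ""]
    for i, text in enumerate(papers, start=1):
        parts.append(f"[Paper {i}]")
        parts.append(text)
        parts.append(f"[End of Paper {i}]")
        parts.append("")
    return "\n".join(parts)

prompt = build_multidoc_prompt(
    "Using the table in Paper 1, describe the model analyzed in Paper 2.",
    ["<text of the LLM survey>", "<text of the second paper>"],
)
```

With the labels in place, instructions like "the table in Paper 1" become unambiguous for the model.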

Video Understanding

Gemini 1.5 Pro is trained with multimodal capabilities from the ground up and demonstrates strong video understanding capabilities.

Figure: Video understanding example

Lecture Summarization

We tested with Andrej Karpathy's recent lecture on LLMs.

Task 1: "What is the lecture about?"

  • Response accurately summarizes the lecture

Task 2: "Provide a concise outline (one-page length) of the lecture"

Response (edited for brevity):

Large Language Models: A New Operating System?

Introduction

  • LLMs are trained on massive amounts of text data and can generate human-quality text
  • They are becoming increasingly powerful and are being used for a wide range of tasks

How LLMs Work

  • LLMs are neural networks that predict the next word in a sequence
  • They learn by iteratively adjusting their parameters to make better predictions
  • While we can measure their performance, we don't fully understand how the billions of parameters collaborate

Training LLMs

  • Requires massive amounts of data and computing power
  • Can be thought of as "compressing the internet" into a single file
  • For example, training LLaMA 2 70B required 6,000 GPUs for 12 days and cost ~$2 million

LLM Security

  • LLM security is a new and rapidly evolving field
  • Key security risks include jailbreaking, prompt injection, and data poisoning

Specific Detail Extraction

Example: "What are the FLOPs reported for Llama 2 in the lecture?"

Response: "The lecture reports that training Llama 2 70B required approximately 1 trillion FLOPs."

Note: This is not accurate; the correct answer is ~1e24 FLOPs. The technical report documents many instances where these long-context models fail when asked specific questions about a video.
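The ~1e24 figure is easy to check with the standard back-of-envelope approximation of ~6 FLOPs per parameter per training token (the exact token count for Llama 2 is ~2T, per its paper):

```python
def training_flops(params, tokens):
    # Standard approximation: ~6 FLOPs per parameter per training token
    # (covering the forward and backward passes).
    return 6 * params * tokens

# Llama 2 70B: ~70B parameters trained on ~2T tokens.
flops = training_flops(70e9, 2e12)
print(f"{flops:.1e}")  # 8.4e+23, i.e. on the order of 1e24 FLOPs
```

So "~1 trillion FLOPs" is off by roughly twelve orders of magnitude, which is why the model's answer fails the check.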

Table Information Extraction

The model can extract table information from videos, though with some inconsistencies:

  • Table columns are generally correct
  • Row labels may have errors (e.g., "Concept Resolution" should be "Coref Resolution")
  • Similar inconsistencies observed across different extraction tasks

Timestamp and Scene Retrieval

Example 1: "At what timestamp does the LLM OS section start?" Response: "The LLM OS section starts at 42:17." ✓ Correct

Example 2: "Can you explain the chart (on the right-hand side) on the self-improvement slide?"

Response: The model provides a detailed explanation of the AlphaGo Zero performance chart, making good use of the visual information provided.

Figure: AlphaGo Zero performance chart

Code Reasoning

With its long-context reasoning, Gemini 1.5 Pro can answer questions about entire codebases. Using Google AI Studio, you can upload an entire codebase and prompt it with different questions or code-related tasks.

Example: The technical report shows the model given the entire JAX codebase (~746K tokens) and asked to identify the location of a core automatic differentiation method.

Figure: Gemini 1.5 Pro on the JAX codebase
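Preparing a codebase for such a prompt amounts to flattening its files into one labeled context. A hypothetical sketch (the file-extension filter, header format, and character budget are all illustrative choices, not anything the report prescribes):

```python
import os

def codebase_to_prompt(root, question, exts=(".py",), max_chars=3_000_000):
    """Flatten a repository into one long prompt, labeling each file with
    its relative path so the model can cite locations in its answer.
    max_chars is a crude context budget (roughly 4 characters per token
    for English text and code)."""
    chunks = [question, ""]
    total = 0
    for dirpath, _, filenames in os.walk(root):
        for name in sorted(filenames):
            if not name.endswith(exts):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                text = f.read()
            block = f"### File: {os.path.relpath(path, root)}\n{text}\n"
            # Stop adding files once the budget is exhausted.
            if total + len(block) > max_chars:
                return "\n".join(chunks)
            chunks.append(block)
            total += len(block)
    return "\n".join(chunks)
```

The resulting string can then be pasted (or uploaded) into Google AI Studio along with a question like the one the report uses about automatic differentiation.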

English to Kalamang Translation

Given a grammar manual for Kalamang (500 pages of linguistic documentation, a dictionary, and ~400 parallel sentences), a language spoken by fewer than 200 people worldwide, Gemini 1.5 Pro translates English to Kalamang at a level comparable to a person learning from the same content.

This showcases the in-context learning abilities of Gemini 1.5 Pro enabled through long context.

Figure: Gemini 1.5 Pro multilinguality
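Structurally, this setup is prompt assembly: all reference materials in context, followed by the sentence to translate. The sketch below is a hypothetical illustration of that structure (the section labels and formatting are invented, not the report's actual prompt):

```python
def translation_prompt(grammar_notes, dictionary, parallel_pairs, source_sentence):
    """Assemble an in-context-learning prompt from reference materials:
    grammar notes, a word list, and parallel example sentences, ending
    with the sentence to translate. All inputs are placeholders."""
    lines = ["Translate English to Kalamang using only the materials below.", ""]
    lines += ["## Grammar notes", grammar_notes, ""]
    lines += ["## Dictionary", ""]
    for english, kalamang in dictionary.items():
        lines.append(f"{english} -> {kalamang}")
    lines += ["", "## Parallel sentences", ""]
    for english, kalamang in parallel_pairs:
        lines.append(f"English: {english}")
        lines.append(f"Kalamang: {kalamang}")
    # Leave the final line open for the model to complete.
    lines += ["", f"English: {source_sentence}", "Kalamang:"]
    return "\n".join(lines)
```

With a 500-page manual in context, the entire prompt still fits well inside a 1M-token window, which is what makes this kind of from-scratch in-context learning possible.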

Key Takeaways

  1. Revolutionary Context Window: 1M-10M token capacity unlocks new use cases
  2. Multimodal Excellence: Strong performance across text, video, audio, and code
  3. Efficient Architecture: MoE design provides better performance with less compute
  4. Long-form Understanding: Can process entire books, codebases, and hours of media
  5. Cross-document Reasoning: Ability to analyze relationships between multiple sources
  6. Video Intelligence: Sophisticated understanding of visual content and temporal information
