LLM Tokenization

Overview

Andrej Karpathy recently published a new lecture on large language model (LLM) tokenization. Tokenization is a key step in training LLMs: before a model ever sees text, a separate tokenizer is trained on its own dataset with its own algorithm (most commonly Byte Pair Encoding, or BPE) to convert raw text into token IDs.
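The core BPE idea can be sketched in a few lines of Python: start from raw UTF-8 bytes, then repeatedly find the most frequent adjacent pair of tokens and merge it into a new token. This is a minimal illustrative sketch (function names are my own, not Karpathy's actual code):

```python
from collections import Counter

def get_pairs(ids):
    # Count every adjacent pair of token IDs in the sequence.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token ID.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    # Start from raw UTF-8 bytes (IDs 0-255); new tokens get IDs 256+.
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = get_pairs(ids)
        if not pairs:
            break
        top = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, top, new_id)
        merges[top] = new_id
    return ids, merges

ids, merges = train_bpe("aaabdaaabac", num_merges=2)
# "aa" (bytes 97, 97) is the most frequent pair, so it becomes token 256.
```

Each merge shortens the sequence; running many merges over a large corpus is what produces the vocabularies of tens of thousands of tokens that real models use.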

Key Content

In the lecture, Karpathy shows how to implement a GPT tokenizer from scratch, and he discusses a range of weird LLM behaviors that trace back to tokenization.

Lecture Reference

"LLM Tokenization"

Figure Source: https://youtu.be/zduSFxRajkE?t=6711

Common LLM Issues Explained by Tokenization

Here is the list of issues, taken from the lecture, that can be traced back to tokenization:

1. Spelling Problems

  • Why can't LLM spell words? Tokenization.

2. String Processing Limitations

  • Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.

3. Language Bias

  • Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.

4. Mathematical Limitations

  • Why is LLM bad at simple arithmetic? Tokenization.

5. Coding Challenges

  • Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.

6. Unexpected Halting

  • Why did my LLM abruptly halt when it sees the string "&lt;|endoftext|&gt;"? Tokenization.

7. Whitespace Warnings

  • What is this weird warning I get about a "trailing whitespace"? Tokenization.

8. Specific String Failures

  • Why does the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.

9. Format Preferences

  • Why should I prefer to use YAML over JSON with LLMs? Tokenization.

10. Architecture Understanding

  • Why is LLM not actually end-to-end language modeling? Tokenization.

11. Philosophical Question

  • What is the real root of suffering? Tokenization.

Practical Implications

To improve the reliability of LLMs, it's important to understand how to prompt them, which in turn requires understanding their limitations. Tokenizers receive little attention at inference time (beyond settings like max_tokens), but good prompt engineering means accounting for the constraints tokenization imposes, just as it means knowing how to structure and format your prompt.

Common Problems

A prompt may underperform because, for instance, an acronym or concept is split into unhelpful tokens rather than processed as a single unit. This is a very common problem that many LLM developers and researchers overlook.
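This failure mode can be sketched with the same kind of toy greedy tokenizer (the vocabulary here is hypothetical, not any real model's): a familiar word maps to a few well-trained tokens, while an unseen acronym shatters into single characters that carry far less signal:

```python
def tokenize(text, vocab):
    # Greedy longest-match tokenization over a tiny hypothetical vocabulary;
    # unknown material falls back to single-character tokens.
    tokens, i = [], 0
    pieces = sorted(vocab, key=len, reverse=True)
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"token", "ization"}
print(tokenize("tokenization", vocab))  # two tokens the model knows well
print(tokenize("RLHF", vocab))          # four opaque single-character tokens
```

When a concept in your prompt fragments this way, rephrasing it (e.g., spelling out the acronym) can land on tokens the model has seen far more often during training.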

Tools

A good tool for exploring tokenization is Tiktokenizer, which is what the lecture itself uses for its demonstrations.

Key Takeaways

  1. Tokenization is fundamental to understanding LLM behavior
  2. Many LLM limitations stem from tokenization choices
  3. Prompt engineering should consider tokenization constraints
  4. Understanding tokenization helps debug unexpected LLM behavior
  5. Tools like Tiktokenizer make it easy to inspect how text is actually tokenized