LLM Tokenization

Overview

Andrej Karpathy recently published a new lecture on large language model (LLM) tokenization. Tokenization is a key step in training LLMs: before a model ever sees text, a separate tokenizer is trained on its own dataset with its own algorithm (most commonly Byte Pair Encoding, or BPE) to convert raw text into token IDs.
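The core BPE idea can be sketched in a few lines of Python: start from raw UTF-8 bytes, then repeatedly find the most frequent adjacent pair of tokens and merge it into a new token. This is a minimal illustrative sketch (function names are my own, not Karpathy's actual code):

```python
from collections import Counter

def get_pairs(ids):
    # Count every adjacent pair of token IDs in the sequence.
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    # Replace every occurrence of `pair` with the new token ID.
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    # Start from raw UTF-8 bytes (IDs 0-255); new tokens get IDs 256+.
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pairs = get_pairs(ids)
        if not pairs:
            break
        top = max(pairs, key=pairs.get)  # most frequent adjacent pair
        new_id = 256 + step
        ids = merge(ids, top, new_id)
        merges[top] = new_id
    return ids, merges

ids, merges = train_bpe("aaabdaaabac", num_merges=2)
# "aa" (bytes 97, 97) is the most frequent pair, so it becomes token 256.
```

Each merge shortens the sequence; running many merges over a large corpus is what produces the vocabularies of tens of thousands of tokens that real models use.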

Key Content

In the lecture, Karpathy shows how to implement a GPT tokenizer from scratch, and he discusses a range of weird LLM behaviors that trace back to tokenization.

Lecture Reference

"LLM Tokenization"

Figure Source: https://youtu.be/zduSFxRajkE?t=6711

Common LLM Issues Explained by Tokenization

Here is the list of issues, taken from the lecture, that can be traced back to tokenization:

1. Spelling Problems

  • Why can't LLM spell words? Tokenization.

2. String Processing Limitations

  • Why can't LLM do super simple string processing tasks like reversing a string? Tokenization.

3. Language Bias

  • Why is LLM worse at non-English languages (e.g. Japanese)? Tokenization.

4. Mathematical Limitations

  • Why is LLM bad at simple arithmetic? Tokenization.

5. Coding Challenges

  • Why did GPT-2 have more than necessary trouble coding in Python? Tokenization.

6. Unexpected Halting

  • Why did my LLM abruptly halt when it sees the string "&lt;|endoftext|&gt;"? Tokenization.

7. Whitespace Warnings

  • What is this weird warning I get about a "trailing whitespace"? Tokenization.

8. Specific String Failures

  • Why does the LLM break if I ask it about "SolidGoldMagikarp"? Tokenization.

9. Format Preferences

  • Why should I prefer to use YAML over JSON with LLMs? Tokenization.

10. Architecture Understanding

  • Why is LLM not actually end-to-end language modeling? Tokenization.

11. Philosophical Question

  • What is the real root of suffering? Tokenization.

Practical Implications

To improve the reliability of LLMs, it's important to understand how to prompt them, which in turn requires understanding their limitations. Tokenizers receive little attention at inference time (beyond settings like max_tokens), but good prompt engineering means accounting for the constraints tokenization imposes, just as it means knowing how to structure and format your prompt.

Common Problems

A prompt may underperform because, for instance, an acronym or concept is split into unhelpful tokens rather than processed as a single unit. This is a very common problem that many LLM developers and researchers overlook.
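This failure mode can be sketched with the same kind of toy greedy tokenizer (the vocabulary here is hypothetical, not any real model's): a familiar word maps to a few well-trained tokens, while an unseen acronym shatters into single characters that carry far less signal:

```python
def tokenize(text, vocab):
    # Greedy longest-match tokenization over a tiny hypothetical vocabulary;
    # unknown material falls back to single-character tokens.
    tokens, i = [], 0
    pieces = sorted(vocab, key=len, reverse=True)
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"token", "ization"}
print(tokenize("tokenization", vocab))  # two tokens the model knows well
print(tokenize("RLHF", vocab))          # four opaque single-character tokens
```

When a concept in your prompt fragments this way, rephrasing it (e.g., spelling out the acronym) can land on tokens the model has seen far more often during training.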

Tools

A good tool for exploring tokenization is Tiktokenizer, which is what the lecture itself uses for its demonstrations.

Key Takeaways

  1. Tokenization is fundamental to understanding LLM behavior
  2. Many LLM limitations stem from tokenization choices
  3. Prompt engineering should consider tokenization constraints
  4. Understanding tokenization helps debug unexpected LLM behavior
  5. Tools like Tiktokenizer make it easy to inspect how text is actually tokenized