Large Language Model Tokenizers Explained
How tokenizers convert human language into the integer sequences LLMs process, from simple vocabularies to BPE and production tokenizers.
Noah Kurz
March 14, 2026
8 min read
Tokenizers: The Translation Layer Between Humans and Machines
Large language models do not read English, Spanish, Python, or any other language we write in. They only understand numbers. Every sentence you type into ChatGPT, every prompt you send to an API, gets converted into a sequence of integers before the model ever sees it. The system responsible for that conversion is called a tokenizer.
A tokenizer does two things:
- Encode: turn text into a list of numbers, usually called token IDs
- Decode: turn a list of token IDs back into text
That is it. It is a translation layer. Humans speak in text, machines speak in numbers, and the tokenizer sits in between.
Building a Simple Tokenizer
The simplest possible tokenizer is a lookup table. You define a vocabulary, which is just a dictionary that maps every known word and punctuation mark to a unique integer.
```python
vocab = {
    "hello": 0,
    "world": 1,
    "this": 2,
    "is": 3,
    "a": 4,
    "test": 5,
    ",": 6,
    "!": 7,
    ".": 8,
}
```

Then encoding is just splitting the text and looking up each piece:
```python
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[item] for item in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[token_id] for token_id in ids])
        text = re.sub(r"\s+([,.?_!\"()`'])", r"\1", text)
        return text
```

The `encode` method lowercases the input, splits it into words and punctuation using a regex, then converts each piece into an integer using the vocab dictionary. The `decode` method reverses the process: it maps IDs back to strings, joins them with spaces, and then cleans up spacing around punctuation.
For the sentence "Hello, world!", this produces:
```python
[0, 6, 1, 7]
```

Four tokens. Four numbers. That is what the model would see.
How Spaces Are Handled
One subtle detail in our simple tokenizer is that it does not really preserve spaces. It uses whitespace as a separator, not as something meaningful to encode.
This line is what does it:
```python
preprocessed = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
```

The `\s` part means "split on whitespace." Then this line removes the empty pieces and strips away surrounding spaces:

```python
preprocessed = [item.strip() for item in preprocessed if item.strip()]
```

That means an input like `"Hello,   world!"` (with three spaces) gets reduced to something more like:

```python
["hello", ",", "world", "!"]
```

The three spaces between "Hello," and "world!" are not stored anywhere. They do not become tokens. They just disappear.
The same thing happens in reverse when we decode. The tokenizer joins every token with exactly one space:
```python
text = " ".join([self.int_to_str[token_id] for token_id in ids])
```

Then it removes the extra spaces before punctuation:

```python
text = re.sub(r"\s+([,.?_!\"()`'])", r"\1", text)
```

So the simple tokenizer does not preserve the original formatting. Multiple spaces, tabs, and newlines are effectively normalized away.
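To see the normalization end to end, here is a small self-contained sketch of the same split, strip, join, and fix-up pipeline the class uses:

```python
import re

def simple_round_trip(text):
    # Encode side: split on punctuation/whitespace, then drop the whitespace pieces.
    pieces = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
    pieces = [p.strip() for p in pieces if p.strip()]
    # Decode side: rejoin with single spaces, then tighten spacing before punctuation.
    out = " ".join(pieces)
    return re.sub(r"\s+([,.?_!\"()`'])", r"\1", out)

print(simple_round_trip("Hello,   world!"))  # hello, world!
```

The three original spaces come back as one, and the capital "H" is gone: the round trip is lossy by design.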
Modern LLMs handle spaces very differently. Instead of throwing whitespace away, their tokenizers usually encode it as part of the token stream.
In practice, that often means one of three things:
- A token includes its leading space, so `" world"` is a different token from `"world"`
- A tokenizer uses a visible whitespace marker, such as `▁world`, to mean "world that begins after a space"
- A byte-level tokenizer can represent spaces, tabs, and newlines directly
So while our toy tokenizer might represent:

```python
["hello", ",", "world", "!"]
```

a modern tokenizer may represent the same idea more like:

```python
["Hello", ",", " world", "!"]
```

That difference matters. It means the model can tell the difference between `"hello world"`, `"helloworld"`, `"hello  world"` (with extra spaces), and even indented code. For modern LLMs, whitespace is not just formatting. It is part of the input the model actually sees.
The Unknown Token Problem
This approach has an obvious flaw: what happens when the tokenizer encounters a word that is not in the vocabulary?
If someone types "Hello, world! This is a test. dog" and "dog" is not in our vocab, the lookup fails with a KeyError. The tokenizer simply cannot represent something it has never seen.
One common solution is the unknown token, usually written as <UNK>. The idea is simple. If a word is not in the vocabulary, replace it with a special placeholder.
```python
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [
            item if item in self.str_to_int else "<UNK>"
            for item in preprocessed
        ]
        ids = [self.str_to_int[item] for item in preprocessed]
        return ids
```

Now the vocabulary includes the special token:
```python
vocab = {
    "hello": 0,
    "world": 1,
    "this": 2,
    "is": 3,
    "a": 4,
    "test": 5,
    ",": 6,
    "!": 7,
    ".": 8,
    "<UNK>": 9,
}
```

With this change, "dog" gets encoded as 9, the ID for `<UNK>`. The tokenizer no longer crashes.
But there is a cost: information gets destroyed. When you decode those IDs back into text, "dog" is gone. It has been replaced by <UNK>. The model never learned anything specific about that word. It just saw a generic "I do not know this" signal. If your vocabulary is small and many words map to <UNK>, the model is effectively working blindfolded.
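A compact round trip, reusing the regex and vocabulary from above, makes the information loss concrete:

```python
import re

vocab = {"hello": 0, "world": 1, "this": 2, "is": 3, "a": 4,
         "test": 5, ",": 6, "!": 7, ".": 8, "<UNK>": 9}
id_to_str = {i: s for s, i in vocab.items()}

def encode(text):
    pieces = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
    pieces = [p.strip() for p in pieces if p.strip()]
    # Any out-of-vocabulary piece collapses into the same placeholder ID.
    return [vocab.get(p, vocab["<UNK>"]) for p in pieces]

def decode(ids):
    text = " ".join(id_to_str[i] for i in ids)
    return re.sub(r"\s+([,.?_!\"()`'])", r"\1", text)

ids = encode("Hello, world! This is a test. dog")
print(ids)          # the final ID is 9, the <UNK> placeholder
print(decode(ids))  # "dog" is gone; only <UNK> comes back
```

Whatever "dog" was, it is unrecoverable after encoding: every unknown word decodes to the same placeholder.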
How Modern LLMs Avoid This: Subword Tokenization
Modern language models like GPT, LLaMA, and Claude do not use word-level tokenization. They use subword tokenization, a strategy that breaks text into smaller pieces so that any input can be represented, even words the model has never seen before.
Instead of needing a vocab entry for "unhappiness", a subword tokenizer might split it into:
["un", "happiness"]Or a rare word like "blargify":
["bl", "arg", "ify"]Each of those pieces exists in the vocabulary. No <UNK> needed. The model can always represent the input, and it can even reason about the structure of unfamiliar words by recognizing familiar subparts.
The most common algorithm behind this is Byte Pair Encoding (BPE). At a high level, it works like this:
- Start with a base vocabulary of individual characters or bytes
- Scan a massive corpus of training text
- Find the most frequently occurring pair of adjacent tokens
- Merge that pair into a new token
- Repeat thousands of times
Each merge creates a new, longer token. Common words like "the" quickly become single tokens. Rare words stay broken into smaller pieces. The process stops when the vocabulary reaches a target size, typically somewhere between 30,000 and 100,000+ tokens.
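The merge loop described above can be sketched in a few lines. This is a toy learner over a handful of words, not a production implementation, but the count-merge-repeat structure is the same:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # Start from individual characters: each word is a tuple of 1-char tokens.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges, words

corpus = ["the", "the", "the", "then", "they", "there"]
merges, words = learn_bpe_merges(corpus, 2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

After just two merges, the common word "the" has become a single token, while rarer words like "there" remain split, exactly the frequency-driven behavior described above.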
How AI Labs Build Their Vocabularies
The vocab dictionary in our simple example was hand-written with 10 entries. Real AI labs build vocabularies with tens of thousands of entries, and the process is entirely data-driven.
The general approach looks like this:
- Collect a massive text corpus from books, websites, code, and other sources
- Run BPE or a similar algorithm on that corpus to learn which character sequences appear most frequently
- Choose a vocabulary size based on the tradeoff between efficiency and granularity
- Include special tokens for things like document boundaries, padding, or control markers
That vocabulary size decision matters more than it might seem. GPT-2 uses roughly 50,257 tokens. GPT-4 uses around 100,000. Larger vocabularies mean common words and phrases are more likely to get single-token representations, which is more efficient. The downside is that the embedding table the model needs to store gets larger too.
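A rough back-of-the-envelope calculation shows the embedding-table cost. The 768 here is GPT-2 small's actual embedding dimension; pairing that same dimension with a 100,000-token vocabulary is purely for comparison:

```python
# Embedding table parameters = vocab_size * embedding_dim.
gpt2_embedding = 50_257 * 768       # GPT-2 small: ~50k vocab, dim 768
large_vocab_embedding = 100_000 * 768  # same dim, 100k-token vocabulary

print(f"{gpt2_embedding:,}")        # 38,597,376 parameters
print(f"{large_vocab_embedding:,}") # 76,800,000 parameters
```

Doubling the vocabulary roughly doubles the embedding table, which is why vocabulary size is a deliberate tradeoff rather than "bigger is always better."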
The resulting vocabulary is a fixed artifact. Once trained, it does not change. Every piece of text the model ever processes, during training and during inference, goes through the same tokenizer with the same vocabulary.
This is why tokenizer design matters so much. A vocabulary trained mostly on English text will need more tokens to represent Korean or Arabic. A vocabulary that does not include code-specific patterns will tokenize Python inefficiently. The tokenizer shapes what the model can see and how efficiently it can process different kinds of input.
Using Tiktoken: A Production Tokenizer
OpenAI's tiktoken library gives you access to the tokenizers used by GPT models. It is fast, because it is implemented in Rust under the hood, and it is simple to use:
```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, world! This is a test. dog"
ids = tokenizer.encode(text)
print(ids)

decoded = tokenizer.decode(ids)
print(decoded)
```

Unlike our simple tokenizer, tiktoken handles "dog" without any problem. It does not need an `<UNK>` token because it uses BPE. If a word is not a single token, it gets broken into subword pieces that are.
The "gpt2" encoding is a useful baseline to experiment with, and "cl100k_base" is a good example of a newer GPT-family tokenizer that handles a wider range of languages and code more efficiently.
Why This Matters
A tokenizer is deceptively simple in concept. It is just a mapping between text and numbers. But the design of that mapping has deep consequences for what a language model can do, how efficiently it processes input, and how well it handles the full diversity of human language.
The evolution from word-level vocabularies, with their <UNK> problem, to subword tokenization, where anything can be represented, is one of the key enabling ideas behind modern LLMs. It is the reason you can type a made-up word, a URL, or a line of code into ChatGPT and still get a coherent response back.
The model never reads your words. It reads the numbers your tokenizer chose.