Large Language Model Tokenizers Explained
How tokenizers convert human language into the integer sequences LLMs process, from simple vocabularies to BPE and production tokenizers.
Noah Kurz
March 14, 2026
8 min read
Tokenizers: The Translation Layer Between Humans and Machines
Large language models do not read English, Spanish, Python, or any other language we write in. They only understand numbers. Every sentence you type into ChatGPT, every prompt you send to an API, gets converted into a sequence of integers before the model ever sees it. The system responsible for that conversion is called a tokenizer.
A tokenizer does two things:
- Encode: turn text into a list of numbers, usually called token IDs
- Decode: turn a list of token IDs back into text
That is it. It is a translation layer. Humans speak in text, machines speak in numbers, and the tokenizer sits in between.
Building a Simple Tokenizer
The simplest possible tokenizer is a lookup table. You define a vocabulary, which is just a dictionary that maps every known word and punctuation mark to a unique integer.
```python
vocab = {
    "hello": 0,
    "world": 1,
    "this": 2,
    "is": 3,
    "a": 4,
    "test": 5,
    ",": 6,
    "!": 7,
    ".": 8,
}
```

Then encoding is just splitting the text and looking up each piece:
```python
import re

class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
        preprocessed = [item.strip() for item in preprocessed if item.strip()]
        ids = [self.str_to_int[item] for item in preprocessed]
        return ids

    def decode(self, ids):
        text = " ".join([self.int_to_str[token_id] for token_id in ids])
        text = re.sub(r"\s+([,.?_!\"()`'])", r"\1", text)
        return text
```

The `encode` method lowercases the input, splits it into words and punctuation using a regex, then converts each piece into an integer using the vocab dictionary. The `decode` method reverses the process: it maps IDs back to strings, joins them with spaces, and then cleans up spacing around punctuation.
For the sentence "Hello, world!", this produces:
```python
[0, 6, 1, 7]
```

Four tokens. Four numbers. That is what the model would see.
How Spaces Are Handled
One subtle detail in our simple tokenizer is that it does not really preserve spaces. It uses whitespace as a separator, not as something meaningful to encode.
This line is what does it:
```python
preprocessed = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
```

The `\s` part means "split on whitespace." Then this line removes the empty pieces and strips away surrounding spaces:

```python
preprocessed = [item.strip() for item in preprocessed if item.strip()]
```

That means an input like `"Hello,   world!"` (with three spaces) gets reduced to something more like:

```python
["hello", ",", "world", "!"]
```

The three spaces between "Hello," and "world!" are not stored anywhere. They do not become tokens. They just disappear.
The same thing happens in reverse when we decode. The tokenizer joins every token with exactly one space:
```python
text = " ".join([self.int_to_str[token_id] for token_id in ids])
```

Then it removes the extra spaces before punctuation:

```python
text = re.sub(r"\s+([,.?_!\"()`'])", r"\1", text)
```

So the simple tokenizer does not preserve the original formatting. Multiple spaces, tabs, and newlines are effectively normalized away.
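To see the normalization end to end, here is a small self-contained sketch of the same split, strip, join, and fix-up pipeline the class uses:

```python
import re

def simple_round_trip(text):
    # Encode side: split on punctuation/whitespace, then drop the whitespace pieces.
    pieces = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
    pieces = [p.strip() for p in pieces if p.strip()]
    # Decode side: rejoin with single spaces, then tighten spacing before punctuation.
    out = " ".join(pieces)
    return re.sub(r"\s+([,.?_!\"()`'])", r"\1", out)

print(simple_round_trip("Hello,   world!"))  # hello, world!
```

The three original spaces come back as one, and the capital "H" is gone: the round trip is lossy by design.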
Modern LLMs handle spaces very differently. Instead of throwing whitespace away, their tokenizers usually encode it as part of the token stream.
In practice, that often means one of three things:
- A token includes its leading space, so `" world"` is a different token from `"world"`
- A tokenizer uses a visible whitespace marker, such as `▁world`, to mean "world that begins after a space"
- A byte-level tokenizer can represent spaces, tabs, and newlines directly
So while our toy tokenizer might represent:

```python
["hello", ",", "world", "!"]
```

a modern tokenizer may represent the same idea more like:

```python
["Hello", ",", " world", "!"]
```

That difference matters. It means the model can tell the difference between `"hello world"`, `"helloworld"`, `"hello  world"` (with extra spaces), and even indented code. For modern LLMs, whitespace is not just formatting. It is part of the input the model actually sees.
The Unknown Token Problem
This approach has an obvious flaw: what happens when the tokenizer encounters a word that is not in the vocabulary?
If someone types "Hello, world! This is a test. dog" and "dog" is not in our vocab, the lookup fails with a KeyError. The tokenizer simply cannot represent something it has never seen.
One common solution is the unknown token, usually written as <UNK>. The idea is simple. If a word is not in the vocabulary, replace it with a special placeholder.
```python
class SimpleTokenizerV2:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        preprocessed = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        preprocessed = [
            item if item in self.str_to_int else "<UNK>"
            for item in preprocessed
        ]
        ids = [self.str_to_int[item] for item in preprocessed]
        return ids
```

Now the vocabulary includes the special token:
```python
vocab = {
    "hello": 0,
    "world": 1,
    "this": 2,
    "is": 3,
    "a": 4,
    "test": 5,
    ",": 6,
    "!": 7,
    ".": 8,
    "<UNK>": 9,
}
```

With this change, "dog" gets encoded as 9, the ID for `<UNK>`. The tokenizer no longer crashes.
But there is a cost: information gets destroyed. When you decode those IDs back into text, "dog" is gone. It has been replaced by <UNK>. The model never learned anything specific about that word. It just saw a generic "I do not know this" signal. If your vocabulary is small and many words map to <UNK>, the model is effectively working blindfolded.
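A compact round trip, reusing the regex and vocabulary from above, makes the information loss concrete:

```python
import re

vocab = {"hello": 0, "world": 1, "this": 2, "is": 3, "a": 4,
         "test": 5, ",": 6, "!": 7, ".": 8, "<UNK>": 9}
id_to_str = {i: s for s, i in vocab.items()}

def encode(text):
    pieces = re.split(r"([,.?_!\"()`']|--|\s)", text.lower())
    pieces = [p.strip() for p in pieces if p.strip()]
    # Any out-of-vocabulary piece collapses into the same placeholder ID.
    return [vocab.get(p, vocab["<UNK>"]) for p in pieces]

def decode(ids):
    text = " ".join(id_to_str[i] for i in ids)
    return re.sub(r"\s+([,.?_!\"()`'])", r"\1", text)

ids = encode("Hello, world! This is a test. dog")
print(ids)          # the final ID is 9, the <UNK> placeholder
print(decode(ids))  # "dog" is gone; only <UNK> comes back
```

Whatever "dog" was, it is unrecoverable after encoding: every unknown word decodes to the same placeholder.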
How Modern LLMs Avoid This: Subword Tokenization
Modern language models like GPT, LLaMA, and Claude do not use word-level tokenization. They use subword tokenization, a strategy that breaks text into smaller pieces so that any input can be represented, even words the model has never seen before.
Instead of needing a vocab entry for "unhappiness", a subword tokenizer might split it into:
["un", "happiness"]Or a rare word like "blargify":
["bl", "arg", "ify"]Each of those pieces exists in the vocabulary. No <UNK> needed. The model can always represent the input, and it can even reason about the structure of unfamiliar words by recognizing familiar subparts.
The most common algorithm behind this is Byte Pair Encoding (BPE). At a high level, it works like this:
- Start with a base vocabulary of individual characters or bytes
- Scan a massive corpus of training text
- Find the most frequently occurring pair of adjacent tokens
- Merge that pair into a new token
- Repeat thousands of times
Each merge creates a new, longer token. Common words like "the" quickly become single tokens. Rare words stay broken into smaller pieces. The process stops when the vocabulary reaches a target size, typically somewhere between 30,000 and 100,000+ tokens.
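The merge loop described above can be sketched in a few lines. This is a toy learner over a handful of words, not a production implementation, but the count-merge-repeat structure is the same:

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    # Start from individual characters: each word is a tuple of 1-char tokens.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges, words

corpus = ["the", "the", "the", "then", "they", "there"]
merges, words = learn_bpe_merges(corpus, 2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

After just two merges, the common word "the" has become a single token, while rarer words like "there" remain split, exactly the frequency-driven behavior described above.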
How AI Labs Build Their Vocabularies
The vocab dictionary in our simple example was hand-written with 10 entries. Real AI labs build vocabularies with tens of thousands of entries, and the process is entirely data-driven.
The general approach looks like this:
- Collect a massive text corpus from books, websites, code, and other sources
- Run BPE or a similar algorithm on that corpus to learn which character sequences appear most frequently
- Choose a vocabulary size based on the tradeoff between efficiency and granularity
- Include special tokens for things like document boundaries, padding, or control markers
That vocabulary size decision matters more than it might seem. GPT-2 uses roughly 50,257 tokens. GPT-4 uses around 100,000. Larger vocabularies mean common words and phrases are more likely to get single-token representations, which is more efficient. The downside is that the embedding table the model needs to store gets larger too.
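A rough back-of-the-envelope calculation shows the embedding-table cost. The 768 here is GPT-2 small's actual embedding dimension; pairing that same dimension with a 100,000-token vocabulary is purely for comparison:

```python
# Embedding table parameters = vocab_size * embedding_dim.
gpt2_embedding = 50_257 * 768       # GPT-2 small: ~50k vocab, dim 768
large_vocab_embedding = 100_000 * 768  # same dim, 100k-token vocabulary

print(f"{gpt2_embedding:,}")        # 38,597,376 parameters
print(f"{large_vocab_embedding:,}") # 76,800,000 parameters
```

Doubling the vocabulary roughly doubles the embedding table, which is why vocabulary size is a deliberate tradeoff rather than "bigger is always better."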
The resulting vocabulary is a fixed artifact. Once trained, it does not change. Every piece of text the model ever processes, during training and during inference, goes through the same tokenizer with the same vocabulary.
This is why tokenizer design matters so much. A vocabulary trained mostly on English text will need more tokens to represent Korean or Arabic. A vocabulary that does not include code-specific patterns will tokenize Python inefficiently. The tokenizer shapes what the model can see and how efficiently it can process different kinds of input.
Using Tiktoken: A Production Tokenizer
OpenAI's tiktoken library gives you access to the tokenizers used by GPT models. It is fast, because it is implemented in Rust under the hood, and it is simple to use:
```python
import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, world! This is a test. dog"
ids = tokenizer.encode(text)
print(ids)

decoded = tokenizer.decode(ids)
print(decoded)
```

Unlike our simple tokenizer, tiktoken handles "dog" without any problem. It does not need an `<UNK>` token because it uses BPE. If a word is not a single token, it gets broken into subword pieces that are.
The "gpt2" encoding is a useful baseline to experiment with, and "cl100k_base" is a good example of a newer GPT-family tokenizer that handles a wider range of languages and code more efficiently.
Why This Matters
A tokenizer is deceptively simple in concept. It is just a mapping between text and numbers. But the design of that mapping has deep consequences for what a language model can do, how efficiently it processes input, and how well it handles the full diversity of human language.
The evolution from word-level vocabularies, with their <UNK> problem, to subword tokenization, where anything can be represented, is one of the key enabling ideas behind modern LLMs. It is the reason you can type a made-up word, a URL, or a line of code into ChatGPT and still get a coherent response back.
The model never reads your words. It reads the numbers your tokenizer chose.