Large Language Model Transformers Explained
A deep dive into the transformers that power the LLMs we know and love.
Noah Kurz
March 9, 2026
10 min read
Before I started building my own LLM, I wanted to understand how they work under the hood. To do that, I researched how models take in our input and process it to reach their eventual result. I quickly found that most modern LLMs use something called "transformers" to kick off that processing.
So what are transformers?
Transformers are how LLMs determine the context of our queries and eventually process them. They assign weight to certain words depending on their order in the prompt, their semantic meaning, and the meaning deduced from the prompt's syntax.
The main mechanism that makes this possible is called attention. Attention is the process that lets the model look at one token in relation to the other tokens around it and decide which ones matter most for understanding meaning. Instead of treating every word as equally important, attention helps the model focus on the parts of the input that give a word its context. That ability to dynamically focus on relevant words is a huge part of what made transformers such a big step forward.
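To make attention a little more concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core computation inside a transformer. The random vectors are just stand-ins for real token embeddings, and real models add learned projection weights and multiple attention heads on top of this:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each row of Q scores every row of K,
    the scores become weights via softmax, and the weights mix rows of V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # how relevant each token is to each other token
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy example: 3 tokens, 4-dimensional vectors (random, just to show the shapes)
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out, w = attention(x, x, x)  # self-attention: Q, K, V all come from the same tokens
print(out.shape)             # (3, 4) — one context-mixed vector per token
```

The key idea is that the output for each token is a weighted mix of every token's vector, with the weights decided dynamically by the input itself.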
Encoders: Understanding the Input
To start, you can think of an encoder as the "understander" part of the transformer. Its job is to take the tokenized input and build up a richer meaning for each token as it moves through multiple layers. Early on, a token might just represent the word itself. But as it passes through each layer, the model starts to understand how that word relates to the rest of the sentence. That means words are not interpreted in isolation. Their meaning changes depending on what is around them.
This is where attention becomes really important. Instead of reading a sentence strictly left to right and hoping the meaning sticks, the encoder can look at all of the words together and figure out which ones matter most to each other. For example, if the prompt says "The bank by the river was flooded", attention helps the model understand that "bank" probably does not mean a financial institution. Over many layers, the encoder keeps refining that understanding until it has a much more contextual representation of the full input.
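Here is a rough sketch of what one encoder layer does to its token vectors, heavily simplified: real layers use multiple attention heads, learned normalization parameters, and far larger dimensions, but the "attend, then transform, with residual connections" shape is the same:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each token vector so stacked layers stay numerically stable
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(x, Wq, Wk, Wv, W1, W2):
    # 1) self-attention sublayer: each token mixes in information from the others
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    x = layer_norm(x + w @ V)                        # residual connection + norm
    # 2) feed-forward sublayer, applied to each token independently
    x = layer_norm(x + np.maximum(x @ W1, 0) @ W2)   # ReLU MLP + residual
    return x

rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=(5, d))                          # 5 tokens, 8-dim vectors
params = [rng.normal(size=(d, d)) * 0.1 for _ in range(5)]
for _ in range(3):                                   # stacking layers refines each token
    x = encoder_layer(x, *params)
print(x.shape)  # still (5, 8): same tokens, richer representations
```

Notice that the shape never changes: each layer keeps one vector per token, it just keeps folding more context into it.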
Decoders: Generating the Response
The decoder is the "generator" part. Once the model has an understanding of the input, the decoder uses that understanding to begin creating an output sequence one token at a time. This process is called autoregressive generation. That just means the model generates the next token based on everything it has already seen and everything it has already generated so far. It is always predicting "what should come next?" and then repeating that process until it reaches a complete response.
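The autoregressive loop itself is simple. This toy sketch swaps the neural network for a hypothetical lookup table of "what tends to follow what", purely to show the predict, append, repeat shape of generation:

```python
# Hypothetical toy: the "model" is just a table of what token follows each token.
# A real LLM replaces this lookup with a transformer predicting the next token.
bigram_model = {
    "<start>": "the", "the": "cat", "cat": "sat", "sat": "down", "down": "<end>",
}

def generate(model, max_tokens=10):
    tokens = ["<start>"]
    while len(tokens) < max_tokens:
        next_token = model[tokens[-1]]  # "what should come next?"
        tokens.append(next_token)       # the new token becomes part of the context
        if next_token == "<end>":
            break
    return tokens

print(generate(bigram_model))
# ['<start>', 'the', 'cat', 'sat', 'down', '<end>']
```

The real model conditions on the entire sequence so far, not just the last token, but the loop structure is exactly this.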
The decoder also uses attention, but in a slightly different way. It has to pay attention to the tokens it has already generated so it stays consistent and does not drift too far off track. In encoder-decoder architectures, it also looks back at the encoder's output so it can stay grounded in the original input. That is what lets it generate something relevant instead of just producing a vaguely plausible sentence. In simple terms, the encoder figures out what was meant, and the decoder turns that understanding into a response.
How They Work Together
When both pieces are used together, the flow is pretty intuitive. The encoder reads and interprets the full input sequence. It creates a set of internal representations that capture meaning, relationships, and context. Then the decoder uses those representations as a source of truth while generating the output sequence step by step. This design is especially useful for tasks like translation or summarization, where the model benefits from first deeply understanding the full input before it starts writing.
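The hand-off between the two halves happens through cross-attention: the decoder's current state queries the encoder's outputs and pulls back a context vector grounded in the input. A simplified NumPy version, with made-up shapes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def cross_attention(decoder_state, encoder_outputs):
    # the decoder's state scores every encoder output (the "source of truth")...
    scores = decoder_state @ encoder_outputs.T / np.sqrt(decoder_state.shape[-1])
    weights = softmax(scores)           # ...deciding how much to rely on each input token
    return weights @ encoder_outputs    # context vector grounded in the original input

rng = np.random.default_rng(2)
encoder_outputs = rng.normal(size=(6, 8))  # 6 input tokens, already encoded
decoder_state = rng.normal(size=(1, 8))    # the decoder, mid-generation
context = cross_attention(decoder_state, encoder_outputs)
print(context.shape)  # (1, 8)
```

This is the same attention math as before; the only difference is that the queries come from the decoder while the keys and values come from the encoder.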
Why Some Models Use Only Decoders
That said, not every modern LLM uses both parts. A lot of the models we use today are decoder-only models. These models skip the dedicated encoder and instead learn to understand the prompt while also generating the response in a single stack of transformer blocks. They still use attention, but they do it in a way that only lets them look at the current prompt and the tokens that came before. Because they are trained on massive amounts of text by predicting the next token, they get surprisingly good at both understanding and generating without a separate encoder.
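The "only look at the tokens that came before" rule is implemented with a causal mask: future positions get a score of negative infinity, so after the softmax they receive exactly zero attention weight. A small NumPy sketch:

```python
import numpy as np

n = 4
mask = np.triu(np.ones((n, n)), k=1).astype(bool)  # True above the diagonal = "future"
scores = np.zeros((n, n))          # pretend all raw attention scores are equal
scores[mask] = -np.inf             # future positions are blocked out
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)
print(weights)
# row i spreads its attention only over positions 0..i
```

Token 1 can only see token 0, token 2 can see tokens 0 through 1, and so on. This is what makes next-token training honest: the model can never peek ahead at the answer.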
This decoder-only setup is popular because it is simpler and works extremely well for open-ended text generation, chat, and code completion. If your main goal is "given this prompt, keep generating the next useful token", a decoder-only model is often enough. But when you have tasks where the input and output are clearly different sequences, like translating one language into another or turning a long document into a short summary, using both an encoder and decoder can still be a really strong design. In those cases, having one part focused entirely on understanding and one part focused entirely on generating can be an advantage.
Final Thoughts
The more I learned about transformers, the more they started to feel less like magic and more like really well-designed systems for handling context. At a high level, that is what makes them so powerful. They are not just storing words. They are modeling relationships between words, building meaning across layers, and then using that meaning to predict what should come next. Once I understood that, the idea of building my own LLM started to feel a lot less mysterious.