Transformers made simple: the magic behind the ChatGPT curtain

Frontier Team


Large language models have emerged as a revolutionary technology to drive intelligent decision making. The breakthrough behind them was the invention of the Transformer. This article explains how Transformers work in simple terms, revealing the magic behind models like ChatGPT. We look at their main components and shed light on their inner workings.

Transformers: Unraveling the Language Processing Revolution

Firstly, what is a Transformer and what is it trying to do?

It is a type of neural network, which is a software program whose principles are inspired by neurons in the brain.

Neural networks are great at finding patterns in data. One of the richest and most widely used sources of patterns in the world, especially the business world, is natural language.

If we begin a sentence: the cat sat on the _______

You can probably guess the ending, as in the common rhyme: mat. In other words, you can see the pattern. But that’s because your brain knows language. A neural network does not, so it needs to be trained to see patterns. One way is to show it words and ask it to output the next word:

But here’s the catch: it has to be trained so that it outputs the most probable word in context. After all, the cat could plausibly have sat on a table. And if we swap the word cat for king, then perhaps the target word is now more likely to be throne.
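To make "most probable word in context" concrete, here is a minimal sketch that simply counts which word follows a given context in a tiny corpus. The corpus and the function name next_word_probs are invented for illustration. Counting is only a stand-in for what a neural network learns; it is not how a Transformer works internally.

```python
from collections import Counter, defaultdict

# Toy corpus, invented purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "the cat sat on the table",
    "the king sat on the throne",
]

def next_word_probs(sentences, context_size):
    """Estimate P(next word | previous `context_size` words) by counting."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            counts[context][words[i]] += 1
    return {
        context: {w: c / sum(counter.values()) for w, c in counter.items()}
        for context, counter in counts.items()
    }

one_word = next_word_probs(corpus, 1)    # context: only the previous word
five_words = next_word_probs(corpus, 5)  # context: the whole sentence so far

print(one_word[("the",)])                              # many candidates
print(five_words[("the", "cat", "sat", "on", "the")])  # mat is now most likely
print(five_words[("the", "king", "sat", "on", "the")]) # throne
```

With only the as context, mat is one candidate among many. With the full sentence as context, the cat most likely sits on the mat and the king on the throne.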

How can we train a neural network to fill the gaps? The answer is called Attention.

The Attention Mechanism: Illuminating Contextual Significance

Let’s explore the core mechanism of the Transformer: Attention. It performs a seemingly simple, yet quite complex task: word association.

What is Attention?

Consider our example sentence: the cat sat on the _______

In guessing the missing word, the attention mechanism tries to figure out how to pay attention to other words as a clue for how to fill the gap:

Any of the preceding words could potentially influence what fills the gap. If all we had was the final word before the gap, the _______, then we would have a long list of candidates from the training vocabulary. However, as we take more words into account, we get a different picture:

If we consider only words that go with the, then it’s a long list. We show a few examples in purple. The word mat is possible, but maybe way down in the list of probabilities. Hence we show it as pale.

If we also pay attention to on, then all kinds of words might have an association. Some are follow-on words, like fire (as in on fire). Others are associated words, like mat, because on and mat appear together in this well-known rhyme.

If all we have to go on is sat on the, then perhaps floor is most likely.

We keep on going until we pay attention to all words in the sentence. Finally, although the word mat was weakly scored in isolation, it is highly probable once we pay attention to all the other words. In combination, it is now more probable than floor.
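The walk-through above can be sketched with a few made-up numbers. In this illustration, each context word contributes an association score to each candidate, the scores are added up, and a softmax turns the totals into probabilities. The scores table is entirely hypothetical, chosen only to reproduce the story told above. A real Transformer learns such associations rather than reading them from a table.

```python
import math

# Hypothetical association scores: context word -> candidate -> score.
scores = {
    "the": {"mat": 0.1, "floor": 0.3, "fire": 0.2},
    "on":  {"mat": 0.5, "floor": 0.6, "fire": 0.9},
    "sat": {"mat": 0.6, "floor": 1.0, "fire": 0.1},
    "cat": {"mat": 1.5, "floor": 0.2, "fire": 0.0},
}
candidates = ["mat", "floor", "fire"]

def softmax(values):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def candidate_probs(context_words):
    """Add up each candidate's scores over the context words we attend to."""
    totals = [sum(scores[w][c] for w in context_words) for c in candidates]
    return dict(zip(candidates, softmax(totals)))

for context in (["the"], ["on", "the"], ["sat", "on", "the"],
                ["cat", "sat", "on", "the"]):
    probs = candidate_probs(context)
    best = max(probs, key=probs.get)
    print(" ".join(context), "->", best)
```

With these invented numbers, floor wins when we attend only to sat on the, and mat overtakes it once cat is taken into account.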

Self-Attention and Learning

The Attention mechanism (AM) tells us which words to pay attention to in deciding the most probable word that belongs in the gap.

But how does the mechanism know how to pay attention?

This is what it learns by trying to guess the missing word. To begin with, the AM assumes that all words in a sentence have equal (or random) chance of paying attention to each other, shown here just for the word the:

At this point, the neural network could pick any random word to fill the gap, as it has no attention to guide it. Whatever it chooses, this output is compared to the original (training) sentence, and a loss function is used to measure how far the output is from the training data.

The error measurement is fed back and used to guide the AM in adjusting the attention weights until, over many examples and training cycles, the AM begins to learn patterns of attention that help it to guess the right word.
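As a rough sketch of this feedback loop, the snippet below starts with equal scores for three candidate words, measures the error with cross-entropy (one common choice of loss, assumed here), and nudges the scores to reduce that error. The candidate list, the learning rate, and the number of steps are arbitrary. A real Transformer adjusts millions or billions of weights through backpropagation, not three numbers.

```python
import math

candidates = ["mat", "floor", "table"]
target = "mat"              # the word found in the training sentence
logits = [0.0, 0.0, 0.0]    # equal scores: no preference to begin with
learning_rate = 0.5         # arbitrary step size for this illustration

def softmax(values):
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

target_index = candidates.index(target)
losses = []
for step in range(50):
    probs = softmax(logits)
    # Cross-entropy: large when the right word gets low probability.
    losses.append(-math.log(probs[target_index]))
    # The gradient of cross-entropy w.r.t. each logit is (prob - one_hot).
    for i in range(len(logits)):
        one_hot = 1.0 if i == target_index else 0.0
        logits[i] -= learning_rate * (probs[i] - one_hot)

final_probs = softmax(logits)
print("first loss:", round(losses[0], 3), "last loss:", round(losses[-1], 3))
print(dict(zip(candidates, (round(p, 3) for p in final_probs))))
```

The loss shrinks as training proceeds, and the probability assigned to the correct word rises accordingly.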

These patterns are of many kinds, such as the high probability of a noun following the word the:

Or, the attention pattern of a thing (like cat) often being associated with a verb (sat), which in grammar is called the subject-predicate pattern:

This style of learning is called self-attention because the attention we are learning is between words within the same sentence (i.e. self).
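For readers who like to see the arithmetic, here is a minimal sketch of self-attention in plain Python. It follows the commonly used scaled dot-product form: each word is a vector, the similarity between two words is their dot product, and a softmax turns similarities into attention weights. The word vectors are made up, and the learned projections that real Transformers apply to each vector first are left out as a simplification.

```python
import math

def softmax(values):
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Every word attends to every word in the same sequence."""
    dim = len(vectors[0])
    weights, outputs = [], []
    for query in vectors:
        # Similarity of this word to each word, scaled by sqrt(dimension).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        row = softmax(scores)
        weights.append(row)
        # New representation: attention-weighted mix of all word vectors.
        outputs.append(
            [sum(w * vec[j] for w, vec in zip(row, vectors)) for j in range(dim)]
        )
    return weights, outputs

# Made-up 4-dimensional vectors; real models learn these.
words = ["the", "cat", "sat"]
vectors = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.0],
]

weights, outputs = self_attention(vectors)
for word, row in zip(words, weights):
    print(word, [round(w, 2) for w in row])
```

Each printed row shows how much one word attends to every word in the sentence, and each row sums to 1.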

These examples are entirely plausible, although we have presented a somewhat simplified view of a transformer thus far. Things get more interesting when we combine lots of attention patterns.

Multi-headed, multi-layered

An Attention Mechanism (AM) can only output a single set of connections (attention) between words, such as the "the + noun" pattern above. This would not be enough to properly guess the missing word because, as we showed in the original example, different attention patterns could influence the final decision about the missing word.

In other words, the different colored arrows represent different types of attention. For example, if we didn’t have the red attention pattern that somehow related cat with mat, we might have been left with floor as more likely.

Therefore, we need multiple AMs, or what are called multiple attention heads. The mechanism for combining these is another neural-network layer that learns some function for how best to combine AM outputs, or different attention patterns, to get the final answer. The best combination will depend upon the input sequence.
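One simplified way to picture multiple heads is to give each head its own slice of every word vector, let each head compute its own attention pattern, and then join the results. The sketch below does exactly that. It is an illustration under assumptions: the function names and vectors are invented, real heads use learned projections rather than plain slices, and the learned layer that combines head outputs is omitted.

```python
import math

def softmax(values):
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values))
             for j in range(len(values[0]))]
        )
    return outputs

def multi_head_attention(vectors, num_heads):
    """Run one attention head per slice of the vectors, then concatenate."""
    head_dim = len(vectors[0]) // num_heads
    head_outputs = []
    for h in range(num_heads):
        part = [v[h * head_dim:(h + 1) * head_dim] for v in vectors]
        head_outputs.append(attention(part, part, part))
    # Join each word's per-head outputs back into one vector.
    return [
        sum((head[i] for head in head_outputs), [])
        for i in range(len(vectors))
    ]

# Made-up vectors for three words.
vectors = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.5, 1.0, 0.0],
]
combined = multi_head_attention(vectors, num_heads=2)
print([[round(x, 2) for x in row] for row in combined])
```

Because each head only sees its own slice, the two heads can settle on different attention patterns for the same sentence.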

[Note that it turns out that much of the “power” of a Transformer comes from the Feed Forward Network layer, at least in terms of how many of the model’s parameters are dedicated to this layer in comparison with the attention heads.]

Generalization: The Real Superpower

By giving the Transformer enough examples of different sentences, the AMs can slowly learn different attention patterns in the training set and the neural-network layer can slowly learn how best to combine them.

Consider the sequence: the boy ran across the busy ______

Ideally, we want the AM + neural layer combination to have learned enough patterns that when presented with a sequence it has not seen before (not in the training data), like the boy ran across the open ______, it can generalize and predict field.

Other options are possible: beach, meadow, street, park

But if we know that the previous sentence was: Whilst on holiday by the sea, then beach becomes more contextually relevant.

To get this kind of generalization power in a way highly useful for general language processing, it turns out we need 3 things:

  1. To stack many layers of AMs and neural layers
  2. To train over a truly vast set of examples
  3. To use a wide enough AM to handle really long passages of text (far greater than just a single sentence)

Stacking layers helps to find increasingly abstract and sophisticated attention pattern combinations. For example, perhaps the entire phrase holiday by the sea influences the word beach via the phrase the boy ran.

The pattern might be something like: event -> activity -> place.

Layers provide higher-order abstraction.

Vast examples give us lots of these attention patterns.

Text-width provides greater contextual understanding.

Combined, these give greater language understanding.

What is understanding?

The ability of Transformers to predict words can be extended, naturally, to entire sentences and even further to entire passages of text, generated word by word using the Attention mechanism as just described.
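Word-by-word generation can be sketched as a simple loop: predict the next word, append it, and repeat with the longer text. In the illustration below, a lookup table of counts stands in for the Transformer. The corpus, the two-word context, and the generate function are all invented for this example, and the loop always picks the single most frequent continuation.

```python
from collections import Counter, defaultdict

# Toy training sentences, invented for illustration.
corpus = [
    "the boy ran across the open field",
    "the boy ran across the open field",
    "the boy ran across the busy street",
]

CONTEXT = 2  # how many previous words the toy model looks at

# Count which word follows each pair of words in the corpus.
follow = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(CONTEXT, len(words)):
        follow[tuple(words[i - CONTEXT:i])][words[i]] += 1

def generate(prompt, max_new_words=10):
    """Extend the prompt one word at a time with the most likely next word."""
    words = prompt.split()
    for _ in range(max_new_words):
        context = tuple(words[-CONTEXT:])
        if context not in follow:
            break  # the toy model has never seen this context
        next_word, _count = follow[context].most_common(1)[0]
        words.append(next_word)
    return " ".join(words)

print(generate("the boy"))
```

Each new word becomes part of the context for the next prediction, which is the same loop a large model runs, only with a far richer way of choosing the next word.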

This works because it turns out that if the system is big enough and has seen enough data, it can find an attention pattern to fit most types of input. And, it also turns out that this is sufficient to generate outputs that are coherent.

The inputs are called prompts.

The system computes what text might plausibly follow the prompt, but it does so using mind-bogglingly complex combinations of massive numbers of attention patterns over multiple scales, from adjacent words to large abstract spans. These patterns are entirely opaque: we don’t know what they are.

As such, the model has no “understanding” of what it’s doing beyond the ability to continually add words that are highly probable in relation to the prompt and the entire training corpus via a massive array of nuanced attention (“word association”) patterns.

Some confusion arises when we use words like understanding because we have familiar human intuitions about such concepts and naturally want to project these onto Transformers. For a good discussion of the pitfalls of using such intuitions, we recommend the paper: “Talking About Large Language Models”.

For a more in-depth discussion of “understanding” and why it matters, see the follow-on post: “Do LLMs really grok our prompts? Does it matter?”


This has been a highly simplified version of how Transformers work, but accurate in its essence. A model built in this way on massive general datasets, like the web, is called a Foundation Model. This is because it has learned so many patterns that it can form the foundation for a great number of language tasks.

The next step for the enterprise is often to fine-tune the model so that the attention patterns and vocabulary are more aligned with domain-specific language and tasks. There could well be attention patterns that need more emphasis when dealing with, say, product-specific language contextualized to discussions found in a sales CRM.

Companies can use these models to unlock new language-processing capabilities in their organization. We advise adopting a holistic approach in order to fully utilize their magic.
