While people use generative AI for tasks such as creating customized study materials, brainstorming ideas, or coding, few understand the inner workings of these models. The architecture behind tools like ChatGPT is a powerful model known as the transformer. Transformers were first introduced in 2017 in a research paper titled “Attention Is All You Need” [1]. At the time, the most effective AI language models were recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). However, these models were computationally inefficient, struggled to capture nuances in language, and were difficult to scale as datasets grew in size.
On the other hand, transformers rely on a different approach: instead of processing text sequentially, they take in large chunks of text and use the self-attention mechanism to connect words across a long sequence. By using parallel processing, transformers overcame the computational cost limitations that came with traditional neural networks. These design innovations allowed transformers to scale to billions of parameters, significantly improving natural language comprehension and the robustness of new language models [1, 2]. As transformers became more adaptable, scalable, and context-aware, the modern Large Language Models (LLMs) that we use day-to-day, such as ChatGPT, Claude, and DeepSeek, became more prominent. In addition to self-attention and parallel processing, other essential components of the transformer architecture include tokenization, embeddings, the multi-head attention block, the MLP layer, and the output probabilities.
The goal of a transformer is to predict the next word given a sequence of text. LLMs such as ChatGPT generate text by constantly estimating probabilities of next words. They then select one that best fits the context of the text sequence, similar to how phones generate suggested words while texting. LLMs seem “human” in their responses because they are trained on vast datasets that contain examples of text from the internet, a process through which they “learn” the nuances of human language. Although seemingly simple, the process by which transformers predict the next word requires a deep understanding of probability, statistics, and linear algebra. However, in this article we will explore the transformer on a fundamental and intuitive level, without the complex math behind the scenes [3].
In order for a transformer to predict the next word in a sequence, the input sequence in human language must be converted into something that it can interpret, similar to how computers use binary code to understand human commands. In the transformer architecture, the input embedding block is responsible for this conversion. For ChatGPT, the input text is converted into tokens through tokenization, which breaks down the word sequence into individual words or subwords [4]. For simplicity, it can be assumed that each token is a word, but in reality words are broken down into smaller units depending on their length, including common prefixes, roots, and suffixes.
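To make this splitting concrete, here is a toy sketch of a greedy longest-match subword tokenizer. The vocabulary below is invented purely for illustration; real tokenizers, such as the byte-pair encoding used by GPT models, learn their subword vocabularies from data.

```python
# Toy greedy longest-match subword tokenizer (illustrative only; real
# tokenizers like GPT's byte-pair encoding learn vocabularies from data).
VOCAB = {"un", "break", "able", "play", "ing", "cat", "s"}

def tokenize(word, vocab=VOCAB):
    """Split a word into the longest subwords found in the vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest possible match first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No match: fall back to a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens

print(tokenize("unbreakable"))  # ['un', 'break', 'able']
print(tokenize("playing"))      # ['play', 'ing']
```

Notice how “unbreakable” decomposes into a prefix, a root, and a suffix, exactly the kind of units the article describes.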
Each token is then encoded into a separate high dimensional vector, or embedding, that stores the meaning of the word it represents [5]. These vector embeddings can reach thousands of dimensions and are impossible to visualize directly. However, researchers are able to visualize them by reducing their size to a two or three dimensional space. In this space, words with similar semantics are positioned close together. For example, the words “man” and “king” would point in very similar directions, since they both denote a male human being. When processing a group of words, determining a word’s semantic meaning also depends on its relationship to other words, or context. As a result, input token embeddings have an additional positional encoding vector added to them, which provides each token with information about its location in the text sequence [5, 6]. This allows transformers to have a deeper understanding of the meaning of the token, since a word could mean different things in different contexts. Through intricate embeddings of words and positional encodings, transformers are able to process and understand text.
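A minimal sketch of this step, using the sinusoidal positional encoding from the original transformer paper; the tiny vocabulary, the embedding dimension, and the random embedding values are all toy stand-ins for what a real model learns.

```python
import numpy as np

# Toy embedding table plus sinusoidal positional encodings (dimensions
# and values are illustrative; real models learn embeddings with
# thousands of dimensions).
d_model = 8
vocab = ["the", "young", "boy", "adopted", "a", "fluffy", "gray", "cat"]
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

def positional_encoding(seq_len, d_model):
    """Each position gets a unique pattern of sine and cosine waves."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

tokens = ["the", "young", "boy"]
ids = [vocab.index(t) for t in tokens]
# Token meaning + position information, combined by simple addition.
x = embedding_table[ids] + positional_encoding(len(ids), d_model)
print(x.shape)  # (3, 8) -- one position-tagged vector per token
```

The key design choice is that position is *added* to meaning: the same word at two different positions ends up with two slightly different vectors.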
The embeddings of the input then enter the multi-head attention block of the transformer, where the mechanism of self-attention enables the model to understand the relationships and semantics between words in a sequence [1]. Prior to the transformer, RNNs and LSTMs could only contextualize small numbers of tokens. Now, the transformer’s self-attention capability allows a word to be contextualized over a much larger sequence of tokens. Each word’s linguistic components (meaning, syntax, and relationships) are determined by weighing the relevance of its surrounding words to itself [7]. To interpret self-attention more intuitively, consider the sentence “The young boy adopted a fluffy gray cat”: the word “young” specifies the description of the boy, similar to how “fluffy” and “gray” both affect the meaning of the cat. On the other hand, words like “the” or “a” matter less in determining semantics. Essentially, each word determines which other words in the sentence influence its meaning; if another word is significant, information from its embedding is folded into the word’s own, creating an embedding that holds greater semantics.
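This weighing-and-blending idea can be sketched as scaled dot-product self-attention over a tiny sequence. The projection matrices below are random stand-ins for the weights a real model learns during training.

```python
import numpy as np

# Minimal scaled dot-product self-attention (toy sizes, random weights).
rng = np.random.default_rng(1)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))       # token embeddings

W_q = rng.normal(size=(d_model, d_model))     # learned in a real model
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / np.sqrt(d_model)    # relevance of every word to every other word
weights = softmax(scores, axis=-1)     # each row is a distribution summing to 1
out = weights @ V                      # each token becomes a weighted blend

print(weights.sum(axis=-1))  # each row sums to ~1.0
```

Each row of `weights` says, for one token, how much every token in the sentence (itself included) should contribute to its updated embedding.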
In the transformer block, multi-head attention runs self-attention in parallel across many tokens, which not only improves the model’s efficiency in understanding text, but also allows the model to focus on different parts of the input simultaneously. In fact, these multiple “heads” can each examine relationships between tokens from different perspectives. For example, one may consider syntax while another might examine semantics [8]. The learned meanings of each token are then applied to the original embedding of the token, transforming the embedding to capture a more contextualized version of that word. Because each token now carries a much deeper and richer meaning, the transformer model is able to generate text with knowledge of the relationships between tokens in the input, allowing for significantly more context-aware and coherent responses [2, 8]. This self-attention mechanism gives modern LLMs the ability to understand complex natural language and generate a relevant and effective response in various contexts.
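A rough sketch of how the heads might be wired together, again with toy sizes and random stand-in weights: each head projects the embeddings into its own smaller subspace, attends to the sequence independently, and the head outputs are concatenated back to the full embedding size.

```python
import numpy as np

# Sketch of multi-head attention: the embedding dimension is split
# across heads, each with its own (random stand-in) projections.
rng = np.random.default_rng(2)
seq_len, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return w @ v

heads = []
for h in range(n_heads):
    # Separate projections let each head weigh token relationships
    # from a different "perspective" (e.g. syntax vs. semantics).
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(x @ W_q, x @ W_k, x @ W_v))

out = np.concatenate(heads, axis=-1)   # back to (seq_len, d_model)
print(out.shape)  # (4, 8)
```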
In addition to the multi-head attention block, further refinements to each token’s representation are made in the Multilayer Perceptron (MLP) layer. Instead of each token interacting with other tokens in the sequence, as in the self-attention block, in the MLP layer each embedded representation of a token is passed through separately for adjustment. In this block, each token is essentially examined through various questions that clarify its meaning [8]. These changes are then applied to the embedding, enhancing its overall representation of the token’s meaning and syntax.
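A minimal sketch of this position-wise MLP, with toy sizes and random stand-in weights. The important property is that each token’s vector is expanded, passed through a nonlinearity, and projected back with no interaction between tokens.

```python
import numpy as np

# Position-wise MLP sketch: the same two-layer network is applied to
# every token independently (toy sizes, random stand-in weights).
rng = np.random.default_rng(3)
seq_len, d_model, d_hidden = 4, 8, 32   # real models often use ~4 * d_model
x = rng.normal(size=(seq_len, d_model))

W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

def mlp(token_vec):
    """Expand, apply a nonlinearity, and project back -- one token at a time."""
    hidden = np.maximum(0, token_vec @ W1 + b1)   # ReLU nonlinearity
    return hidden @ W2 + b2

out = np.stack([mlp(t) for t in x])   # no token sees any other token
print(out.shape)  # (4, 8)
```

Because the loop handles one token at a time, changing one token’s vector cannot affect another’s output here, in contrast to the attention block above.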
The input doesn’t go through the transformer block and the MLP layer just once. Instead, these two components are stacked many times, allowing each token in the sequence to gradually encapsulate more meaning and complexity. Although this process seems long, transformers take only fractions of a second to predict the next word, which is why ChatGPT can produce long texts in a few seconds. After passing through all transformer operations, the model focuses on the last embedding, or word, in the text and applies a mathematical operation known as the softmax function to it, creating a probability distribution over possible next words [1, 8]. The final embedding of a text sequence is similar to the blank in a fill-in-the-blank question. If you read a chunk of text and are asked to predict the next word, you wouldn’t consider the last word alone, because by itself it could fit countless contexts. Instead, you would consider the context and the relationships among the relevant words in the text before predicting the next word.
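The final softmax step can be sketched in a few lines; the four-word vocabulary and the logit scores below are made-up values for illustration only.

```python
import numpy as np

# Final prediction step sketch: raw scores (logits) over a made-up
# vocabulary are turned into a probability distribution by softmax.
vocab = ["dog", "cat", "house", "ran"]
logits = np.array([1.2, 3.0, 0.1, 0.5])   # hypothetical per-word scores

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

probs = softmax(logits)
prediction = vocab[int(np.argmax(probs))]

print(dict(zip(vocab, probs.round(3))))
print(prediction)  # 'cat' -- the highest-probability next word
```

In practice the model does not always take the single highest-probability word; sampling from this distribution is what lets the same prompt yield varied responses.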
The development of the transformer model has propelled the field of AI to where it is today, with models able to understand and produce text at a high level of sophistication. Soon after the transformer’s introduction, new transformer architectures such as BERT and GPT were created, allowing for even deeper understanding of text and more efficient generation [9]. Although the transformer was initially designed for natural language, it was later applied to image generation, computer vision, and audio processing. Today, transformer models are used across a wide range of fields, from customer-service chatbots to protein-folding predictors in healthcare to the consumer LLM tools used every day. Even though these models have yet to attain human consciousness, their ability to learn and produce across various fields makes the transformer a core component of progress. As research continues, the transformer remains the foundation of artificial intelligence, setting the stage for further advancements in the field [10].