Building a Transformer Block: From Attention to Complete GPT

Build a complete decoder-only transformer from scratch — RMSNorm, SwiGLU, residual connections, and the full GPT architecture. Working PyTorch code included.

In the previous post, we built self-attention from raw matrix operations. Now we assemble all the pieces into a complete decoder-only transformer — the architecture behind GPT, LLaMA, Mistral, and most production LLMs today.

A transformer block is surprisingly simple once you see it: self-attention, a feed-forward network, residual connections, and normalization. Stack N of these blocks and you have GPT. We’ll build each component, then wire them together into a working model that can actually generate text.

RMSNorm: Why Modern LLMs Dropped LayerNorm

Every transformer block needs normalization to keep activations stable during training. GPT-2 used LayerNorm, which centers the input to zero mean and unit variance. LLaMA and most modern models switched to RMSNorm — it skips the centering step and just divides by the root mean square. Fewer operations, roughly 10% faster, works just as well.

import torch
import torch.nn as nn
import torch.nn.functional as F  # F.silu is used in SwiGLU below

class RMSNorm(nn.Module):
    """Root Mean Square Normalization (used in LLaMA)."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).sqrt()
        return x / rms * self.weight

LayerNorm has 2 × d_model parameters (weight + bias). RMSNorm has only d_model (weight only). The difference matters when you stack 80 layers like LLaMA 3 70B.
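A quick sanity check of those counts, using a bare Parameter to stand in for RMSNorm's single learnable scale vector:

```python
# Comparing parameter counts at d_model = 768 (GPT-2 Small's width).
import torch
import torch.nn as nn

d_model = 768
ln = nn.LayerNorm(d_model)                      # weight + bias
rms_weight = nn.Parameter(torch.ones(d_model))  # RMSNorm: weight only

ln_params = sum(p.numel() for p in ln.parameters())
print(ln_params)           # 1536 = 2 * d_model
print(rms_weight.numel())  # 768 = d_model
```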

There’s also a placement question: GPT-2 applied normalization after attention and FFN (post-norm). Modern models apply it before (pre-norm). Pre-norm training is more stable — gradients flow more directly through the residual stream without being squashed by normalization.
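The two placements differ only in where the norm sits relative to the residual add. A side-by-side sketch, with plain Linear layers standing in for the attention and FFN sublayers:

```python
# Post-norm (GPT-2) vs pre-norm (modern) placement, sketched with
# placeholder sublayers so the comparison runs standalone.
import torch
import torch.nn as nn

d = 64
norm1, norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
attn, ffn = nn.Linear(d, d), nn.Linear(d, d)  # placeholder sublayers

def post_norm(x):
    x = norm1(x + attn(x))   # normalize after the residual add
    return norm2(x + ffn(x))

def pre_norm(x):
    x = x + attn(norm1(x))   # normalize before the sublayer
    return x + ffn(norm2(x))

x = torch.randn(2, 8, d)
print(post_norm(x).shape, pre_norm(x).shape)  # both torch.Size([2, 8, 64])
```

In the pre-norm version the residual stream `x` is never normalized in place, which is exactly what keeps its gradient path direct.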

SwiGLU: The Feed-Forward Network That Won

After attention mixes information across tokens, the FFN processes each token independently. It’s where much of the model’s “knowledge” is stored — the factual associations learned during pre-training.

GPT-2 used a simple two-layer FFN with GELU activation. LLaMA introduced SwiGLU, which adds a gating mechanism — one projection computes a gate (with SiLU/Swish activation), another computes a value, and they’re multiplied element-wise before the down-projection.

class SwiGLU(nn.Module):
    """SwiGLU feed-forward network (LLaMA-style)."""

    def __init__(self, d_model, d_ff):
        super().__init__()
        self.gate = nn.Linear(d_model, d_ff, bias=False)
        self.up   = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

Notice SwiGLU has three weight matrices instead of two, so the parameter count is 3 × d_model × d_ff vs 2 × d_model × d_ff for standard FFN. LLaMA compensates by using a smaller d_ff ratio — about 2.67× d_model instead of the traditional 4×. The notebook benchmarks both: SwiGLU consistently outperforms standard FFN at equal parameter counts.
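The sizing arithmetic is worth checking: setting 3 × d_model × d_ff equal to the standard FFN's 2 × d_model × (4 × d_model) gives d_ff = 8/3 × d_model. A sketch at d_model = 4096, LLaMA-7B's width (the real model rounds d_ff up to the hardware-friendly 11008):

```python
# Matching SwiGLU's parameter count to the standard 4x FFN.
d_model = 4096
std_ffn_params = 2 * d_model * (4 * d_model)  # up + down projections
d_ff = int(8 * d_model / 3)                   # ~2.67x d_model
swiglu_params = 3 * d_model * d_ff            # gate + up + down
print(d_ff)                                   # 10922
print(std_ffn_params, swiglu_params)          # nearly equal
```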

The Complete GPT Architecture

With attention, normalization, and FFN in hand, here’s the full picture. A GPT model is: token embeddings + positional embeddings, then N transformer blocks, then a final norm and linear projection to vocabulary logits.

from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 50257
    max_seq_len: int = 1024
    d_model: int = 768
    n_layers: int = 12
    n_heads: int = 12
    d_ff: int = 3072         # 4 × d_model
    dropout: float = 0.1

class TransformerBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.norm1 = nn.LayerNorm(config.d_model)
        self.attn = CausalSelfAttention(config)
        self.norm2 = nn.LayerNorm(config.d_model)
        self.ffn = nn.Sequential(
            nn.Linear(config.d_model, config.d_ff),
            nn.GELU(),
            nn.Linear(config.d_ff, config.d_model),
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        # Pre-norm + residual connections
        x = x + self.attn(self.norm1(x))
        x = x + self.ffn(self.norm2(x))
        return x

class GPT(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.tok_emb = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_emb = nn.Embedding(config.max_seq_len, config.d_model)
        self.blocks = nn.ModuleList(
            [TransformerBlock(config) for _ in range(config.n_layers)]
        )
        self.ln_f = nn.LayerNorm(config.d_model)
        self.head = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # Weight tying: share embedding weights with output head
        self.head.weight = self.tok_emb.weight

    def forward(self, idx):
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = tok + pos
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        return self.head(x)

That’s a working GPT. A few things to notice in this implementation:

Weight tying — the token embedding matrix and the output projection head share the same weights. For GPT-2’s 50,257-token vocabulary at d_model = 768, that saves about 38.6M parameters (50,257 × 768) and empirically improves performance. The intuition: the embedding that maps “cat” to a vector should be related to the output vector that predicts “cat”.
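You can verify the sharing with bare Embedding and Linear modules. It works because nn.Linear(d_model, vocab_size) stores its weight as (vocab_size, d_model), the same shape as the embedding table:

```python
# Verifying that tying makes one tensor back both modules.
import torch.nn as nn

vocab_size, d_model = 50257, 768
tok_emb = nn.Embedding(vocab_size, d_model)
head = nn.Linear(d_model, vocab_size, bias=False)
head.weight = tok_emb.weight  # one tensor now backs both modules

print(head.weight is tok_emb.weight)  # True
print(tuple(head.weight.shape))       # (50257, 768)
```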

Residual connections — the x = x + attn(norm(x)) pattern. Without residuals, a 12-layer transformer has a vanishing gradient problem. Residuals let the gradient flow directly from the output back to the input, creating a “gradient highway” through the network.
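A small demonstration of that highway, with tanh layers standing in for sublayers (tanh plus default Linear initialization shrinks gradients noticeably over 12 layers):

```python
# Gradient magnitude at the input of a 12-layer stack, with and
# without residual connections.
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = [nn.Sequential(nn.Linear(64, 64), nn.Tanh()) for _ in range(12)]

def run(x, residual):
    for layer in layers:
        x = x + layer(x) if residual else layer(x)
    return x.sum()

grads = {}
for residual in (False, True):
    x = torch.randn(1, 64, requires_grad=True)
    run(x, residual).backward()
    grads[residual] = x.grad.norm().item()
    print(residual, grads[residual])
```

The residual version keeps an identity path from output to input, so its input gradient stays orders of magnitude larger.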

Pre-norm — normalization before attention/FFN rather than after. This is the modern standard. Post-norm (GPT-2 style) requires careful learning rate warmup; pre-norm trains reliably without it.
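To generate text with a model like this, you loop: feed the sequence, take the logits at the last position, pick the next token, append, repeat. A minimal greedy version, shown with a tiny stand-in language model so the sketch runs on its own (the GPT class above plugs in the same way):

```python
# Minimal greedy decoding loop; TinyLM is a placeholder model that
# maps (B, T) token ids to (B, T, vocab) logits.
import torch
import torch.nn as nn

vocab_size, max_seq_len = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, 16)
        self.head = nn.Linear(16, vocab_size)

    def forward(self, idx):  # (B, T) -> (B, T, vocab)
        return self.head(self.emb(idx))

@torch.no_grad()
def generate(model, idx, max_new_tokens):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_seq_len:]           # crop to context window
        logits = model(idx_cond)[:, -1, :]         # last position only
        next_id = logits.argmax(-1, keepdim=True)  # greedy pick
        idx = torch.cat([idx, next_id], dim=1)
    return idx

out = generate(TinyLM(), torch.zeros(1, 1, dtype=torch.long), 10)
print(out.shape)  # torch.Size([1, 11])
```

Swapping `argmax` for sampling from `softmax(logits / temperature)` gives the temperature-controlled sampling most chat models use.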

Where the Parameters Actually Live

For GPT-2 Small (124M parameters), the breakdown is roughly: ~46% in FFN layers, ~23% in attention projections, and ~32% in embeddings, most of that the 50,257 × 768 token table. The FFN dominates within the blocks because each layer costs 2 × d_model × d_ff = 2 × 768 × 3072 ≈ 4.7M parameters, times 12 layers. The notebook visualizes this breakdown across all GPT-2 sizes: the embedding share shrinks as models grow, while the roughly 2:1 FFN-to-attention ratio stays remarkably consistent.
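A back-of-envelope recount, assuming tied embeddings and ignoring biases and norms:

```python
# Where GPT-2 Small's parameters sit (approximate; biases and
# norm weights omitted, embeddings counted once due to tying).
d_model, d_ff, n_layers, vocab, seq = 768, 3072, 12, 50257, 1024

ffn = n_layers * 2 * d_model * d_ff      # up + down per layer
attn = n_layers * 4 * d_model * d_model  # Q, K, V, output projections
emb = vocab * d_model + seq * d_model    # token + positional tables
total = ffn + attn + emb
for name, n in [("ffn", ffn), ("attn", attn), ("emb", emb)]:
    print(f"{name}: {n / total:.0%} of {total:,}")
```

The percentages shift depending on whether you fold the ~39M-parameter embedding table into the total; within the blocks themselves the FFN-to-attention ratio is close to 2:1.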

Scaling Laws: Balancing Parameters and Data

The Chinchilla scaling law (Hoffmann et al., 2022) showed that model parameters and training data should scale together, at roughly 20 training tokens per parameter. A 7B model should train on ~140B tokens; a 70B model needs ~1.4T. Undertrain a large model and you waste compute. Overtrain a small model and you hit diminishing returns.
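The token budgets quoted above follow from a rule of thumb of roughly 20 tokens per parameter:

```python
# Chinchilla rule of thumb: compute-optimal training uses
# roughly 20 tokens per parameter.
TOKENS_PER_PARAM = 20
for params_b in (7, 70):
    print(f"{params_b}B params -> ~{params_b * TOKENS_PER_PARAM}B tokens")
```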

The notebook plots loss curves across model sizes and training budgets, showing the power-law relationship between compute and performance that has driven the industry’s scaling decisions.

LLaMA-Style: The Modern Recipe

If you were building a production LLM today, you wouldn’t use the GPT-2 architecture. You’d use the LLaMA recipe: RMSNorm instead of LayerNorm, SwiGLU instead of GELU FFN, Rotary Position Embeddings (RoPE) instead of learned positional embeddings, Grouped Query Attention instead of standard multi-head attention, and no bias terms in linear layers.

Each change is individually small but they compound. The notebook implements both architectures side-by-side — GPT-2 style and LLaMA style — so you can see exactly what changed and why each decision was made.

What to Do Next

The notebook includes the full GPT and LLaMA implementations, parameter breakdown visualizations, scaling law plots, and a working text generation loop that runs on a free Colab T4.

Open the notebook in Google Colab — runs on a free T4 GPU in about 60 minutes.

Next in this series: Tokenization from Scratch — how text becomes the integer sequences that transformers actually process, and why the choice of tokenizer matters more than most people think.

This post is part of TheAiSingularity’s LLM Engineering Course — 64 notebooks, 20 capstone projects, fully open source.
