Building a Transformer Block: From Attention to Complete GPT
Build a complete decoder-only transformer from scratch — RMSNorm, SwiGLU, residual connections, and the full GPT architecture. Working PyTorch code included.