Tokenization from Scratch: BPE, SentencePiece, and Why It Matters
Build a BPE tokenizer from scratch, explore production tokenizers from GPT-2 to LLaMA 3, and learn why tokenization choices affect everything from math ability to multilingual cost.
Every LLM starts the same way: raw text goes in, numbers come out. That conversion — tokenization — is step zero of the entire pipeline, and the choices made here ripple through everything downstream: model capacity, inference cost, multilingual performance, even whether the model can do basic arithmetic.
We’ll build a BPE tokenizer from scratch, then look at how production models like GPT-4 and LLaMA 3 handle it — and why the differences matter more than most people think.
Why Not Characters or Words?
Character-level tokenization gives you a tiny vocabulary (~256 tokens for byte-level variants) but absurdly long sequences: the word “transformer” becomes 11 tokens. Attention is O(n²) in sequence length, so long sequences are expensive. Word-level tokenization has the opposite problem: a vocabulary of millions, and any word not in the vocabulary is simply unknown.
Subword tokenization hits the sweet spot. Common words like “the” stay as single tokens. Rare words like “unbelievable” get split into subwords like “un” + “believ” + “able”. The vocabulary stays manageable (32K–256K tokens) while handling any input text, including words the model has never seen.
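To make the contrast concrete, here is a quick sketch of the three granularities. The subword split shown is illustrative only; real splits are learned from data:

```python
word = "transformer"

# Character-level: one token per character -> long sequences
char_tokens = list(word)
print(len(char_tokens))  # 11 tokens for a single word

# Subword-level: a handful of learned pieces
# (illustrative split, not an actual BPE output)
subword_tokens = ["transform", "er"]
print(len(subword_tokens))  # 2 tokens

# Word-level: 1 token, but only if "transformer" made it into the vocabulary
```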
BPE from Scratch: The Algorithm Every LLM Uses
Byte Pair Encoding starts with individual characters and iteratively merges the most frequent adjacent pair. After enough merges, common words become single tokens while rare words stay decomposed into subwords. Here’s a minimal implementation:
import collections
import re

class SimpleBPE:
    """Minimal BPE tokenizer for educational purposes."""

    def __init__(self):
        self.merges = {}  # (pair) -> merged token
        self.vocab = {}   # token -> id

    def get_stats(self, vocab):
        """Count frequency of all adjacent token pairs."""
        pairs = collections.defaultdict(int)
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_vocab(self, pair, vocab):
        """Merge all occurrences of a pair in the vocabulary."""
        new_vocab = {}
        bigram = re.escape(' '.join(pair))
        # Match the pair only at token boundaries (whitespace-delimited)
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        for word, freq in vocab.items():
            new_vocab[pattern.sub(''.join(pair), word)] = freq
        return new_vocab

    def train(self, word_freq, num_merges):
        """Learn merges from a {word: frequency} dictionary."""
        # Start with each word as a sequence of space-separated characters
        vocab = {' '.join(word): freq
                 for word, freq in word_freq.items()}
        for i in range(num_merges):
            pairs = self.get_stats(vocab)
            if not pairs:
                break
            best_pair = max(pairs, key=pairs.get)
            vocab = self.merge_vocab(best_pair, vocab)
            self.merges[best_pair] = ''.join(best_pair)
The algorithm is simple: count pairs, merge the winner, repeat. After 30 merges on a small corpus, common pairs like t + h → th and th + e → the get merged first. The resulting vocabulary captures the statistical structure of the training data — frequent substrings become single tokens.
The notebook trains this BPE implementation on sample text and visualizes the merge order. You’ll see that the first merges are always the most common character pairs in English, and the later merges start capturing whole words.
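To see the merge order directly, the same count-and-merge loop can be run standalone on a toy corpus. The word frequencies below are made up for illustration; the notebook's actual corpus will produce a different (but similarly shaped) merge sequence:

```python
import collections
import re

# Toy corpus as word -> frequency (assumed data for illustration)
word_freq = {'the': 5, 'this': 3, 'that': 2, 'other': 1}
# Start from space-separated characters
vocab = {' '.join(w): f for w, f in word_freq.items()}

def get_stats(vocab):
    """Count frequencies of adjacent token pairs, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the pair with its concatenation."""
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    return {pattern.sub(''.join(pair), w): f for w, f in vocab.items()}

merge_order = []
for _ in range(4):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    merge_order.append(best)
    vocab = merge_vocab(best, vocab)

print(merge_order)  # [('t', 'h'), ('th', 'e'), ...]
```

On this corpus the first two merges are exactly t + h → th and th + e → the, because “th” appears in every word and “the” is the most frequent word.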
Production Tokenizers: GPT-2 to LLaMA 3
Real models use optimized versions of BPE. The vocabulary sizes have grown significantly over the years: GPT-2 uses 50,257 tokens, GPT-4 jumped to 100,256, and LLaMA 3 uses 128,256. Larger vocabularies mean common words and phrases are more likely to be single tokens, which reduces sequence length and speeds up inference.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('gpt2')
text = "Hello, world! This is tokenization."
tokens = tokenizer.encode(text)
token_strings = tokenizer.convert_ids_to_tokens(tokens)
print("Tokens (ids):", tokens)
print("Tokens (strings):", token_strings)
# Output:
# Tokens (ids): [15496, 11, 995, 0, 770, 318, 11241, 1634, 13]
# Tokens (strings): ['Hello', ',', 'Ġworld', '!', 'ĠThis', 'Ġis', 'Ġtoken',
#                    'ization', '.']
Notice the Ġ prefix — that’s GPT-2’s way of marking tokens that had a leading space. The word “tokenization” gets split into “token” + “ization” because the full word wasn’t frequent enough to earn its own token. This is BPE in action: the model handles the word it’s never seen by decomposing it into familiar subwords.
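The Ġ itself comes from GPT-2's byte-to-unicode table: bytes that are not standalone printable characters (the space, byte 0x20, included) are shifted up by 256 code points so that every byte gets a visible character. That puts the space at U+0120, which renders as Ġ:

```python
# GPT-2 remaps unprintable bytes to code points 256 slots higher so that
# every token string is visible. Byte 0x20 (space) lands on U+0120, 'Ġ'.
space_marker = chr(ord(" ") + 256)
print(space_marker)  # Ġ
```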
Tokenization Quirks That Break Things
Tokenization has surprising behaviors that directly affect what models can and can’t do. The notebook demonstrates several of these:
Leading spaces change everything. “dog” and “ dog” produce completely different token IDs. This is why prompt formatting matters — an extra space can shift the entire tokenization.
Numbers get split arbitrarily. The number “123456” might tokenize as [“123”, “456”] or [“1234”, “56”] depending on the tokenizer. This is why LLMs struggle with arithmetic — they don’t see numbers as numbers, they see arbitrary substrings. The model has to learn that “123” + “456” followed by “=” should produce “579” even though the digit boundaries don’t align with the token boundaries.
Non-English text is expensive. GPT-2’s tokenizer was trained primarily on English data, so a Chinese or Arabic sentence uses far more tokens than the equivalent English sentence. The notebook compares token counts across eight languages — Japanese text uses roughly 3× more tokens than English for the same content, which means 3× the compute cost and 3× the context window consumed.
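These quirks are visible even before any merges happen, at the raw UTF-8 byte level that byte-level BPE starts from:

```python
# A leading space changes the byte (and hence token) sequence entirely
print(list("dog".encode("utf-8")))   # [100, 111, 103]
print(list(" dog".encode("utf-8")))  # [32, 100, 111, 103]

# Non-English scripts need more bytes per character, so a tokenizer
# trained mostly on English has far fewer merges to compress them
english = "hello"
chinese = "你好"  # 2 characters
print(len(english.encode("utf-8")))  # 5 bytes for 5 characters
print(len(chinese.encode("utf-8")))  # 6 bytes for 2 characters (3 each)
```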
Special Tokens: The Control Layer
Beyond regular text tokens, every model uses special tokens that provide structure: <bos> marks the start of a sequence, <eos> tells the model to stop generating, and <pad> fills shorter sequences in a batch to equal length. The attention mask then tells the model to ignore the padding positions — without it, the model would try to attend to meaningless pad tokens.
# Padding: making a batch of different-length sequences equal length
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained('gpt2')
tok.pad_token = tok.eos_token  # GPT-2 has no pad token by default
texts = ["Hello", "Hello, world!", "This is a longer sentence."]
batch = tok(texts, padding=True, return_tensors='pt')
# input_ids: [[15496, 50256, 50256, 50256, 50256],
#             [15496, 11, 995, 0, 50256],
#             [ 1212, 318, 257, 2392, 6827]]
# attention_mask: [[1, 0, 0, 0, 0],
#                  [1, 1, 1, 1, 0],
#                  [1, 1, 1, 1, 1]]
The attention mask is a binary tensor: 1 for real tokens, 0 for padding. Inside the model it is converted into a large negative bias added to the attention scores before the softmax, which drives the attention weights at pad positions to zero. Getting this wrong is a common source of subtle training bugs.
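A minimal sketch of what the mask does inside the softmax, in pure Python standing in for the tensor version:

```python
import math

def masked_softmax(scores, mask):
    # mask: 1 = real token, 0 = padding; masked positions get -inf,
    # so exp() sends their attention weight to exactly zero
    biased = [s if m == 1 else float('-inf') for s, m in zip(scores, mask)]
    mx = max(biased)
    exps = [math.exp(s - mx) for s in biased]
    total = sum(exps)
    return [e / total for e in exps]

weights = masked_softmax([2.0, 1.0, 3.0, 0.5, 0.1], [1, 1, 1, 1, 0])
print(weights)  # last (padded) position gets weight 0.0
```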
Chat Templates: How Models Know Who’s Talking
Instruction-tuned models need to distinguish between system prompts, user messages, and assistant responses. Each model family uses a different chat template format. LLaMA 3 uses special tokens like <|start_header_id|> and <|eot_id|> to mark role boundaries. ChatML uses <|im_start|> and <|im_end|>. The simpler Alpaca format just uses ### Instruction: and ### Response: text markers.
Using the wrong chat template is one of the most common mistakes when fine-tuning or deploying models. If you train with ChatML format but serve with Alpaca format, the model won’t behave as expected — it has never seen that token pattern during training. HuggingFace’s apply_chat_template method handles this automatically, and the notebook shows how each format looks under the hood.
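For intuition about what a template actually produces, here is a hand-rolled sketch of the ChatML layout. In practice you should rely on tokenizer.apply_chat_template rather than string formatting, since each model family encodes its markers differently:

```python
def to_chatml(messages):
    """Format a message list in ChatML style (educational sketch only)."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    # Open an assistant turn so the model knows it should respond next
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
])
print(prompt)
```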
Vocabulary Size: The Tradeoff
Vocabulary size is one of the most important architectural decisions. A larger vocabulary means shorter sequences (more words fit as single tokens), which speeds up inference and reduces attention cost. But a larger vocabulary also means a larger embedding matrix — the embedding layer for LLaMA 3’s 128K vocabulary at dimension 4096 is already ~525M parameters, roughly 7% of the 8B model.
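The back-of-the-envelope arithmetic:

```python
# Embedding matrix size for LLaMA 3: one d_model-sized row per vocab entry
vocab_size = 128_256
d_model = 4096
embed_params = vocab_size * d_model
print(f"{embed_params / 1e6:.0f}M parameters")     # 525M
print(f"{embed_params / 8e9:.1%} of an 8B model")  # 6.6%
```

And if the output (unembedding) matrix is untied from the input embeddings, that cost is paid twice.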
The trend has been toward larger vocabularies. GPT-2 (2019) used 50K tokens. LLaMA 3 (2024) uses 128K. Gemma uses 256K. Larger vocabularies also help with multilingual performance — there’s room for more non-English subwords.
What to Do Next
The notebook includes the full BPE implementation, token visualization utilities, cross-language comparisons, and exercises on tokenization quirks. Everything runs on CPU — no GPU needed.
Open the notebook in Google Colab — runs entirely on CPU in about 60 minutes.
Next in this series: Embeddings Explained — how token IDs become the dense vector representations that transformers actually compute with.
This post is part of TheAiSingularity’s LLM Engineering Course — 64 notebooks, 20 capstone projects, fully open source.
