Fine-Tuning LLMs: Full Fine-Tuning vs LoRA vs QLoRA
The complete guide to fine-tuning LLMs — from full fine-tuning to LoRA to QLoRA. Implement LoRA from scratch, use the PEFT library, and fine-tune 7B models on consumer GPUs.
Pretrained LLMs are general-purpose. To make them useful for your specific task — following instructions, classifying support tickets, generating code in your style — you need to fine-tune. The question is how: full fine-tuning updates every parameter and costs a fortune in GPU hours. LoRA updates less than 1% of parameters and gets you 95% of the quality. QLoRA adds 4-bit quantization so you can fine-tune a 7B model on a single consumer GPU.
We’ll implement LoRA from scratch, then use the PEFT library to fine-tune GPT-2 on instruction data — covering the full spectrum from full fine-tuning to QLoRA.
Full Fine-Tuning: The Expensive Baseline
Full fine-tuning updates every parameter in the model. For a 7B model in BF16, that means storing the model weights (14 GB), gradients (14 GB), and optimizer states (56 GB for AdamW) — roughly 84 GB of VRAM just to start. It produces the best results but most teams don’t have the hardware.
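The arithmetic behind that figure is worth making explicit. A back-of-the-envelope sketch (activations and framework overhead excluded, so real usage is higher):

```python
def full_ft_vram_gb(n_params_b: float) -> dict:
    """Rough VRAM budget for full fine-tuning in BF16 with AdamW.

    Activations, KV caches, and framework overhead are excluded,
    so actual usage is higher than this floor.
    """
    weights = 2 * n_params_b      # BF16 weights: 2 bytes/param
    grads = 2 * n_params_b        # BF16 gradients: 2 bytes/param
    optim = 2 * 4 * n_params_b    # AdamW: two FP32 moment buffers, 4 bytes each
    return {'weights_gb': weights, 'grads_gb': grads,
            'optimizer_gb': optim, 'total_gb': weights + grads + optim}

print(full_ft_vram_gb(7))
# {'weights_gb': 14, 'grads_gb': 14, 'optimizer_gb': 56, 'total_gb': 84}
```

The optimizer dominates: AdamW's two FP32 moment buffers alone cost four times the BF16 weights.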
The notebook demonstrates full fine-tuning on two tasks: classification (DistilBERT on SST-2 sentiment data using HuggingFace Trainer) and instruction following (GPT-2 on instruction-response pairs in Alpaca format). Both use the same Trainer API — the only difference is the data format and the loss function.
Catastrophic Forgetting: The Hidden Cost
Fine-tuning has a dark side: the model “forgets” what it learned during pretraining. Train GPT-2 on instruction data for too many epochs, and its perplexity on general text goes up — it gets worse at things it used to do well. The notebook measures this directly, computing perplexity on both instruction-formatted text (should improve) and general text (should stay stable, but often doesn’t).
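The measurement itself is simple: perplexity is the exponential of the mean next-token cross-entropy. A minimal sketch of the helper (the GPT-2 usage at the bottom is illustrative; the notebook's exact evaluation texts will differ):

```python
import math
import torch

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity = exp(mean next-token cross-entropy) over the text."""
    enc = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        # HuggingFace causal LMs return the LM loss when labels are passed
        out = model(**enc, labels=enc['input_ids'])
    return math.exp(out.loss.item())

# Usage (downloads GPT-2 on first run):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained('gpt2')
# gpt2 = AutoModelForCausalLM.from_pretrained('gpt2')
# print(perplexity(gpt2, tok, 'Some general text...'))   # before vs after FT
```

Run it on both text types before and after fine-tuning; a rising score on general text is the forgetting signal.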
Mitigations: use a small learning rate (around 1e-5 to 5e-5 for full fine-tuning), train for fewer epochs, mix in some pretraining data, or — better yet — use LoRA, which avoids catastrophic forgetting by design since the original weights stay frozen.
LoRA from Scratch: The Math Is Simple
LoRA’s key insight: weight updates during fine-tuning are approximately low-rank. Instead of updating the full weight matrix W, we learn a low-rank decomposition: ΔW = B × A, where A is (r × d_in) and B is (d_out × r). For rank r=8 and dimension 4096, that’s 65K parameters instead of 16.7M — a 256× reduction.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """LoRA-enhanced linear layer."""

    def __init__(self, in_features, out_features, rank=8, alpha=16.0):
        super().__init__()
        self.scaling = alpha / rank
        # Original weight, frozen during fine-tuning
        self.weight = nn.Parameter(torch.zeros(out_features, in_features),
                                   requires_grad=False)
        # LoRA matrices: A is random-initialized, B is zero-initialized
        self.lora_A = nn.Parameter(torch.zeros(rank, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)

    def forward(self, x):
        # Original output + low-rank update: x W^T + (x A^T B^T) * scaling
        base = F.linear(x, self.weight)
        lora = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return base + lora
```
B is initialized to zeros, so at the start of training ΔW = 0 and the LoRA model behaves identically to the base model. During training, only A and B are updated — the original weights W stay frozen. The alpha/rank scaling factor controls how much the LoRA update contributes; a common setting is alpha = 2 × rank.
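Both properties are easy to verify numerically. A standalone check with plain tensors (toy batch, the article's rank-8, dim-4096 setting), rather than reusing the class above:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
d_in, d_out, rank, alpha = 4096, 4096, 8, 16.0
scaling = alpha / rank

W = torch.randn(d_out, d_in)                 # frozen base weight
A = torch.empty(rank, d_in)
nn.init.kaiming_uniform_(A, a=math.sqrt(5))  # random init, like lora_A
B = torch.zeros(d_out, rank)                 # zero init, like lora_B

x = torch.randn(2, d_in)
base = x @ W.T
with_lora = base + (x @ A.T @ B.T) * scaling
assert torch.equal(base, with_lora)          # ΔW = BA = 0 at init

# Parameter savings at rank 8, dim 4096:
print(rank * (d_in + d_out), 'LoRA params vs', d_in * d_out, 'full')
# 65536 LoRA params vs 16777216 full — a 256x reduction
```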
Using PEFT in Practice
HuggingFace’s peft library handles the plumbing. You define a LoraConfig, call get_peft_model, and train normally with the Trainer API. The library injects LoRA adapters into the specified target modules and freezes everything else.
```python
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # Rank
    lora_alpha=16,                        # Scaling factor
    lora_dropout=0.05,                    # Regularization
    target_modules=['c_attn', 'c_proj'],  # Which layers get LoRA
    bias='none',
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()
# trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.24%
```
For GPT-2 with rank 8, LoRA adds only 0.24% trainable parameters. An important detail: LoRA can use a much higher learning rate than full fine-tuning (3e-4 vs 1e-5) because the frozen base weights act as a strong regularizer. The notebook shows this working end-to-end.
QLoRA: Fine-Tuning 7B Models on Consumer GPUs
QLoRA (Dettmers et al., 2023) combines 4-bit quantization with LoRA. The base model is loaded in NF4 (NormalFloat 4-bit) format, cutting its weight memory from 14 GB to about 4 GB for a 7B model. LoRA adapters are kept in BF16 for full-precision gradient updates. By the same arithmetic as above, full fine-tuning of LLaMA 3 8B would need roughly 96 GB of VRAM; QLoRA brings it down to about 5 GB.
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, prepare_model_for_kbit_training

# QLoRA setup for LLaMA 3 8B
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',               # NormalFloat quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,          # Nested quantization
)

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B',
    quantization_config=bnb_config,
    device_map='auto',
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)   # lora_config must target LLaMA
                                             # module names (q_proj, v_proj, ...)
# trainable params: 83M || all params: 8B || trainable%: 1.04%
```
QLoRA introduced two innovations beyond basic quantization: NF4, a data type optimized for the roughly normal distribution of neural network weights, and double quantization, which quantizes the quantization constants themselves. Double quantization cuts the per-parameter overhead of those constants from about 0.5 bits to roughly 0.127 bits, with negligible quality loss.
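The constant overhead falls out of simple block-size arithmetic (block sizes 64 and 256, as reported in the QLoRA paper):

```python
# Extra bits per parameter spent on quantization constants.
block1 = 64    # weights per first-level quantization block
block2 = 256   # first-level scales per second-level block

# Single quantization: one FP32 scale per 64-weight block.
single_quant = 32 / block1

# Double quantization: 8-bit first-level scales, plus one FP32
# second-level scale per 256 first-level scales.
double_quant = 8 / block1 + 32 / (block1 * block2)

print(f'{single_quant:.3f} -> {double_quant:.3f} bits/param overhead')
# 0.500 -> 0.127 bits/param overhead
```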
Choosing LoRA Hyperparameters
Rank (r) — start with 8. Rank 4 works for simple tasks, rank 16–64 for complex ones. The notebook shows a rank sweep: quality improves sharply from r=1 to r=8, then plateaus. Going beyond r=32 rarely helps and can cause overfitting.
Target modules — at minimum, apply LoRA to the attention query and value projections (q_proj, v_proj). For more capacity, add all attention projections (k_proj, o_proj) and optionally the FFN layers (gate_proj, up_proj, down_proj). The notebook provides target module recommendations for GPT-2, LLaMA/Mistral, and BERT architectures.
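The projection names differ by architecture because each HuggingFace implementation names its submodules differently. A lookup table along these lines (module names as they appear in the respective HuggingFace model code; the exact recommendations in the notebook may differ):

```python
# Common LoRA target modules per architecture (HuggingFace module names).
TARGET_MODULES = {
    'gpt2': ['c_attn', 'c_proj'],                      # fused QKV + output proj
    'llama': ['q_proj', 'k_proj', 'v_proj', 'o_proj',  # attention projections
              'gate_proj', 'up_proj', 'down_proj'],    # FFN (optional, more capacity)
    'mistral': ['q_proj', 'k_proj', 'v_proj', 'o_proj',
                'gate_proj', 'up_proj', 'down_proj'],
    'bert': ['query', 'value'],                        # minimal q/v setup
}

print(TARGET_MODULES['llama'])
```

Note that GPT-2 fuses Q, K, and V into a single `c_attn` module, so targeting it adapts all three at once.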
Merging — after training, you can merge LoRA weights back into the base model with merge_and_unload(). This eliminates any inference overhead: the merged model is a standard transformer with no adapter logic. The notebook verifies that merged outputs match the LoRA model's to within floating-point tolerance.
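The merge is just arithmetic on the weights. A self-contained sketch of what merge_and_unload() does for a single layer, with toy dimensions:

```python
import torch

torch.manual_seed(0)
d_in, d_out, r, alpha = 16, 16, 4, 8.0
scaling = alpha / r

W = torch.randn(d_out, d_in)       # frozen base weight
A = torch.randn(r, d_in) * 0.01    # "trained" LoRA factors (random stand-ins)
B = torch.randn(d_out, r) * 0.01
x = torch.randn(5, d_in)

# Adapter path: base output + scaled low-rank update
y_adapter = x @ W.T + (x @ A.T @ B.T) * scaling

# Merged path: fold the update into W, then run a plain linear layer
W_merged = W + scaling * (B @ A)
y_merged = x @ W_merged.T

assert torch.allclose(y_adapter, y_merged, atol=1e-5)
```

In practice merge_and_unload() applies this fold to every adapted layer and strips the adapter modules, leaving a checkpoint that loads without PEFT installed.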
What to Do Next
The two notebooks cover the complete fine-tuning spectrum: full fine-tuning with HuggingFace Trainer, supervised instruction tuning, catastrophic forgetting measurement, LoRA from scratch, PEFT library usage, QLoRA configuration, rank sweeps, and weight merging.
Fine-Tuning notebook | LoRA & PEFT notebook — both run on a free Colab T4.
Next in this series: LLM Inference — decoding strategies, sampling parameters, and serving models efficiently with SGLang.
This post is part of TheAiSingularity’s LLM Engineering Course — 64 notebooks, 20 capstone projects, fully open source.
