GPU Fundamentals for LLM Engineers: CUDA, VRAM, and What Actually Matters
GPU architecture, CUDA basics, VRAM budgeting, and mixed precision training — the hardware fundamentals every LLM engineer needs to know.
GPU fundamentals for machine learning come down to one idea: GPUs process thousands of operations in parallel, and machine learning is almost entirely parallel math. Every forward pass through a neural network is a chain of matrix multiplications — exactly the workload GPUs were built to crush. In this post, you will learn how GPU architecture works, how to monitor your GPU with nvidia-smi, what CUDA cores and Tensor Cores actually do, how VRAM determines what models you can run, and how to execute your first GPU computation in PyTorch.
What You’ll Learn
- Why GPUs beat CPUs for ML workloads — the architecture difference in one diagram
- How to read `nvidia-smi` output and monitor GPU utilization, memory, and temperature
- What CUDA cores, Tensor Cores, and Streaming Multiprocessors do inside a GPU
- How VRAM and data types (float32, float16, int8) determine what models fit on your hardware
- A working PyTorch script that moves tensors to GPU and benchmarks CPU vs GPU speed
Prerequisites: Basic Python knowledge. No GPU experience needed — this is the starting point.
Notebook: Open in Colab — runs on a free T4 GPU.
Time: ~20 min read, ~10 min to run the notebook.
Table of Contents
- CPU vs GPU: Why Architecture Matters
- How to Read nvidia-smi Output
- CUDA Cores, Tensor Cores, and Streaming Multiprocessors
- VRAM and GPU Memory Hierarchy
- GPU Selection Guide for Machine Learning
- Running Your First GPU Computation in PyTorch
- Common Mistakes and Troubleshooting
- Frequently Asked Questions
1. CPU vs GPU: Why Architecture Matters for Machine Learning
A CPU (Central Processing Unit) has 8 to 64 powerful cores, each optimized for fast sequential execution. It handles branching, complex logic, and single-threaded tasks extremely well. A GPU (Graphics Processing Unit) takes the opposite approach: thousands of smaller cores that each do less, but all work simultaneously.
Machine learning is dominated by matrix multiplications. Training a single layer of a neural network means multiplying a batch of inputs by a weight matrix — an operation that decomposes into thousands of independent multiply-add operations. A CPU processes these one (or a few) at a time. A GPU processes thousands at once.
```python
# A simple matrix multiply: 1024x1024 @ 1024x1024
# CPU does this sequentially across ~8 cores
# GPU does this across thousands of CUDA cores in parallel
import torch

A = torch.randn(1024, 1024)
B = torch.randn(1024, 1024)
C = A @ B  # ~1 billion multiply-adds (~2 billion floating-point operations)
```
This single operation involves over 2 billion floating-point multiplications and additions. On a modern CPU, it takes roughly 50-100ms. On even a free-tier T4 GPU, it completes in under 1ms. That 50-100x speedup is why every serious ML workflow uses GPUs.
The key insight: CPUs are latency-optimized (finish one task fast), while GPUs are throughput-optimized (finish many tasks at once). Neural networks need throughput, not latency. This is why GPU fundamentals for machine learning matter from day one.
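To see the latency-vs-throughput difference yourself, here is a minimal benchmark sketch. The `time_matmul` helper is illustrative, not part of the notebook. Note the `torch.cuda.synchronize()` calls: GPU kernel launches are asynchronous, so timing without synchronizing measures launch overhead, not actual compute.

```python
import time
import torch

def time_matmul(device: str, n: int = 1024, iters: int = 10) -> float:
    """Average seconds per n x n matrix multiply on the given device."""
    A = torch.randn(n, n, device=device)
    B = torch.randn(n, n, device=device)
    A @ B  # warm-up run (kernel caching, lazy initialization)
    if device == "cuda":
        torch.cuda.synchronize()  # wait for pending GPU work before timing
    start = time.perf_counter()
    for _ in range(iters):
        A @ B
    if device == "cuda":
        torch.cuda.synchronize()  # ensure all matmuls actually finished
    return (time.perf_counter() - start) / iters

cpu_t = time_matmul("cpu")
print(f"CPU: {cpu_t * 1e3:.2f} ms per matmul")
if torch.cuda.is_available():
    gpu_t = time_matmul("cuda")
    print(f"GPU: {gpu_t * 1e3:.2f} ms per matmul ({cpu_t / gpu_t:.0f}x faster)")
```

Exact numbers vary with hardware and matrix size, but on a Colab T4 the GPU column should land well under a millisecond.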
2. How to Read nvidia-smi Output
nvidia-smi (NVIDIA System Management Interface) is the first tool you should learn. It shows everything happening on your GPU in real time — utilization, memory usage, temperature, and running processes.
```bash
# Run this in any terminal or Colab cell
!nvidia-smi
```
Here is what each field in the output means:
- GPU-Util: Percentage of time the GPU cores are actively computing. During training, you want this at 90-100%. If it is low, your data pipeline is likely the bottleneck.
- Memory-Usage: How much VRAM is consumed vs. total available. Stay below 95% to avoid out-of-memory (OOM) crashes.
- Temp: GPU temperature in Celsius. Most GPUs throttle performance above ~85°C. Datacenter GPUs like the A100 stay cool with active cooling.
- Power: Current wattage. Useful for estimating electricity costs during long training runs.
For scripted monitoring, query specific fields in CSV format:
```bash
# Clean, parseable GPU stats
!nvidia-smi --query-gpu=name,memory.total,memory.used,utilization.gpu,temperature.gpu \
    --format=csv,noheader,nounits

# Example output on a Colab T4:
# Tesla T4, 15360, 512, 0, 38
```
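If you want to consume that CSV output from Python (for logging during a training run, say), a small parser is enough. `parse_gpu_stats` is a hypothetical helper name, and it assumes the exact field order queried above:

```python
def parse_gpu_stats(csv_line: str) -> dict:
    """Parse one line of `nvidia-smi --format=csv,noheader,nounits` output.
    Field order must match the --query-gpu list:
    name, memory.total, memory.used, utilization.gpu, temperature.gpu."""
    name, total_mb, used_mb, util, temp = [f.strip() for f in csv_line.split(",")]
    return {
        "name": name,
        "memory_total_mb": int(total_mb),
        "memory_used_mb": int(used_mb),
        "utilization_pct": int(util),
        "temperature_c": int(temp),
    }

# The sample line is the Colab T4 example output from above
stats = parse_gpu_stats("Tesla T4, 15360, 512, 0, 38")
print(stats["name"], f"{stats['memory_used_mb']}/{stats['memory_total_mb']} MB")
```

In a real script you would feed it lines from `subprocess.run(["nvidia-smi", ...])` instead of a literal string.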
You can also monitor GPU stats from within Python using PyTorch’s CUDA API. This is especially useful for tracking memory during training loops:
```python
import torch

if torch.cuda.is_available():
    # Memory tracking
    allocated = torch.cuda.memory_allocated() / 1e6   # MB
    reserved = torch.cuda.memory_reserved() / 1e6     # MB
    total = torch.cuda.get_device_properties(0).total_memory / 1e6
    print(f"Allocated: {allocated:.1f} MB")
    print(f"Reserved:  {reserved:.1f} MB")
    print(f"Total:     {total:.0f} MB")
```
Allocated is the memory actively used by tensors. Reserved is memory PyTorch has claimed from the GPU (its memory pool). The difference is PyTorch’s cache — available for future allocations without asking the GPU driver.
3. CUDA Cores, Tensor Cores, and Streaming Multiprocessors
Inside every NVIDIA GPU, the compute hardware is organized into a hierarchy. Understanding it helps you reason about why certain operations are fast and others are slow.
CUDA Cores
CUDA cores are the basic processing units. Each one can perform a single floating-point multiply-add per clock cycle. A T4 GPU has 2,560 CUDA cores. An A100 has 6,912. An H100 has 16,896. More CUDA cores means more operations per second for general-purpose parallel compute.
Tensor Cores
Tensor Cores are specialized matrix-multiply units introduced in NVIDIA’s Volta architecture (2017). Instead of one multiply-add per cycle, a single Tensor Core performs a 4×4 matrix multiply-accumulate in one operation. This gives a massive throughput boost for the exact workload ML needs: matrix math at reduced precision (float16, bfloat16, int8).
When you enable mixed-precision training in PyTorch (using torch.cuda.amp), you are routing operations through Tensor Cores. This is why mixed precision can double training speed on Tensor Core-equipped GPUs. We cover this in detail in the CUDA and Parallelism post.
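As a rough sketch of that pattern — the standard `autocast` plus `GradScaler` loop. This is illustrative rather than a full training script; autocast is enabled only when CUDA is available, so the snippet also runs (unaccelerated) on CPU:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # route matmuls through Tensor Cores on GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16

model = torch.nn.Linear(256, 256).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)  # pass-through when disabled

x = torch.randn(32, 256, device=device)
target = torch.randn(32, 256, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
    # Forward pass runs eligible ops (matmuls) in reduced precision
    loss = torch.nn.functional.mse_loss(model(x), target)
scaler.scale(loss).backward()  # scale loss to avoid float16 gradient underflow
scaler.step(optimizer)
scaler.update()
print(f"loss: {loss.item():.4f}")
```

The scaler exists because float16 gradients can underflow to zero; with bfloat16 (whose exponent range matches float32) the scaler is typically unnecessary.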
Streaming Multiprocessors (SMs)
Streaming Multiprocessors group CUDA cores and Tensor Cores together with shared memory and schedulers. Think of an SM as a mini-processor. An A100 has 108 SMs, each containing 64 CUDA cores and 4 Tensor Cores. A warp — 32 threads executing in lockstep — is the basic scheduling unit within an SM.
```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Streaming Multiprocessors: {props.multi_processor_count}")
    print(f"Compute Capability: {props.major}.{props.minor}")
    print(f"Total VRAM: {props.total_memory / 1e9:.1f} GB")
```
Output (Colab T4): GPU: Tesla T4 | SMs: 40 | Compute Capability: 7.5 | VRAM: 15.8 GB
4. VRAM and GPU Memory Hierarchy
VRAM (Video RAM) is the GPU’s main memory. It determines the largest model you can load and the biggest batch size you can train with. Unlike system RAM (64-256 GB on a typical workstation), VRAM is limited: 16 GB on a T4, 40-80 GB on an A100, and 80 GB on an H100.
The memory hierarchy from fastest to slowest:
- Registers: Per-thread, ~256 KB per SM. Fastest, but tiny.
- Shared Memory / L1 Cache: Per-SM, ~164 KB on A100. Used for data shared between threads in the same block.
- L2 Cache: Shared across all SMs, ~40 MB on A100. Caches frequently accessed VRAM data.
- HBM (High Bandwidth Memory): The main VRAM. 80 GB on an A100 with ~2 TB/s bandwidth. This is what `nvidia-smi` reports.
During training, VRAM holds four things: model parameters, optimizer states (Adam stores 2 extra copies of every parameter), gradients, and activations (intermediate results saved for the backward pass). For a 7B parameter model in float32, that is already 28 GB just for the weights — before optimizer states or activations.
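That accounting can be turned into a rule-of-thumb estimator. `training_memory_gb` is an illustrative helper, assuming Adam keeps its two extra states in float32 and ignoring activations, which depend on batch size and sequence length:

```python
def training_memory_gb(n_params: int, bytes_per_param: int = 4) -> dict:
    """Rough VRAM estimate for training: weights + gradients + Adam's
    two extra copies (momentum and variance, kept in float32).
    Activations are excluded — they scale with batch size and sequence
    length, so there is no parameter-count-only formula for them."""
    weights = n_params * bytes_per_param
    grads = n_params * bytes_per_param
    optim = n_params * 2 * 4  # two Adam states, 4 bytes each
    return {
        "weights_gb": weights / 1e9,
        "gradients_gb": grads / 1e9,
        "optimizer_gb": optim / 1e9,
        "total_gb": (weights + grads + optim) / 1e9,
    }

est = training_memory_gb(7_000_000_000)  # 7B model in float32
print(f"weights {est['weights_gb']:.0f} GB, total {est['total_gb']:.0f} GB (before activations)")
# weights 28 GB, total 112 GB (before activations)
```

The 28 GB weights figure matches the float32 number above; the total makes it clear why full fine-tuning of 7B models does not fit on a single consumer GPU.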
Data Types Control Memory Usage
The data type (dtype) of your tensors directly controls VRAM consumption. Switching from float32 to float16 or bfloat16 cuts memory in half. Quantization to int8 cuts it by 4x.
```python
import torch

n_params = 7_000_000_000  # 7 billion parameters

for dtype, name in [(torch.float32, "float32"),
                    (torch.bfloat16, "bfloat16"),
                    (torch.int8, "int8")]:
    bytes_per_param = torch.tensor(0, dtype=dtype).element_size()
    gb = n_params * bytes_per_param / 1e9
    print(f"{name:10s}: {bytes_per_param} bytes/param = {gb:.1f} GB")

# float32   : 4 bytes/param = 28.0 GB
# bfloat16  : 2 bytes/param = 14.0 GB
# int8      : 1 bytes/param = 7.0 GB
```
bfloat16 (Brain Floating Point) is the preferred dtype for LLM training. It keeps the same exponent range as float32 (so it handles large and small values) while using half the memory. Google developed it specifically for deep learning workloads. We cover VRAM optimization strategies in a later post.
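You can see the range-versus-precision tradeoff directly with `torch.finfo`: bfloat16 keeps float32's ~3.4e38 range but takes much coarser steps, while float16 overflows above 65,504 — the reason float16 training needs loss scaling and bfloat16 usually does not:

```python
import torch

# Compare numeric range (max) and precision (eps = step size near 1.0)
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f"{str(dtype):15s} max={info.max:.3e}  eps={info.eps:.3e}")
```

bfloat16 is literally float32 with the bottom 16 mantissa bits dropped: same 8-bit exponent, far fewer significant digits.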
5. GPU Selection Guide for Machine Learning
Choosing a GPU depends on your workload. For learning and experimentation, a free Colab T4 is sufficient. For fine-tuning 7B+ models, you need at least 24 GB of VRAM. For pretraining, you need multi-GPU clusters.
| GPU | VRAM | Tensor Cores | Best For |
|---|---|---|---|
| T4 | 16 GB | 320 | Learning, inference, small fine-tunes |
| RTX 4090 | 24 GB | 512 | Local fine-tuning, hobbyist training |
| A100 | 40/80 GB | 432 | Production training, large fine-tunes |
| H100 | 80 GB | 528 | Pretraining, multi-GPU clusters |
Bottom line: Start with the free Colab T4. It has 16 GB of VRAM, Tensor Cores, and handles everything in this course. When you outgrow it, a cloud A100 instance (around $1-2/hour) is the best next step. Only buy a consumer GPU (RTX 4090) if you will use it daily — the 24 GB VRAM ceiling limits you to models under ~13B parameters. For current specs and pricing, see the NVIDIA CUDA GPUs page.
6. Running Your First GPU Computation in PyTorch
PyTorch makes GPU programming straightforward. The core concept: tensors live on a device (either cpu or cuda), and tensors must be on the same device to interact. Here is the standard pattern every PyTorch program uses:
```python
import torch

# 1. Pick the best available device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using: {device}")

# 2. Create tensors directly on GPU
x = torch.randn(1000, 1000, device=device)
y = torch.randn(1000, 1000, device=device)

# 3. All operations happen on GPU automatically
z = x @ y  # Matrix multiply — runs on GPU
print(f"Result shape: {z.shape}, device: {z.device}")
```
Output: Using: cuda | Result shape: torch.Size([1000, 1000]), device: cuda:0
To move existing tensors between CPU and GPU, use .to(device):
```python
# Move tensors between devices
cpu_tensor = torch.randn(3, 3)
gpu_tensor = cpu_tensor.to(device)   # CPU → GPU
back_to_cpu = gpu_tensor.cpu()       # GPU → CPU
numpy_array = back_to_cpu.numpy()    # Must be on CPU for NumPy

# Move an entire model to GPU
model = torch.nn.Linear(768, 768).to(device)
```
The same pattern applies to models. Call model.to(device) once after creating it, and all parameters move to the GPU. Every tensor you feed into the model must also be on the same device, or PyTorch raises a RuntimeError.
The full notebook includes a GPU vs CPU speed benchmark and a simulated training loop with memory monitoring. Open it in Colab to run everything hands-on.
7. Common Mistakes and Troubleshooting
| Mistake | What Happens | Fix |
|---|---|---|
| Mixing CPU and GPU tensors | RuntimeError: Expected all tensors to be on the same device | Use .to(device) on all inputs before passing to the model |
| Loading a model larger than VRAM | CUDA out of memory | Use a smaller model, reduce batch size, or switch to bfloat16/int8 |
| Forgetting torch.cuda.empty_cache() | VRAM stays high after deleting tensors | Call del tensor then torch.cuda.empty_cache() |
| Calling .numpy() on a GPU tensor | TypeError: can't convert cuda:0 tensor to numpy | Move to CPU first: tensor.cpu().numpy() |
| Not checking GPU availability | Code crashes on CPU-only machines | Always use the device = torch.device("cuda" if ...) pattern |
About the Author
TheAiSingularity — {{AUTHOR_ONE_LINE_CREDENTIALS}}. Building LLM Engineering: From Beginner to Advanced — a free, open-source course covering everything from attention math to production deployment. Code at github.com/TheAiSingularity.
Key Takeaways
- GPUs beat CPUs for ML because neural networks are parallel matrix math — exactly what thousands of CUDA cores are designed for
- nvidia-smi is your GPU dashboard — monitor utilization, VRAM, temperature, and power during every training run
- Tensor Cores accelerate matrix operations at reduced precision, making mixed-precision training up to 2x faster
- VRAM is the bottleneck — a 7B model needs 28 GB in float32 but only 14 GB in bfloat16 and 7 GB in int8
- Start free — a Colab T4 (16 GB VRAM) handles everything you need while learning
Run it: Open in Colab
Next: CUDA and Parallelism: How GPU Programs Execute →
Frequently Asked Questions
Do I need a GPU for machine learning?
For learning the basics, no — most introductory code runs fine on a CPU, just slower. For training or fine-tuning models with more than a few million parameters, a GPU is practically required. Google Colab provides a free T4 GPU that handles most educational workloads.
What GPU should I buy for machine learning?
For most people, buying a GPU is unnecessary. Cloud GPUs (Colab, Lambda Labs, RunPod) are more cost-effective unless you train models daily. If you do buy, the RTX 4090 (24 GB VRAM) offers the best consumer price-to-performance. The 24 GB ceiling lets you fine-tune models up to about 13B parameters with quantization.
What is the difference between CUDA cores and Tensor Cores?
CUDA cores are general-purpose processors that handle one floating-point operation per cycle. Tensor Cores are specialized hardware that perform entire 4×4 matrix multiplications in a single operation, delivering much higher throughput for the matrix math that dominates ML workloads. Tensor Cores require reduced-precision formats (float16, bfloat16) to activate.
How much VRAM do I need to train a large language model?
Multiply the parameter count by the bytes-per-parameter for your dtype, then multiply by 4x for training (parameters + gradients + optimizer states + activations). A 7B model in bfloat16 needs roughly 14 GB for weights alone and 56+ GB total for training. Techniques like LoRA, gradient checkpointing, and quantization reduce these requirements significantly.
Can I use AMD GPUs for machine learning?
Yes, through AMD’s ROCm platform, which provides a CUDA-compatible API. PyTorch supports ROCm officially. However, the ecosystem is less mature — many libraries, tutorials, and cloud providers assume NVIDIA CUDA. For beginners, sticking with NVIDIA GPUs (or Colab’s free T4) avoids compatibility friction.
LLM Engineering Series — Free, open-source course from scratch to production.
→ Next: CUDA and Parallelism: How GPU Programs Execute
Full curriculum | Star on GitHub
