
Memory Calculation of LLM on GPU


Figuring out how much GPU memory an LLM needs is a prerequisite before planning any training or inference setup. This post breaks down memory into its components, gives formulas for each, and shows how distributed strategies (DP, TP, PP, EP) split them across GPUs.

1. Six Components of GPU Memory

Training and inference use different memory components. Training uses all six; inference mainly uses model weights and KV Cache.

| Component | Symbol | Training | Inference |
|---|---|---|---|
| Model parameters | $M_{params}$ | Required | Required |
| Optimizer states | $M_{optim}$ | Required | None |
| Gradients | $M_{grad}$ | Required | None |
| Activations | $M_{act}$ | Required (can checkpoint) | Minimal |
| KV Cache | $M_{kv}$ | None | Required |
| Temporary buffers | $M_{buf}$ | Yes | Yes |

2. Base Variables

Let $P$ be the model parameter count (number of parameters) and $b$ the bytes per parameter.

| Symbol | Meaning | Example |
|---|---|---|
| $P$ | Parameter count | 7B = $7 \times 10^9$ |
| $b$ | Bytes per parameter | FP32 = 4, FP16/BF16 = 2, INT8 = 1, INT4 = 0.5 |
| $L$ | Transformer layers | 32 |
| $d$ | Hidden dimension | 4096 |
| $d_{ff}$ | FFN intermediate dimension | 11008 |
| $n_h$ | Attention head count | 32 |
| $n_{kv}$ | KV head count (GQA) | 8 |
| $V$ | Vocabulary size | 32000 |
| $N$ | Sequence length | 4096 |
| $B$ | Batch size | 32 |

3. Memory Formulas Per Component

3.1 Model Parameters

$$M_{params} = P \times b$$

| Model | Params | FP32 | FP16 | INT8 | INT4 |
|---|---|---|---|---|---|
| 7B | 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B | 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 70B | 70B | 280 GB | 140 GB | 70 GB | 35 GB |
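These per-precision numbers are easy to reproduce. A minimal sketch, using decimal gigabytes (1 GB = $10^9$ bytes) as in the table above:

```python
def weight_memory_gb(params: float, bytes_per_param: float) -> float:
    """M_params = P * b, reported in decimal gigabytes."""
    return params * bytes_per_param / 1e9

# Reproduce the 7B row of the table above.
for name, b in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: {weight_memory_gb(7e9, b):.1f} GB")
# FP32: 28.0 GB, FP16: 14.0 GB, INT8: 7.0 GB, INT4: 3.5 GB
```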

3.2 Optimizer States (Training)

For AdamW, each parameter needs:

  • First moment estimate $m$ (FP32)
  • Second moment estimate $v$ (FP32)
  • Master weight copy (FP32, needed for mixed-precision training)

$$M_{optim} = P \times (4 + 4 + 4) = 12P \quad \text{(mixed-precision AdamW)}$$

For pure FP32 training:

$$M_{optim} = P \times (4 + 4) = 8P \quad \text{(FP32 AdamW; weights already counted in } M_{params}\text{)}$$

Mixed-precision training overhead: a 7B model with mixed-precision AdamW requires $7\text{B} \times 12 = 84$ GB for optimizer states alone, already more than a single 80 GB A100/H100 holds. This is why even 7B training requires sharding or offloading.
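Both cases fit in one helper; a sketch assuming the byte counts above:

```python
def optimizer_memory_gb(params: float, mixed_precision: bool = True) -> float:
    """AdamW optimizer-state memory in decimal GB.

    Mixed precision: FP32 m + FP32 v + FP32 master weights = 12 bytes/param.
    Pure FP32:       FP32 m + FP32 v                       =  8 bytes/param
    (the FP32 weights themselves are already counted in M_params).
    """
    return params * (12 if mixed_precision else 8) / 1e9

print(optimizer_memory_gb(7e9))         # 84.0 -- the 7B example above
print(optimizer_memory_gb(7e9, False))  # 56.0
```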

3.3 Gradients (Training)

Gradients are stored per parameter, matching the training precision:

$$M_{grad} = P \times b_{grad}$$

with $b_{grad} = 2$ for FP16 training and $b_{grad} = 4$ for FP32 training.

3.4 Activations (Training)

Activations are intermediate results from the forward pass saved for backward. Rough estimate (without activation checkpointing):

$$M_{act} \approx 2 \times B \times N \times d \times L \times b_{act}$$

This is an approximation. The full formula, including the per-layer attention matrices ($O(B N^2 n_h)$), is:

$$M_{act} \approx L \times B \times N \times (34d + 5 n_h N) \times b_{act}$$

where $34d$ comes from intermediate activations in each layer (Q, K, V, FFN intermediates, etc.), and $5 n_h N$ comes from the attention matrices (saved both pre- and post-softmax). Fused-attention kernels such as FlashAttention never materialize these matrices, so in practice the simpler estimate is often the more realistic one.

Activation checkpointing significantly reduces $M_{act}$:

| Strategy | Activation Memory | Extra Compute |
|---|---|---|
| None | $M_{act}$ | 0 |
| Per-layer checkpoint | $M_{act} / L$ | ~33% |
| Full recomputation | $O(B \times N \times d)$ | ~100% |
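Both activation estimates can be scripted; a sketch with $b_{act} = 2$ (FP16). It also shows how the attention term in the fuller formula dominates at long sequence lengths:

```python
def activation_memory_gb(B, N, d, L, n_h=None, bytes_act=2):
    """Activation memory in decimal GB.

    Default: the rough estimate 2 * B * N * d * L * b_act.
    With n_h given: the fuller formula L * B * N * (34d + 5 * n_h * N) * b_act,
    which also counts the pre-/post-softmax attention matrices.
    """
    if n_h is None:
        total = 2 * B * N * d * L * bytes_act
    else:
        total = L * B * N * (34 * d + 5 * n_h * N) * bytes_act
    return total / 1e9

print(activation_memory_gb(32, 4096, 4096, 32))          # ~68.7 GB (rough)
print(activation_memory_gb(32, 4096, 4096, 32, n_h=32))  # far larger: attention matrices dominate
```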

3.5 KV Cache (Inference)

$$M_{kv} = 2 \times B \times N \times n_{kv} \times d_h \times L \times b_{kv}$$

where $d_h = d / n_h$ is the per-head dimension, and the leading 2 accounts for storing both K and V.

With GQA (Grouped-Query Attention), $n_{kv} < n_h$, shrinking the KV Cache by a factor of $n_h / n_{kv}$.
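A sketch of the formula; the first call reproduces the common rule of thumb for a Llama-7B-style model (MHA: $n_{kv} = n_h = 32$, $d_h = 128$, $L = 32$):

```python
def kv_cache_gb(B, N, n_kv, d_h, L, bytes_kv=2):
    """M_kv = 2 (K and V) * B * N * n_kv * d_h * L * b_kv, in decimal GB."""
    return 2 * B * N * n_kv * d_h * L * bytes_kv / 1e9

# Llama-7B-style MHA, batch 1, 4K context:
print(kv_cache_gb(1, 4096, 32, 128, 32))  # ~2.1 GB
# Same shape with GQA at n_kv = 8: a 4x reduction.
print(kv_cache_gb(1, 4096, 8, 128, 32))   # ~0.5 GB
```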

3.6 Temporary Buffers

NCCL communication buffers, CUDA workspace, etc. Typically about 1-2 GB, small relative to total but must be reserved.

4. Total Training Memory Estimate

$$M_{train} = M_{params} + M_{optim} + M_{grad} + M_{act} + M_{buf}$$

For mixed-precision AdamW training of a 7B model (B=32, N=4096):

| Component | Size |
|---|---|
| Model params (FP16) | 14 GB |
| Optimizer states (FP32 $m$ + $v$ + master) | 84 GB |
| Gradients (FP16) | 14 GB |
| Activations (rough estimate, no checkpointing) | ~69 GB |
| Buffers | ~2 GB |
| Total | ~183 GB |

A single 80GB A100/H100 can’t fit this. That’s why distributed strategies are needed even for 7B model training.
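Putting the pieces together, a sketch that reproduces the estimate above (rough activation formula, ~2 GB of buffers assumed):

```python
def training_memory_gb(P, B, N, d, L):
    """Mixed-precision AdamW training memory estimate, in decimal GB."""
    params = P * 2 / 1e9                  # FP16 weights
    optim  = P * 12 / 1e9                 # FP32 m + v + master copy
    grads  = P * 2 / 1e9                  # FP16 gradients
    acts   = 2 * B * N * d * L * 2 / 1e9  # rough estimate, no checkpointing
    buf    = 2.0                          # NCCL / CUDA workspace
    return params + optim + grads + acts + buf

print(training_memory_gb(7e9, 32, 4096, 4096, 32))  # ~183 GB
```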

5. Total Inference Memory Estimate

$$M_{infer} = M_{params} + M_{kv} + M_{buf}$$

Inference memory mainly depends on model precision and sequence length / batch size.

6. Distributed Strategies and Memory Splitting

6.1 Data Parallelism (DP)

| Component | Single GPU | DP ($N_{dp}$ GPUs) |
|---|---|---|
| Model params | $P \times b$ | $P \times b$ (full copy per GPU) |
| Optimizer states | $12P$ | $12P$ (full copy per GPU) |
| Gradients | $P \times b_{grad}$ | $P \times b_{grad}$ (full per GPU) |
| Activations | $M_{act}(B)$ | $M_{act}(B / N_{dp})$ (batch split) |

DP doesn’t reduce weight or optimizer memory, only activations (because batch is split).

ZeRO Optimization: DeepSpeed’s ZeRO shards optimizer states, gradients, and parameters across DP ranks:

| ZeRO Stage | Sharded Content | Memory Savings |
|---|---|---|
| Stage 1 | Optimizer states | ~4x |
| Stage 2 | + Gradients | ~8x |
| Stage 3 | + Parameters | ~$N_{dp}$x |
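A sketch of how per-rank memory for the static states shrinks across stages, using the mixed-precision AdamW byte counts from Section 3:

```python
def zero_static_gb(P, n_dp, stage):
    """Per-rank params + grads + optimizer states (decimal GB) under ZeRO.

    Mixed precision: 2-byte params, 2-byte grads, and 12 bytes/param
    of FP32 optimizer state (m, v, master weights).
    """
    params, grads, optim = P * 2.0, P * 2.0, P * 12.0
    if stage >= 1:
        optim /= n_dp   # Stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_dp   # Stage 2: also shard gradients
    if stage >= 3:
        params /= n_dp  # Stage 3: also shard parameters
    return (params + grads + optim) / 1e9

for s in range(4):
    print(f"stage {s}: {zero_static_gb(7e9, 8, s):.1f} GB/rank")
# stage 0: 112.0, stage 1: 38.5, stage 2: 26.2, stage 3: 14.0 (GB/rank)
```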

6.2 Tensor Parallelism (TP)

| Component | Single GPU | TP ($N_{tp}$ GPUs) |
|---|---|---|
| Model params | $P \times b$ | $P \times b / N_{tp}$ |
| Optimizer states | $12P$ | $12P / N_{tp}$ |
| Gradients | $P \times b_{grad}$ | $P \times b_{grad} / N_{tp}$ |
| Activations | $M_{act}$ | $\approx M_{act} / N_{tp}$ |

TP splits each layer’s parameter matrices along the hidden dimension; each GPU stores $1/N_{tp}$ of the parameters. Cost: two all-reduce communications per layer.

6.3 Pipeline Parallelism (PP)

| Component | Single GPU | PP ($N_{pp}$ GPUs) |
|---|---|---|
| Model params | $P \times b$ | $P \times b / N_{pp}$ |
| Optimizer states | $12P$ | $12P / N_{pp}$ |
| Activations | $M_{act}(L)$ | $M_{act}(L / N_{pp})$ + bubble |

PP splits the model by layers; each GPU holds $L / N_{pp}$ layers. Cost: pipeline bubbles that leave GPUs idle.

6.4 Expert Parallelism (EP)

For MoE models, EP distributes experts across GPUs:

$$M_{expert\_per\_card} = \frac{E \times P_{expert}}{N_{ep}} \times b$$

where $E$ is the total expert count and $P_{expert}$ the parameters per expert.

Shared parameters (attention, embedding, etc.) still reside on every GPU:

$$M_{moe\_card} = P_{shared} \times b + \frac{E \times P_{expert}}{N_{ep}} \times b$$
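A sketch of the per-GPU formula. The example counts (a hypothetical MoE with 1B shared parameters and 64 experts of 0.1B parameters each) are illustrative, not a real model:

```python
def moe_per_gpu_gb(p_shared, n_experts, p_expert, n_ep, bytes_per_param=2):
    """M = P_shared * b + (E * P_expert / N_ep) * b, in decimal GB."""
    return (p_shared + n_experts * p_expert / n_ep) * bytes_per_param / 1e9

# Hypothetical: 1B shared params, 64 experts x 0.1B params, FP16.
print(moe_per_gpu_gb(1e9, 64, 1e8, n_ep=1))  # 14.8 GB (all experts on one GPU)
print(moe_per_gpu_gb(1e9, 64, 1e8, n_ep=8))  # 3.6 GB  (experts spread over 8 GPUs)
```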

6.5 Combined Strategies

Production deployments typically combine multiple strategies. For example, 8 GPUs:

$$\text{Total GPUs} = N_{dp} \times N_{tp} \times N_{pp} \times N_{ep}$$

Per-GPU memory:

$$M_{card} = \frac{M_{params}}{N_{tp} \times N_{pp}} + \frac{M_{optim}}{N_{tp} \times N_{pp} \times N_{dp}^{ZeRO}} + M_{act}(B_{local}, L_{local}) + M_{buf}$$
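A sketch of the per-GPU formula. Like the equation above, it omits gradients (which shard like the parameters) and takes the local activation footprint as an input:

```python
def per_gpu_gb(m_params, m_optim, n_tp, n_pp, n_dp_zero, m_act_local, m_buf=2.0):
    """M_card = M_params/(N_tp*N_pp) + M_optim/(N_tp*N_pp*N_dp^ZeRO)
              + M_act(B_local, L_local) + M_buf, all in GB."""
    return (m_params / (n_tp * n_pp)
            + m_optim / (n_tp * n_pp * n_dp_zero)
            + m_act_local + m_buf)

# 7B mixed precision (14 GB FP16 weights, 84 GB optimizer states),
# TP=2 x PP=2 x ZeRO-sharded DP=2, assuming ~10 GB of local activations:
print(per_gpu_gb(14, 84, 2, 2, 2, m_act_local=10))  # 26.0 GB
```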

7. Concrete Example: Qwen-2.5-7B on H100

Model specs:

  • Parameters: 7.6B
  • Layers: 32, hidden dimension: 4096
  • FFN intermediate: 11008
  • Attention heads: 32, KV heads: 8 (GQA)
  • Vocabulary: 152000

Setup: 4x H100 80GB, FP16 inference, max context 32K

Single-GPU inference memory:

| Component | Calculation | Size |
|---|---|---|
| Model weights (FP16) | $7.6\text{B} \times 2$ | 15.2 GB |
| KV Cache (B=8, N=32K) | $2 \times 8 \times 32768 \times 8 \times 128 \times 32 \times 2$ | ~34.4 GB |
| Buffers | — | ~2 GB |
| Total | | ~52 GB |

Fits on a single H100 80GB. But larger batches or longer sequences need multi-GPU.
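The per-GPU numbers can be checked with a short script (assumed specs: 7.6B params, $L = 32$, $n_{kv} = 8$, $d_h = 4096/32 = 128$, FP16):

```python
def qwen_like_infer_gb(batch, seq_len, tp=1):
    """Per-GPU FP16 inference memory for Qwen-2.5-7B-like specs, in decimal GB.

    Both the weights and the KV cache shard across tensor parallelism.
    """
    weights = 7.6e9 * 2 / tp / 1e9
    kv = 2 * batch * seq_len * 8 * 128 * 32 * 2 / tp / 1e9
    return weights + kv + 2.0  # plus ~2 GB of buffers

print(qwen_like_infer_gb(8, 32768))         # ~51.6 GB: fits one H100 80GB
print(qwen_like_infer_gb(64, 32768, tp=4))  # ~74.5 GB/GPU under TP=4
```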

4-GPU optimization plans:

| Plan | TP | Weights/GPU | KV Cache/GPU | Total/GPU (example) | Available Batch |
|---|---|---|---|---|---|
| TP=1, 4 independent instances | 1 | 15.2 GB | as needed | ~50 GB | up to B≈14 per card at full 32K |
| TP=2, 2 instances | 2 | 7.6 GB | as needed | ~40 GB | B=32 per pair |
| TP=4, 1 instance | 4 | 3.8 GB | as needed | ~35 GB | B=64 |

TP tradeoffs: At TP=4, each GPU’s weight share is minimal (3.8 GB), leaving more room for KV Cache and thus larger batches. But TP communication overhead (2 all-reduces per layer) becomes noticeable at 4 GPUs, especially during decode (low compute, high communication ratio). The actual choice depends on whether you’re optimizing for maximum throughput or minimum latency.

8. Memory Optimization Flowchart

Memory optimization flowchart: decision tree for resolving OOM in training and inference scenarios

Summary

Useful rules of thumb:

  • FP16 inference: ~2 GB per 1B parameters
  • Mixed-precision training: ~18-20 GB per 1B parameters (including optimizer states and gradients)
  • KV Cache: Llama-7B at 4K context, batch=1 is about 2 GB; at 128K it’s about 64 GB
  • TP communication: between TP=2 and TP=4, communication doubles with diminishing returns
  • ZeRO Stage 2 is the default choice for training; Stage 3 only when the model doesn’t fit on a single GPU at all

Memory calculation isn’t an exact science — framework overhead, memory fragmentation, and CUDA context all consume extra space. But knowing these formulas gives you roughly 80% accuracy when planning hardware and configuration.