How to size local LLM hardware

A practical guide to the hardware signals in model release pages: parameters, active parameters, quantization, context windows, and memory overhead.

Quick answer

Start with model size, then add headroom. A rough 4-bit estimate is about 0.55GB per billion parameters for weights only; FP16 is about 2.05GB per billion.
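As a quick illustration of that rule of thumb, here is a minimal Python sketch. The function name `weight_memory_gb` and the precision keys are made up for this example; only the 0.55 and 2.05 multipliers come from the text above.

```python
# Rough weight-only memory estimate using the per-billion figures quoted above.
# Weights only: no KV cache, runtime overhead, or OS headroom is included.
GB_PER_BILLION = {"4bit": 0.55, "fp16": 2.05}


def weight_memory_gb(params_billion: float, precision: str = "4bit") -> float:
    """Return the approximate weight memory in GB for a given parameter count."""
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    return params_billion * GB_PER_BILLION[precision]


if __name__ == "__main__":
    for size in (8, 32, 70):
        four_bit = weight_memory_gb(size, "4bit")
        fp16 = weight_memory_gb(size, "fp16")
        print(f"{size}B: 4-bit ~{four_bit:.1f} GB, FP16 ~{fp16:.1f} GB")
```

Treat the result as a floor, not a budget: the headroom still has to be added on top.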

If you want a simple buying rule: 24GB VRAM is a strong local experimentation tier; 48GB+ is where larger open models become much more comfortable.

Hardware tiers

| Tier | Practical range | What to watch |
| --- | --- | --- |
| 8GB VRAM | Small 3B-8B quantized models | Good for experiments, chat, and light coding. Keep context modest. |
| 12GB VRAM | Many 7B-12B quantized models | More comfortable for longer prompts and better quantization choices. |
| 16GB VRAM | 7B-14B quantized models, some 20B-class with tradeoffs | A useful baseline for local development. |
| 24GB VRAM | 14B-32B quantized models | Strong single-GPU tier for quality local testing. |
| 48GB VRAM | 32B-70B quantized models | Better for larger coding/reasoning models and longer contexts. |
| 80GB VRAM | Large dense models or bigger MoE checkpoints | Serious workstation/server tier; still watch KV cache. |
| Mac unified memory | Depends on free unified memory and backend support | Do not map 1:1 to GPU VRAM; leave headroom for the OS. |
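The VRAM tiers above can be turned into a small lookup. This is a sketch only: the `TIERS` list and `practical_range` function are hypothetical names, the Mac unified memory row is left out because it does not map to a fixed VRAM figure, and rounding an in-between card (say 20GB) down to the nearest listed tier is an assumption of this example rather than something the table states.

```python
# (minimum VRAM in GB, practical range) pairs copied from the tiers table.
TIERS = [
    (8, "Small 3B-8B quantized models"),
    (12, "Many 7B-12B quantized models"),
    (16, "7B-14B quantized models, some 20B-class with tradeoffs"),
    (24, "14B-32B quantized models"),
    (48, "32B-70B quantized models"),
    (80, "Large dense models or bigger MoE checkpoints"),
]


def practical_range(vram_gb: float) -> str:
    """Return the practical range of the largest listed tier that vram_gb reaches."""
    match = None
    for tier_gb, description in TIERS:
        if vram_gb >= tier_gb:
            match = description  # keep the largest tier reached so far
    if match is None:
        return "Below the smallest tier listed"
    return match


if __name__ == "__main__":
    for vram in (6, 12, 20, 24, 96):
        print(f"{vram}GB -> {practical_range(vram)}")
```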

Parameters, active parameters, and context

Parameter count is the fastest rough proxy for memory. A dense 32B model usually needs more memory than a dense 8B model. MoE models add a wrinkle: total parameters describe the full checkpoint, while active parameters describe the portion used per token.

Context window is the other major signal. Longer contexts increase KV-cache memory and latency. A model can fit at short context and become impractical at very long context.
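To see why context length matters so much, a common back-of-the-envelope formula for the KV cache is 2 (keys and values) x layers x KV heads x head dimension x bytes per value x tokens. The sketch below applies it to a hypothetical model shape (32 layers, 8 KV heads, head dimension 128, FP16 cache); these numbers are illustrative and do not describe any model in this guide. It also assumes standard attention with a full-length, unquantized cache.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # 2 bytes per value for an FP16 cache
    batch_size: int = 1,
) -> int:
    """Approximate KV-cache size in bytes for a full-length, unquantized cache."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return (
        2 * n_layers * n_kv_heads * head_dim
        * bytes_per_value * context_tokens * batch_size
    )


if __name__ == "__main__":
    # Hypothetical model shape, chosen only for illustration.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    for ctx in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(context_tokens=ctx, **shape) / 2**30
        print(f"{ctx:>7} tokens: ~{gib:.0f} GiB of KV cache")
```

For this assumed shape the cache grows from about 1 GiB at 8K tokens to about 16 GiB at 128K tokens, which is how a model that fits at short context can stop fitting at long context.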

Catalog examples with rough memory

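| Model | Params | Context | 4-bit rough | FP16 rough | Released |
| --- | --- | --- | --- | --- | --- |
| LFM2 1.2B | 1.17B | 33K | 1GB | 3GB | Nov 28, 2025 |
| SmolLM2 1.7B | 1.7B | | 1GB | 4GB | Nov 4, 2024 |
| Stable LM 2 1.6B | 1.6B | | 1GB | 4GB | Jan 19, 2024 |
| TinyLlama 1.1B Chat | 1.1B | | 1GB | 3GB | Jan 1, 2024 |
| Phi-1 | 1.3B | | 1GB | 3GB | Jun 21, 2023 |
| GPT-2 | 1.5B | 1K | 1GB | 4GB | Nov 5, 2019 |
| BERT | 0.34B | 512 | 1GB | 1GB | Oct 11, 2018 |
| SmolLM3 3B | 3B | 128K | 2GB | 7GB | Jul 8, 2025 |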

These estimates are weight-only. Real deployments need extra memory for runtime overhead, KV cache, batching, and the operating system.
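The catalog figures appear consistent with multiplying the parameter count by the per-billion estimates (0.55GB for 4-bit, 2.05GB for FP16) and rounding up to a whole GB. The sketch below reproduces that and adds a simple fit check. The function names and the 20% `overhead_fraction` default are assumptions for illustration, not measured values.

```python
import math

GB_PER_BILLION = {"4bit": 0.55, "fp16": 2.05}


def table_estimate_gb(params_billion: float, precision: str) -> int:
    """Weight-only estimate rounded up to a whole GB, as the catalog rows appear to be."""
    return math.ceil(params_billion * GB_PER_BILLION[precision])


def fits(
    params_billion: float,
    precision: str,
    vram_gb: float,
    kv_cache_gb: float = 0.0,
    overhead_fraction: float = 0.2,  # assumed headroom, not a measured figure
) -> bool:
    """Return True if weights plus assumed overhead and KV cache fit in vram_gb."""
    weights = params_billion * GB_PER_BILLION[precision]
    needed = weights * (1 + overhead_fraction) + kv_cache_gb
    return needed <= vram_gb


if __name__ == "__main__":
    print(table_estimate_gb(3, "4bit"), table_estimate_gb(3, "fp16"))
    print(fits(32, "4bit", vram_gb=24, kv_cache_gb=2))
    print(fits(70, "4bit", vram_gb=24))
```

Tune `overhead_fraction` and `kv_cache_gb` to your own runtime and context length; the defaults here are placeholders.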

Frequently asked questions

How much VRAM do I need to run a local LLM?

It depends on model size, quantization, context length, and serving overhead. As a rough starting point, 8GB suits small 3B-8B quantized models, 16GB opens up many 7B-14B models, 24GB is a strong hobbyist tier, and 48GB+ is better for larger models.

Do active parameters matter for MoE models?

Yes, for compute per token, but total parameters still matter because the full checkpoint has to be stored and loaded. Treat active parameters as a throughput clue, not a complete memory estimate.
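A short sketch makes the split concrete. The 30B-total, 3B-active model below is hypothetical, and `moe_summary` is a name invented for this example. Memory is sized from total parameters using the 0.55GB-per-billion 4-bit estimate, while the active share is reported only as a hint about per-token compute.

```python
GB_PER_BILLION_4BIT = 0.55  # weight-only 4-bit estimate used in this guide


def moe_summary(total_params_billion: float, active_params_billion: float) -> dict:
    """Size memory from total parameters; report the active share as a compute hint."""
    if not 0 < active_params_billion <= total_params_billion:
        raise ValueError("active parameters must be positive and at most the total")
    return {
        # The whole checkpoint must be stored and loaded, so use the total here.
        "weights_gb_4bit": total_params_billion * GB_PER_BILLION_4BIT,
        # Share of parameters used per token; a throughput clue only.
        "active_fraction": active_params_billion / total_params_billion,
    }


if __name__ == "__main__":
    # Hypothetical MoE checkpoint: 30B total parameters, 3B active per token.
    print(moe_summary(30, 3))
```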

Why can a model fit but still run poorly?

Weight memory is only part of the requirement. Long contexts, KV cache, batching, CPU offload, runtime overhead, and slow memory bandwidth can make a technically fitting model impractical.
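One way to reason about the bandwidth part: during single-stream generation a dense model typically has to read roughly all of its weights for each new token, so memory bandwidth divided by weight size gives a loose upper bound on tokens per second. The sketch below encodes that rule of thumb. The bandwidth figures are illustrative placeholders rather than the specs of any particular device, and real throughput is usually lower.

```python
def decode_tokens_per_second_upper_bound(
    weights_gb: float, bandwidth_gb_per_s: float
) -> float:
    """Loose upper bound: each generated token reads about all weights once."""
    if weights_gb <= 0 or bandwidth_gb_per_s <= 0:
        raise ValueError("inputs must be positive")
    return bandwidth_gb_per_s / weights_gb


if __name__ == "__main__":
    weights_gb = 32 * 0.55  # dense 32B model at the 4-bit estimate
    # Placeholder bandwidths in GB/s: a fast GPU-class and a slow CPU-class value.
    for label, bandwidth in (("fast memory", 1000.0), ("slow memory", 100.0)):
        rate = decode_tokens_per_second_upper_bound(weights_gb, bandwidth)
        print(f"{label}: at most ~{rate:.0f} tokens/s")
```

The same weights that fit in both cases can differ by an order of magnitude in speed, which is why offloading to slower memory often makes a fitting model impractical.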

Where to go next

Use the local LLM shortlist for model-by-model rough estimates, browse the broader local LLM catalog, or compare coding-focused downloadable models in open coding models.