Free tool · runs in your browser · nothing sent anywhere

Will this LLM fit on my GPU?

The question everyone asks before buying a card or renting one. Pick a model and your hardware. We'll tell you if it fits, which quant to run, and how much context you can keep.

Inputs: Model · Your GPU / VRAM · Quantization · Context length (8,192 tok)

Results: Model weights · KV cache (context) · Overhead · Total needed of available

Cheaper than guessing wrong. If it doesn't fit, or you want to test before buying a card, rent the exact GPU by the hour first. A couple of dollars beats an $800 mistake.
Rent on RunPod → · Rent on Vast.ai →

How this works

Three things use your VRAM: the model weights, the KV cache (which grows with context length), and a bit of overhead for the framework. We estimate weights from parameter count and quantization, size the KV cache from the model's architecture and your chosen context, and add a working-memory margin. Real usage varies by a few percent between llama.cpp, vLLM and others, so we keep a safety buffer rather than promising the last megabyte.
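
As a rough illustration of the three-part estimate described above, here is a minimal sketch in Python. It is not the calculator's actual code. The function and parameter names are hypothetical, the 1 GiB overhead default is an assumed placeholder rather than the tool's real margin, and the KV cache formula assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token.

```python
from dataclasses import dataclass

GIB = 1024 ** 3  # bytes in one GiB


@dataclass
class Estimate:
    weights_gib: float
    kv_cache_gib: float
    overhead_gib: float

    @property
    def total_gib(self) -> float:
        return self.weights_gib + self.kv_cache_gib + self.overhead_gib


def estimate_vram(
    params: float,            # parameter count, e.g. 8e9
    bits_per_weight: float,   # effective bits per weight after quantization
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    kv_bytes_per_value: int = 2,  # assumption: 16-bit KV cache
    overhead_gib: float = 1.0,    # assumption: flat framework margin
) -> Estimate:
    # Weights: one quantized value per parameter.
    weights_bytes = params * bits_per_weight / 8
    # KV cache: keys and values (the factor 2) for every layer,
    # KV head, head dimension, and token held in context.
    kv_bytes = (
        2 * n_layers * n_kv_heads * head_dim
        * context_tokens * kv_bytes_per_value
    )
    return Estimate(weights_bytes / GIB, kv_bytes / GIB, overhead_gib)


def fits(est: Estimate, vram_gib: float) -> bool:
    return est.total_gib <= vram_gib


# Illustrative inputs for an 8B-class model (32 layers, 8 KV heads,
# head dimension 128) at about 4.5 bits per weight and 8,192 tokens.
est = estimate_vram(8e9, 4.5, 32, 8, 128, 8192)
print(f"weights {est.weights_gib:.2f} GiB, "
      f"KV cache {est.kv_cache_gib:.2f} GiB, "
      f"total {est.total_gib:.2f} GiB, fits in 8 GiB: {fits(est, 8)}")
```

The KV cache term scales with context length, so doubling the context doubles that line while the weights stay fixed.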

Everything runs in your browser. Nothing you pick is sent anywhere.

Fits? Now run it.

If your model fits, the next step is a frontend to actually use it. Our tested picks are in the self-hosted ChatGPT alternatives guide. If it doesn't fit, you've got three options: a smaller quant, a smaller model, or renting a bigger card by the hour.
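
The "smaller quant" option, and the question of how much context is left over, can be sketched the same way. This is a hedged example under stated assumptions, not the tool's logic: the quant labels and their bits-per-weight values are illustrative, the overhead figure is a placeholder, and the helper names are hypothetical.

```python
GIB = 1024 ** 3  # bytes in one GiB

# Assumed candidates, ordered from highest to lowest quality.
QUANTS = [("16-bit", 16.0), ("8-bit", 8.5), ("4-bit", 4.5)]


def best_quant(params, vram_gib, kv_cache_gib, overhead_gib=1.0):
    """Return the first (highest-quality) quant whose weights fit, or None."""
    weight_budget = (vram_gib - kv_cache_gib - overhead_gib) * GIB
    for name, bits in QUANTS:
        if params * bits / 8 <= weight_budget:
            return name
    return None  # nothing fits: smaller model or bigger card


def max_context(params, bits_per_weight, vram_gib,
                n_layers, n_kv_heads, head_dim,
                kv_bytes_per_value=2, overhead_gib=1.0):
    """Tokens of context that fit in whatever VRAM the weights leave free."""
    free = (vram_gib - overhead_gib) * GIB - params * bits_per_weight / 8
    # Bytes of KV cache per token: keys and values for every layer and KV head.
    per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes_per_value
    return max(0, int(free // per_token))


# Illustrative 8B-class model with a 1 GiB KV cache reserved.
print(best_quant(8e9, vram_gib=8, kv_cache_gib=1.0))
print(max_context(8e9, 4.5, 8, n_layers=32, n_kv_heads=8, head_dim=128))
```

If `best_quant` returns `None`, the remaining options are the other two from the list above: a smaller model or a bigger card.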

Skip the setup

Once you know it fits, our SelfHost AI Stack kit gets the whole thing running in one command: chat UI, local models, and private web search, with a setup guide for people who've never touched Docker.

Get the kit — £29 →