Free tool · runs in your browser · nothing sent anywhere
Will this LLM fit on my GPU?
The question everyone asks before buying a card or renting one. Pick a model and your hardware. We'll tell you if it fits, which quant to run, and how much context you can keep.
Rent on RunPod → · Rent on Vast.ai →
How this works
Three things use your VRAM: the model weights, the KV cache (which grows with context length), and a bit of overhead for the framework. We estimate weights from parameter count and quantization, size the KV cache from the model's architecture and your chosen context, and add a working-memory margin. Real usage varies by a few percent between llama.cpp, vLLM and others, so we keep a safety buffer rather than promising the last megabyte.
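For the curious, here is a minimal sketch of that arithmetic in Python. It is not the calculator's actual code. The function names, the 1 GiB framework overhead, the 10% safety buffer, the 16-bit KV cache, and the example model shape (roughly an 8B model with grouped-query attention at about 4.5 bits per weight) are all illustrative assumptions.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass
class ModelShape:
    """Architecture numbers needed for the estimate (hypothetical container)."""
    params: float      # total parameter count
    n_layers: int      # transformer layers
    n_kv_heads: int    # key/value heads (fewer than query heads under GQA)
    head_dim: int      # dimension of each attention head


def weight_bytes(params: float, bits_per_weight: float) -> float:
    # Weights: one quantized value per parameter.
    return params * bits_per_weight / 8


def kv_bytes_per_token(m: ModelShape, bytes_per_elem: int = 2) -> int:
    # One key and one value vector per layer per KV head, for each token.
    # bytes_per_elem=2 assumes a 16-bit cache (assumption).
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * bytes_per_elem


def estimate_bytes(m: ModelShape, bits_per_weight: float, context_len: int,
                   overhead_bytes: int = GIB) -> float:
    # Weights + KV cache for one sequence + assumed fixed framework overhead.
    return (weight_bytes(m.params, bits_per_weight)
            + kv_bytes_per_token(m) * context_len
            + overhead_bytes)


def fits(m: ModelShape, bits_per_weight: float, context_len: int,
         vram_gib: float, safety: float = 0.10) -> bool:
    # Hold back a safety buffer instead of budgeting the last megabyte.
    budget = vram_gib * GIB * (1 - safety)
    return estimate_bytes(m, bits_per_weight, context_len) <= budget


def max_context(m: ModelShape, bits_per_weight: float, vram_gib: float,
                safety: float = 0.10, overhead_bytes: int = GIB) -> int:
    # Whatever is left after weights and overhead goes to the KV cache.
    budget = vram_gib * GIB * (1 - safety)
    free = budget - weight_bytes(m.params, bits_per_weight) - overhead_bytes
    return max(0, int(free // kv_bytes_per_token(m)))


# Illustrative shape only: about 8B parameters, 32 layers, 8 KV heads of 128.
example = ModelShape(params=8e9, n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for vram in (6, 8, 12):
        ok = fits(example, 4.5, 8192, vram)
        print(f"{vram} GiB card, 8192 context: {'fits' if ok else 'does not fit'}, "
              f"max context about {max_context(example, 4.5, vram)} tokens")
```

With these assumed numbers the KV cache costs 128 KiB per token, so an 8,192-token context takes exactly 1 GiB on top of about 4.5 GB of weights. Lowering the bits per weight or the context length is what moves a model from "does not fit" to "fits".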
Everything runs in your browser. Nothing you pick is sent anywhere.
Fits? Now run it.
If your model fits, the next step is a frontend to actually use it. Our tested picks are in the self-hosted ChatGPT alternatives guide. If it doesn't fit, you've got three options: a smaller quant, a smaller model, or renting a bigger card by the hour.
Skip the setup
Once you know it fits, our SelfHost AI Stack kit gets the whole thing running in one command: chat UI, local models, and private web search, with a setup guide for people who've never touched Docker.
Get the kit — £29 →