Run LLMs Locally
Ollama, GPT4All, MLC-LLM, Jan, Open WebUI, Text Generation WebUI, TGI — every flavor of "no API key, my GPU."
What's in this pack
| # | Runner | Sweet spot | Backend |
|---|---|---|---|
| 1 | Ollama | one-line CLI on Mac/Linux/Windows | llama.cpp |
| 2 | GPT4All | desktop app, no GPU required | llama.cpp + GGUF |
| 3 | MLC-LLM | iOS, Android, WebGPU | TVM compiler |
| 4 | Jan | desktop replacement for ChatGPT | llama.cpp + remote APIs |
| 5 | Open WebUI | ChatGPT-style UI for any OpenAI-compatible runner | reverse-proxies Ollama/vLLM/TGI |
| 6 | Text Generation WebUI | research-grade UI with LoRA training | transformers + ExLlama + llama.cpp |
| 7 | Hugging Face TGI | production serving with continuous batching | Rust + Python, multi-GPU |
These seven runners cover the full spectrum: from "I want a chat window on my laptop" to "I'm putting Llama 3 behind a load balancer for 10k QPS."
Why local matters in 2026
Three forces have collapsed the cost gap between cloud APIs and self-hosted inference.
First, model quality. Open weights from Meta (Llama), Mistral, Qwen, and DeepSeek now match GPT-4-class capability on most reasoning and coding tasks. Skipping the OpenAI bill no longer means accepting worse output.
Second, hardware. A single RTX 4090 runs Llama 3 70B at usable speed via llama.cpp's GGUF Q4 quantization with partial CPU offload. Apple Silicon's unified memory lets an M3 Max run 70B locally without thermal throttling. Even mid-range gaming laptops handle 8B models in real time.
Third, privacy and compliance. Healthcare, legal, finance, and EU GDPR-bound shops can't send PII to a third-party API, which often makes local inference the only compliant path. The same goes for coding agents: most enterprises ban Cursor/Copilot from touching proprietary repos.
Install in one command
```bash
# Install the whole pack
tokrepo install pack/local-llm-runners

# Or pick the one runner you actually need
tokrepo install ollama
tokrepo install open-webui
tokrepo install tgi
```
Each asset's TokRepo page bundles the install command, the recommended config, and the model-pull command for the most common Llama / Qwen / DeepSeek weights.
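As a concrete example of that flow, here is the pull-and-chat sequence for Ollama. The `llama3:8b` tag is illustrative; each asset page lists the exact tags for the weights it recommends.

```bash
# Pull a quantized Llama 3 8B (Q4 GGUF by default) and chat with it
ollama pull llama3:8b
ollama run llama3:8b "Explain KV cache in two sentences."

# See what's in the local model store
ollama list
```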
Common pitfalls
- VRAM accounting: a "7B" model takes ~14 GB at FP16, ~4 GB at Q4. Always check the quantization file name before downloading.
- Context window vs RAM: a 32k context on a 7B model can use as much VRAM as the weights themselves (the arithmetic is sketched after this list). Lower the context if you OOM.
- Open WebUI on top of Ollama: a Dockerized Open WebUI can't reach an Ollama bound to localhost. Start Ollama with OLLAMA_HOST=0.0.0.0 so it listens on the network; many tutorials skip this step.
- TGI vs vLLM: TGI shines for HuggingFace-hosted models with sharded weights; vLLM is faster for raw throughput. Don't pick TGI just because it's older.
- Model licensing: Llama 3 is permissive but not MIT. Check the license before commercial deployment, especially for downstream fine-tunes.
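To make the first two bullets concrete, here is the back-of-envelope VRAM math as a runnable sketch. The shape numbers assume a classic Llama-style 7B with full multi-head attention; grouped-query-attention models (Llama 3 included) shrink the KV cache by the head-group ratio.

```bash
# VRAM back-of-envelope for a Llama-style 7B.
# Assumed shape: 32 layers, 4096 hidden dim, FP16 KV cache.
LAYERS=32; HIDDEN=4096; CTX=32768; BYTES=2

# The KV cache stores one key and one value vector per layer per token.
kv=$((2 * LAYERS * HIDDEN * CTX * BYTES))
echo "KV cache at ${CTX} tokens: $((kv / 1024**3)) GiB"   # ~16 GiB

# Weights: ~2 bytes/param at FP16, ~0.5 bytes/param at Q4 GGUF.
echo "7B weights: ~14 GiB FP16, ~4 GiB Q4"
```

At 32k tokens the KV cache (~16 GiB) does indeed rival the FP16 weights (~14 GiB), which is exactly why dropping the context is the first OOM fix to try.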
Relationship to other packs
The local-LLM-runners pack is the runtime layer. To make it useful end-to-end:
- Pair with the AI Second Brain pack — Logseq + Khoj indexing your notes against a local Ollama.
- Pair with LLM Eval & Guardrails to verify your local model isn't regressing vs the closed-source baseline.
- Pair with the Document AI Pipeline to feed PDFs into local inference instead of sending them to a vendor.
Together these four packs give you a fully air-gapped knowledge stack that never phones home. The boundary is clean: runners do inference, the eval pack scores quality, the second-brain pack handles retrieval, and the doc pipeline turns files into chunks. Mix and match by your privacy and latency targets, then layer Ollama or TGI underneath as the engine.
When to pick which runner
- Single-developer laptop, mostly chat: Ollama plus Jan as the UI. Five-minute install, GGUF Q4 weights, runs offline on the plane.
- Team behind a VPN, shared GPU server: TGI or vLLM behind a load balancer, Open WebUI as the team-facing front end with SSO (a docker sketch follows this list). One model, many users, no per-seat OpenAI bill.
- Mobile app demo or browser-only inference: MLC-LLM. Compiles weights to WebGPU/Metal/Vulkan and runs without a server at all — useful for offline mobile prototypes.
- Research lab fine-tuning on consumer GPUs: Text Generation WebUI. Built-in LoRA training, ExLlama backend, exotic loaders for the half-broken model checkpoints HuggingFace ships every week.
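A minimal sketch of the shared-GPU-server setup above, assuming Docker with the NVIDIA runtime; the image tags, model ID, and ports are placeholders to pin down before a real deployment.

```bash
# Serve Llama 3 8B with TGI (continuous batching; TGI has exposed an
# OpenAI-compatible /v1 Messages API since v1.4). Model ID is an example.
docker run -d --gpus all --shm-size 1g -p 8080:80 \
  -v tgi-data:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Meta-Llama-3-8B-Instruct

# Open WebUI as the team front end, speaking OpenAI protocol to TGI.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_API_BASE_URL=http://host.docker.internal:8080/v1 \
  -e OPENAI_API_KEY=unused \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

The same OPENAI_API_BASE_URL wiring works with vLLM or Ollama underneath, which is what makes Open WebUI the reusable front-end layer of this pack.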
Frequently asked questions
Is this stack really free, or are there hidden costs?
All seven runners are open-source and free to install. The cost is hardware — you need a GPU with enough VRAM for the model weights you choose. A consumer RTX 3090/4090 (24GB) handles 7B-13B models fluidly and 70B with aggressive quantization. M-series Macs work via Metal. Cloud GPU rental on Runpod or Vast.ai stays well under OpenAI API pricing for sustained workloads.
Which runner should I start with — Ollama or Jan?
Ollama if you live in the terminal and want OpenAI-compatible HTTP for your apps. Jan if you want a one-click desktop chat experience that mirrors ChatGPT. Many users run both: Ollama as the engine, with Jan or Open WebUI pointed at its API as the UI, so one set of GGUF weights serves every front end.
Will these work with Cursor or Codex CLI?
Yes — both Cursor and Codex CLI accept any OpenAI-compatible endpoint. Point them at http://localhost:11434/v1 (Ollama) or whichever port your runner exposes. Cursor calls this Custom OpenAI URL in settings. The catch: local 7B models trail GPT-4 on long-context refactors, so use a 70B+ if you want production-quality code edits.
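A quick smoke test before wiring an editor to it (assumes the llama3:8b tag from the Ollama quickstart is already pulled):

```bash
# Hit Ollama's OpenAI-compatible endpoint directly
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3:8b",
        "messages": [{"role": "user", "content": "Reply with one word."}]
      }'
```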
How does this differ from the LLM Eval & Guardrails pack?
This pack is the runtime that serves the model. The eval pack scores model output. They're complementary: install a runner here, then point DeepEval/Promptfoo at it to verify quality before swapping a cloud model for a local one. Most teams that go local need both packs.
What's the biggest gotcha after install?
Forgetting to set the context window to match your VRAM budget. Defaults are conservative (2k-4k), but if you load a 32k-trained model and pump it full of context, the KV cache balloons and you OOM mid-generation. Always check nvidia-smi during a real workload before going to production.
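For Ollama specifically, a Modelfile override is one way to pin the context to your budget; num_ctx is a real Ollama parameter, while the 8192 figure is an assumed budget to tune per card.

```bash
# Clone llama3:8b with a capped 8k context so the KV cache stays bounded
cat > Modelfile <<'EOF'
FROM llama3:8b
PARAMETER num_ctx 8192
EOF
ollama create llama3-8k -f Modelfile

# Then watch real VRAM use under a representative workload
watch -n 1 nvidia-smi
```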