TOKREPO · ARSENAL
Stable

Run LLMs Locally

Ollama, GPT4All, MLC-LLM, Jan, Open WebUI, Text Generation WebUI, TGI — every flavor of "no API key, my GPU."

7 assets

What's in this pack

# | Runner | Sweet spot | Backend
1 | Ollama | one-line CLI on Mac/Linux/Windows | llama.cpp
2 | GPT4All | desktop app, no GPU required | llama.cpp + GGUF
3 | MLC-LLM | iOS, Android, WebGPU | TVM compiler
4 | Jan | desktop replacement for ChatGPT | llama.cpp + remote APIs
5 | Open WebUI | ChatGPT-style UI for any OpenAI-compatible runner | reverse-proxies Ollama/vLLM/TGI
6 | Text Generation WebUI | research-grade UI with LoRA training | transformers + ExLlama + llama.cpp
7 | Hugging Face TGI | production serving with continuous batching | Rust + Python, multi-GPU

These seven runners cover the full spectrum: from "I want a chat window on my laptop" to "I'm putting Llama 3 behind a load balancer for 10k QPS."

Why local matters in 2026

Three forces have collapsed the cost gap between cloud APIs and self-hosted inference.

First, model quality. Open weights from Meta (Llama), Mistral, Qwen, and DeepSeek now match GPT-4-class capability on most reasoning and coding tasks. Skipping the OpenAI bill no longer means accepting worse output.

Second, hardware. A single RTX 4090 runs Llama 3 70B at usable speed via llama.cpp's GGUF Q4 quantization (the ~40 GB of Q4 weights exceed the card's 24 GB, so llama.cpp offloads the overflow layers to system RAM). Apple Silicon's unified memory changes the math entirely: an M3 Max with enough RAM runs 70B locally without thermal throttling. Even mid-range gaming laptops handle 8B models in real time.

Third, privacy and compliance. Healthcare, legal, finance, and EU GDPR-bound shops can't send PII to a third-party API. Local inference is often the simplest compliant path. The same applies to coding agents: most enterprises ban Cursor/Copilot from touching proprietary repos.

Install in one command

# Install the whole pack
tokrepo install pack/local-llm-runners

# Or pick the one runner you actually need
tokrepo install ollama
tokrepo install open-webui
tokrepo install tgi

Each asset's TokRepo page bundles the install command, the recommended config, and the model-pull command for the most common Llama / Qwen / DeepSeek weights.

Common pitfalls

  • VRAM accounting: a "7B" model takes ~14 GB at FP16, ~4 GB at Q4. Always check the quantization file name before downloading.
  • Context window vs RAM: a 32k context on a 7B model can use as much VRAM as the weights themselves. Lower the context if you OOM.
  • Open WebUI on top of Ollama: when Open WebUI runs in Docker, localhost inside the container is not your host machine. Start Ollama with OLLAMA_HOST=0.0.0.0 (or point Open WebUI at host.docker.internal) so the container can reach it; many tutorials skip this.
  • TGI vs vLLM: TGI shines for HuggingFace-hosted models with sharded weights; vLLM is faster for raw throughput. Don't pick TGI just because it's older.
  • Model licensing: Llama 3 is permissive but not MIT. Check the license before commercial deployment, especially for downstream fine-tunes.
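The VRAM arithmetic behind the first two bullets fits in a few lines. A rough sketch (assumes dense weights, ignores runtime overhead; GGUF Q4_K_M averages roughly 4.5 bits per parameter):

```python
def weight_vram_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate memory needed for the model weights alone, in decimal GB."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9

# FP16 is 16 bits/param; Q4_K_M averages ~4.5 bits/param
print(f"7B  @ FP16: {weight_vram_gb(7, 16):.1f} GB")   # ~14 GB
print(f"7B  @ Q4:   {weight_vram_gb(7, 4.5):.1f} GB")  # ~4 GB
print(f"70B @ Q4:   {weight_vram_gb(70, 4.5):.1f} GB") # ~39 GB, exceeds a 24 GB card
```

This is weights only; add the KV cache and runner overhead on top before deciding what fits on your card.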

Relationship to other packs

The local-LLM-runners pack is the runtime layer. To make it useful end-to-end:

  • Pair with the AI Second Brain pack — Logseq + Khoj indexing your notes against a local Ollama.
  • Pair with LLM Eval & Guardrails to verify your local model isn't regressing vs the closed-source baseline.
  • Pair with the Document AI Pipeline to feed PDFs into local inference instead of sending them to a vendor.

Together these three packs give you a fully air-gapped knowledge stack that never phones home. The boundary is clean: runners do inference, the eval pack scores quality, the second-brain pack handles retrieval, and the doc pipeline turns files into chunks. Mix and match by your privacy and latency targets, then layer Ollama or TGI underneath as the engine.

When to pick which runner

  • Single-developer laptop, mostly chat: Ollama plus Jan as the UI. Five-minute install, GGUF Q4 weights, runs offline on the plane.
  • Team behind a VPN, shared GPU server: TGI or vLLM behind a load balancer, Open WebUI as the team-facing front end with SSO. One model, many users, no per-seat OpenAI bill.
  • Mobile app demo or browser-only inference: MLC-LLM. Compiles weights to WebGPU/Metal/Vulkan and runs without a server at all — useful for offline mobile prototypes.
  • Research lab fine-tuning on consumer GPUs: Text Generation WebUI. Built-in LoRA training, ExLlama backend, exotic loaders for the half-broken model checkpoints HuggingFace ships every week.
What's inside

7 assets in this pack

Skill#01
Ollama Model Library — Best AI Models for Local Use

Curated guide to the best models available on Ollama for coding, chat, and reasoning. Compare Llama, Mistral, Gemma, Phi, and Qwen models for local AI development.

by Skill Factory·160 views
$ tokrepo install ollama-model-library-best-ai-models-local-use-4cecf968
Config#02
GPT4All — Run LLMs Privately on Your Desktop

GPT4All runs large language models privately on everyday desktops and laptops without GPUs or API calls. 77.2K+ GitHub stars. Desktop app + Python SDK, LocalDocs for private data. MIT licensed.

by AI Open Source·128 views
$ tokrepo install gpt4all-run-llms-privately-your-desktop-f493abd9
Script#03
MLC-LLM — Universal LLM Deployment Engine

Deploy any LLM on any hardware — phones, browsers, GPUs, CPUs. Compiles models for native performance on iOS, Android, WebGPU, CUDA, Metal, and Vulkan. 22K+ stars.

by Script Depot·102 views
$ tokrepo install mlc-llm-universal-llm-deployment-engine-735f5a27
Config#04
Text Generation WebUI — Local LLM Chat Interface

Text Generation WebUI is a Gradio interface for running LLMs locally. 46.4K+ GitHub stars. Multiple backends, vision, training, image gen, OpenAI-compatible API. 100% offline.

by AI Open Source·104 views
$ tokrepo install text-generation-webui-local-llm-chat-interface-11107806
Config#05
Jan — Offline AI Desktop App with Full Privacy

Jan is an open-source ChatGPT alternative that runs LLMs locally with full privacy. 41.4K+ GitHub stars. Desktop app for Windows/macOS/Linux, OpenAI-compatible API, MCP support. Apache 2.0.

by AI Open Source·103 views
$ tokrepo install jan-offline-ai-desktop-app-full-privacy-7b703194
Script#06
Open WebUI — Self-Hosted AI Chat Interface

User-friendly, self-hosted AI chat interface. Supports Ollama, OpenAI, Anthropic, and any OpenAI-compatible API. RAG, web search, voice, image gen, and plugins. 129K+ stars.

by Script Depot·96 views
$ tokrepo install open-webui-self-hosted-ai-chat-interface-5d37ffb8
Script#07
Text Generation Inference (TGI) — Hugging Face Production LLM Server

TGI is Hugging Face's production-grade LLM inference server. It powers HF Inference Endpoints with continuous batching, tensor parallelism, quantization, and OpenAI-compatible APIs — handling thousands of requests per second.

by Hugging Face·161 views
$ tokrepo install text-generation-inference-tgi-hugging-face-production-llm-e08ad222
FAQ

Frequently asked questions

Is this stack really free, or are there hidden costs?

All seven runners are open-source and free to install. The cost is hardware — you need a GPU with enough VRAM for the model weights you choose. A consumer RTX 3090/4090 (24GB) handles 7B-13B models comfortably and 70B only with aggressive quantization or CPU offload. M-series Macs work via Metal. Cloud GPU rental on Runpod or Vast.ai stays well under OpenAI API pricing for sustained workloads.

Which runner should I start with — Ollama or Jan?

Ollama if you live in the terminal and want OpenAI-compatible HTTP for your apps. Jan if you want a one-click desktop chat experience that mirrors ChatGPT. Many users run both: Ollama as the engine, with Jan or Open WebUI as the UI pointed at Ollama's API, so a single set of downloaded GGUF weights serves every front end.

Will these work with Cursor or Codex CLI?

Yes — both Cursor and Codex CLI accept any OpenAI-compatible endpoint. Point them at http://localhost:11434/v1 (Ollama) or whichever port your runner exposes; Cursor exposes this as a custom OpenAI base URL in its settings. The catch: local 7B models trail GPT-4 on long-context refactors, so use a 70B-class model if you want production-quality code edits.
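Under the hood, every OpenAI-compatible client boils down to one POST. A minimal stdlib sketch (assumes Ollama is running locally and you've pulled a model tagged `llama3`; swap in your own model name):

```python
import json
import urllib.request

payload = {
    "model": "llama3",  # any model you've pulled with `ollama pull`
    "messages": [{"role": "user", "content": "Explain GGUF in one sentence."}],
}
req = urllib.request.Request(
    "http://localhost:11434/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Sending requires a running Ollama instance:
# reply = json.load(urllib.request.urlopen(req))
# print(reply["choices"][0]["message"]["content"])
```

Any tool that lets you override the OpenAI base URL can speak this same protocol to your local runner.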

How does this differ from the LLM Eval & Guardrails pack?

This pack is the runtime that serves the model. The eval pack scores model output. They're complementary: install a runner here, then point DeepEval/Promptfoo at it to verify quality before swapping a cloud model for a local one. Most teams that go local need both packs.

What's the biggest gotcha after install?

Forgetting to set the context window to match your VRAM budget. Defaults are conservative (2k-4k), but if you load a 32k-trained model and pump it full of context, the KV cache balloons and you OOM mid-generation. Always check nvidia-smi during a real workload before going to production.
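To see why the KV cache balloons, here is a back-of-envelope estimate using Llama-3-8B-style geometry (32 layers, 8 KV heads under GQA, head dim 128; these numbers are assumptions that vary per model, and quantized KV caches shrink them further):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv_heads x head_dim x context x dtype size."""
    total = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1e9

# Llama-3-8B-ish geometry with an FP16 cache
print(f"{kv_cache_gb(32, 8, 128, 4096):.2f} GB at 4k context")   # ~0.54 GB
print(f"{kv_cache_gb(32, 8, 128, 32768):.2f} GB at 32k context") # ~4.29 GB
```

At 32k context the cache alone rivals the Q4 weights of a 7B model, which is exactly the OOM-mid-generation failure described above.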

MORE FROM THE ARSENAL

12 packs · 80+ hand-picked assets
