[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-local-llm-runners-en":3,"seo:pack:local-llm-runners:en":74},{"code":4,"message":5,"data":6},200,"操作成功",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":73},"local-llm-runners","💻","#374151","stable","Stable","Run LLMs Locally","Ollama, GPT4All, MLC-LLM, Jan, Open WebUI, Text Generation WebUI, TGI — every flavor of \"no API key, my GPU.\"",[16,28,36,44,51,58,65],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},771,"4cecf968-aa84-47ec-9f32-c3b11432c18f","ollama-model-library-best-ai-models-local-use-4cecf968","Ollama Model Library — Best AI Models for Local Use","Curated guide to the best models available on Ollama for coding, chat, and reasoning. Compare Llama, Mistral, Gemma, Phi, and Qwen models for local AI development.","Skill Factory",490,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":34,"view_count":35,"vote_count":24,"lang_type":25,"type":26,"type_label":27},274,"f493abd9-0870-49b3-a04b-719ee2a5df0f","gpt4all-run-llms-privately-your-desktop-f493abd9","GPT4All — Run LLMs Privately on Your Desktop","GPT4All runs large language models privately on everyday desktops and laptops without GPUs or API calls. 77.2K+ GitHub stars. Desktop app + Python SDK, LocalDocs for private data. MIT licensed.","AI Open Source",303,{"id":37,"uuid":38,"slug":39,"title":40,"description":41,"author_name":42,"view_count":43,"vote_count":24,"lang_type":25,"type":26,"type_label":27},232,"735f5a27-07d6-47ac-8377-e29be76a9452","mlc-llm-universal-llm-deployment-engine-735f5a27","MLC-LLM — Universal LLM Deployment Engine","Deploy any LLM on any hardware — phones, browsers, GPUs, CPUs. Compiles models for native performance on iOS, Android, WebGPU, CUDA, Metal, and Vulkan. 
22K+ stars.","Script Depot",338,{"id":45,"uuid":46,"slug":47,"title":48,"description":49,"author_name":34,"view_count":50,"vote_count":24,"lang_type":25,"type":26,"type_label":27},282,"11107806-c69a-4b75-8360-d0504ff602d7","text-generation-webui-local-llm-chat-interface-11107806","Text Generation WebUI — Local LLM Chat Interface","Text Generation WebUI is a Gradio interface for running LLMs locally. 46.4K+ GitHub stars. Multiple backends, vision, training, image gen, OpenAI-compatible API. 100% offline.",401,{"id":52,"uuid":53,"slug":54,"title":55,"description":56,"author_name":34,"view_count":57,"vote_count":24,"lang_type":25,"type":26,"type_label":27},278,"7b703194-ec0f-4244-a98e-3ec206a883b8","jan-offline-ai-desktop-app-full-privacy-7b703194","Jan — Offline AI Desktop App with Full Privacy","Jan is an open-source ChatGPT alternative that runs LLMs locally with full privacy. 41.4K+ GitHub stars. Desktop app for Windows\u002FmacOS\u002FLinux, OpenAI-compatible API, MCP support. Apache 2.0.",323,{"id":59,"uuid":60,"slug":61,"title":62,"description":63,"author_name":42,"view_count":64,"vote_count":24,"lang_type":25,"type":26,"type_label":27},218,"5d37ffb8-d351-4fb1-8665-bef4db25b275","open-webui-self-hosted-ai-chat-interface-5d37ffb8","Open WebUI — Self-Hosted AI Chat Interface","User-friendly, self-hosted AI chat interface. Supports Ollama, OpenAI, Anthropic, and any OpenAI-compatible API. RAG, web search, voice, image gen, and plugins. 129K+ stars.",348,{"id":66,"uuid":67,"slug":68,"title":69,"description":70,"author_name":71,"view_count":72,"vote_count":24,"lang_type":25,"type":26,"type_label":27},1303,"e08ad222-37db-11f1-9bc6-00163e2b0d79","text-generation-inference-tgi-hugging-face-production-llm-e08ad222","Text Generation Inference (TGI) — Hugging Face Production LLM Server","TGI is Hugging Face's production-grade LLM inference server. 
It powers HF Inference Endpoints with continuous batching, tensor parallelism, quantization, and OpenAI-compatible APIs — handling thousands of requests per second.","Hugging Face",434,"tokrepo install pack\u002Flocal-llm-runners",{"pageType":75,"pageKey":8,"locale":25,"title":76,"metaDescription":77,"h1":13,"tldr":78,"bodyMarkdown":79,"faq":80,"schema":96,"internalLinks":105,"citations":118,"wordCount":131,"generatedAt":132},"pack","Run LLMs Locally: 7 Open-Source Runners (Ollama, TGI, Jan)","Seven open-source runners — Ollama, GPT4All, MLC-LLM, Jan, Open WebUI, Text Generation WebUI, TGI — to serve LLMs on your own GPU. No API key needed.","Seven battle-tested open-source runtimes covering every flavor of local inference — laptop chat, GPU server, mobile, web UI, production endpoint. One TokRepo command installs the whole pack.","## What's in this pack\n\n| # | Runner | Sweet spot | Backend |\n|---|---|---|---|\n| 1 | Ollama | one-line CLI on Mac\u002FLinux\u002FWindows | llama.cpp |\n| 2 | GPT4All | desktop app, no GPU required | llama.cpp + GGUF |\n| 3 | MLC-LLM | iOS, Android, WebGPU | TVM compiler |\n| 4 | Jan | desktop replacement for ChatGPT | llama.cpp + remote APIs |\n| 5 | Open WebUI | ChatGPT-style UI for any OpenAI-compatible runner | reverse-proxies Ollama\u002FvLLM\u002FTGI |\n| 6 | Text Generation WebUI | research-grade UI with LoRA training | transformers + ExLlama + llama.cpp |\n| 7 | Hugging Face TGI | production serving with continuous batching | Rust + Python, multi-GPU |\n\nThese seven runners cover the full spectrum: from \"I want a chat window on my laptop\" to \"I'm putting Llama 3 behind a load balancer for 10k QPS.\"\n\n## Why local matters in 2026\n\nThree forces have collapsed the cost gap between cloud APIs and self-hosted inference.\n\nFirst, model quality. Open weights from Meta (Llama), Mistral, Qwen, and DeepSeek now match GPT-4-class capability on most reasoning and coding tasks. 
The penalty for not paying OpenAI is no longer a quality penalty.\n\nSecond, hardware. A single RTX 4090 (24 GB) can run Llama 3 70B through llama.cpp with GGUF Q4 quantization, but the roughly 40 GB of Q4 weights do not fit in VRAM, so some layers are offloaded to system RAM at a cost in speed. Apple Silicon's unified memory lets the GPU address most of system RAM, so a high-memory M3 Max runs a Q4 70B model locally. Even mid-range gaming laptops handle 8B models in real time.\n\nThird, privacy and compliance. Healthcare, legal, finance, and EU GDPR-bound shops can't send PII to a third-party API. For them, local inference is often the only compliant path. The same applies to coding agents — many enterprises ban Cursor\u002FCopilot from touching proprietary repos.\n\n## Install in one command\n\n```bash\n# Install the whole pack\ntokrepo install pack\u002Flocal-llm-runners\n\n# Or pick the one runner you actually need\ntokrepo install ollama\ntokrepo install open-webui\ntokrepo install tgi\n```\n\nEach asset's TokRepo page bundles the install command, the recommended config, and the model-pull command for the most common Llama \u002F Qwen \u002F DeepSeek weights.\n\n## Common pitfalls\n\n- **VRAM accounting**: a \"7B\" model takes ~14 GB at FP16, ~4 GB at Q4. Always check the quantization file name before downloading.\n- **Context window vs VRAM**: a 32k context on a 7B model can use as much VRAM as the weights themselves. Lower the context if you OOM.\n- **Open WebUI on top of Ollama**: Ollama's OpenAI-compatible API is on by default, but Ollama listens only on `127.0.0.1`. If Open WebUI runs in Docker or on another machine, set `OLLAMA_HOST=0.0.0.0` so it can reach Ollama. Many tutorials skip this.\n- **TGI vs vLLM**: TGI shines for HuggingFace-hosted models with sharded weights; vLLM is faster for raw throughput. Don't pick TGI just because it's older.\n- **Model licensing**: Llama 3 is permissive but not MIT. Check the license before commercial deployment, especially for downstream fine-tunes.\n\n## Relationship to other packs\n\nThe local-LLM-runners pack is the *runtime* layer. 
To make it useful end-to-end:\n\n- Pair with the **AI Second Brain** pack — Logseq + Khoj indexing your notes against a local Ollama.\n- Pair with **LLM Eval & Guardrails** to verify your local model isn't regressing vs the closed-source baseline.\n- Pair with the **Document AI Pipeline** to feed PDFs into local inference instead of sending them to a vendor.\n\nTogether these three packs give you a fully air-gapped knowledge stack that never phones home. The boundary is clean: runners do inference, the eval pack scores quality, the second-brain pack handles retrieval, and the doc pipeline turns files into chunks. Mix and match by your privacy and latency targets, then layer Ollama or TGI underneath as the engine.\n\n## When to pick which runner\n\n- **Single-developer laptop, mostly chat**: Ollama plus Jan as the UI. Five-minute install, GGUF Q4 weights, runs offline on the plane.\n- **Team behind a VPN, shared GPU server**: TGI or vLLM behind a load balancer, Open WebUI as the team-facing front end with SSO. One model, many users, no per-seat OpenAI bill.\n- **Mobile app demo or browser-only inference**: MLC-LLM. Compiles weights to WebGPU\u002FMetal\u002FVulkan and runs without a server at all — useful for offline mobile prototypes.\n- **Research lab fine-tuning on consumer GPUs**: Text Generation WebUI. Built-in LoRA training, ExLlama backend, exotic loaders for the half-broken model checkpoints HuggingFace ships every week.",[81,84,87,90,93],{"q":82,"a":83},"Is this stack really free, or are there hidden costs?","All seven runners are open-source and free to install. The cost is hardware — you need a GPU with enough VRAM for the model weights you choose. A consumer RTX 3090\u002F4090 (24GB) handles 7B-13B models fluidly and 70B with aggressive quantization. M-series Macs work via Metal. 
Cloud GPU rental on Runpod or Vast.ai can stay well under OpenAI API pricing for sustained workloads.",{"q":85,"a":86},"Which runner should I start with — Ollama or Jan?","Ollama if you live in the terminal and want OpenAI-compatible HTTP for your apps. Jan if you want a one-click desktop chat experience that mirrors ChatGPT. Many users run both: Ollama as the engine, Jan or Open WebUI as the UI. In that setup the UI talks to Ollama over HTTP, so the GGUF weights live once in Ollama's local model store.",{"q":88,"a":89},"Will these work with Cursor or Codex CLI?","Yes — both Cursor and Codex CLI accept any OpenAI-compatible endpoint. Point them at http:\u002F\u002Flocalhost:11434\u002Fv1 (Ollama) or whichever port your runner exposes. Cursor calls this Custom OpenAI URL in settings. The catch: local 7B models trail GPT-4 on long-context refactors, so use a 70B+ if you want production-quality code edits.",{"q":91,"a":92},"How does this differ from the LLM Eval & Guardrails pack?","This pack is the runtime that serves the model. The eval pack scores model output. They're complementary: install a runner here, then point DeepEval\u002FPromptfoo at it to verify quality before swapping a cloud model for a local one. Most teams that go local need both packs.",{"q":94,"a":95},"What's the biggest gotcha after install?","Forgetting to set the context window to match your VRAM budget. Defaults are conservative (2k-4k), but if you load a 32k-trained model and pump it full of context, the KV cache balloons and you OOM mid-generation. 
Always check `nvidia-smi` during a real workload before going to production.",{"@context":97,"@type":98,"name":13,"description":99,"numberOfItems":100,"publisher":101},"https:\u002F\u002Fschema.org","CollectionPage","Seven open-source runners that let you serve LLMs on your own GPU or laptop, no API key required.",7,{"@type":102,"name":103,"url":104},"Organization","TokRepo","https:\u002F\u002Ftokrepo.com",[106,110,114],{"url":107,"anchor":108,"reason":109},"\u002Fen\u002Fpacks\u002Fai-second-brain","AI Second Brain","local LLMs are the privacy backend",{"url":111,"anchor":112,"reason":113},"\u002Fen\u002Fpacks\u002Fllm-eval-guardrails","LLM Eval & Guardrails","evaluate local model quality",{"url":115,"anchor":116,"reason":117},"\u002Fen\u002Ftools\u002Follama","Ollama","the most popular runner in this pack",[119,123,127],{"claim":120,"source_name":121,"source_url":122},"Ollama is an open-source local LLM runtime with a public model library","ollama\u002Follama on GitHub","https:\u002F\u002Fgithub.com\u002Follama\u002Follama",{"claim":124,"source_name":125,"source_url":126},"Hugging Face Text Generation Inference (TGI) is the production-grade serving backend","huggingface\u002Ftext-generation-inference","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftext-generation-inference",{"claim":128,"source_name":129,"source_url":130},"Open WebUI provides a ChatGPT-style UI on top of any OpenAI-compatible runner","open-webui\u002Fopen-webui","https:\u002F\u002Fgithub.com\u002Fopen-webui\u002Fopen-webui",715,"2026-05-02T15:00:00Z"]