[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"pack-detail-privacy-first-local-ai-en":3,"seo:pack:privacy-first-local-ai:en":92},{"code":4,"message":5,"data":6},200,"Operation successful",{"pack":7},{"slug":8,"icon":9,"tone":10,"status":11,"status_label":12,"title":13,"description":14,"items":15,"install_cmd":91},"privacy-first-local-ai","🔒","#059669","new","New · this week","Privacy-First Local AI — A 100% On-Device AI Stack","Ten open-source picks for industries where data cannot leave the building — law firms, hospitals, government, cross-border trade. Runtime + UI + RAG + protocol layer, all on-prem. No vendor API, no telemetry, no \"we may use your prompts to improve our service.\"",[16,28,36,43,51,58,65,71,78,84],{"id":17,"uuid":18,"slug":19,"title":20,"description":21,"author_name":22,"view_count":23,"vote_count":24,"lang_type":25,"type":26,"type_label":27},162,"0eefb7ad-754e-4f35-8967-586ebf4c2a6a","ollama-run-llms-locally-0eefb7ad","Ollama — Run LLMs Locally","Run large language models locally on your machine. Supports Llama 3, Mistral, Gemma, Phi, and dozens more. One-command install, OpenAI-compatible API.","Script Depot",197,0,"en","skill","Skill",{"id":29,"uuid":30,"slug":31,"title":32,"description":33,"author_name":34,"view_count":35,"vote_count":24,"lang_type":25,"type":26,"type_label":27},771,"4cecf968-aa84-47ec-9f32-c3b11432c18f","ollama-model-library-best-ai-models-local-use-4cecf968","Ollama Model Library — Best AI Models for Local Use","Curated guide to the best models available on Ollama for coding, chat, and reasoning. Compare Llama, Mistral, Gemma, Phi, and Qwen models for local AI development.","Skill Factory",331,{"id":37,"uuid":38,"slug":39,"title":40,"description":41,"author_name":22,"view_count":42,"vote_count":24,"lang_type":25,"type":26,"type_label":27},218,"5d37ffb8-d351-4fb1-8665-bef4db25b275","open-webui-self-hosted-ai-chat-interface-5d37ffb8","Open WebUI — Self-Hosted AI Chat Interface","User-friendly, self-hosted AI chat interface. 
Supports Ollama, OpenAI, Anthropic, and any OpenAI-compatible API. RAG, web search, voice, image gen, and plugins. 129K+ stars.",208,{"id":44,"uuid":45,"slug":46,"title":47,"description":48,"author_name":49,"view_count":50,"vote_count":24,"lang_type":25,"type":26,"type_label":27},282,"11107806-c69a-4b75-8360-d0504ff602d7","text-generation-webui-local-llm-chat-interface-11107806","Text Generation WebUI — Local LLM Chat Interface","Text Generation WebUI is a Gradio interface for running LLMs locally. 46.4K+ GitHub stars. Multiple backends, vision, training, image gen, OpenAI-compatible API. 100% offline.","AI Open Source",256,{"id":52,"uuid":53,"slug":54,"title":55,"description":56,"author_name":49,"view_count":57,"vote_count":24,"lang_type":25,"type":26,"type_label":27},274,"f493abd9-0870-49b3-a04b-719ee2a5df0f","gpt4all-run-llms-privately-your-desktop-f493abd9","GPT4All — Run LLMs Privately on Your Desktop","GPT4All runs large language models privately on everyday desktops and laptops without GPUs or API calls. 77.2K+ GitHub stars. Desktop app + Python SDK, LocalDocs for private data. MIT licensed.",225,{"id":59,"uuid":60,"slug":61,"title":62,"description":63,"author_name":49,"view_count":64,"vote_count":24,"lang_type":25,"type":26,"type_label":27},321,"210679a0-712f-4ec5-8d69-e0a016361c95","onyx-self-hosted-ai-chat-40-connectors-210679a0","Onyx — Self-Hosted AI Chat with 40+ Connectors","Onyx (formerly Danswer) is a self-hosted AI chat with RAG, custom agents, and 40+ knowledge connectors. 20.4K+ stars. Enterprise search. MIT.",251,{"id":66,"uuid":67,"slug":68,"title":69,"description":70,"author_name":22,"view_count":57,"vote_count":24,"lang_type":25,"type":26,"type_label":27},232,"735f5a27-07d6-47ac-8377-e29be76a9452","mlc-llm-universal-llm-deployment-engine-735f5a27","MLC-LLM — Universal LLM Deployment Engine","Deploy any LLM on any hardware — phones, browsers, GPUs, CPUs. Compiles models for native performance on iOS, Android, WebGPU, CUDA, Metal, and Vulkan. 
22K+ stars.",{"id":72,"uuid":73,"slug":74,"title":75,"description":76,"author_name":49,"view_count":77,"vote_count":24,"lang_type":25,"type":26,"type_label":27},278,"7b703194-ec0f-4244-a98e-3ec206a883b8","jan-offline-ai-desktop-app-full-privacy-7b703194","Jan — Offline AI Desktop App with Full Privacy","Jan is an open-source ChatGPT alternative that runs LLMs locally with full privacy. 41.4K+ GitHub stars. Desktop app for Windows\u002FmacOS\u002FLinux, OpenAI-compatible API, MCP support. Apache 2.0.",214,{"id":79,"uuid":80,"slug":81,"title":82,"description":83,"author_name":49,"view_count":17,"vote_count":24,"lang_type":25,"type":26,"type_label":27},323,"4cbd3b7b-5251-4a16-a4ef-d7c1f9600d52","khoj-your-ai-second-brain-4cbd3b7b","Khoj — Your AI Second Brain","Khoj is a personal AI app for chat, search, and knowledge management. 33.8K+ stars. Multi-LLM, docs, Obsidian, WhatsApp, custom agents. AGPL-3.0.",{"id":85,"uuid":86,"slug":87,"title":88,"description":89,"author_name":22,"view_count":90,"vote_count":24,"lang_type":25,"type":26,"type_label":27},270,"24576b2c-a9d1-4f7a-9696-b1e5c50a17f3","faster-whisper-4x-faster-speech-text-24576b2c","Faster Whisper — 4x Faster Speech-to-Text","Faster Whisper is a reimplementation of OpenAI Whisper using CTranslate2, up to 4x faster with less memory. 21.8K+ GitHub stars. GPU\u002FCPU, 8-bit quantization, word timestamps, VAD. MIT licensed.",202,"tokrepo install pack\u002Fprivacy-first-local-ai",{"pageType":93,"pageKey":8,"locale":25,"title":94,"metaDescription":95,"h1":96,"tldr":97,"bodyMarkdown":98,"faq":99,"schema":115,"internalLinks":120,"citations":133,"wordCount":146,"generatedAt":147},"pack","Privacy-First Local AI — 10 Open-Source Tools for a 100% On-Device AI Stack","Ollama, LM Studio, Open WebUI, Jan, GPT4All, LibreChat, MLC-LLM, Text Generation WebUI, Khoj, Faster Whisper — the runtime, UI, RAG, and protocol layer of a privacy-first AI rig that never phones home. 
Built for law firms, hospitals, government, and any regulated industry.","Privacy-First Local AI — A Full Stack That Never Leaves the Building","Ten open-source picks organized by layer — runtime (Ollama, LM Studio, MLC-LLM), UI (Open WebUI, Jan, GPT4All, LibreChat, Text Generation WebUI), RAG (Khoj), and speech (Faster Whisper). Stand up a complete AI workflow where nothing — model weights, prompts, retrieved chunks, transcripts — ever crosses the network boundary you control.","## What's in this pack\n\nThis pack is for the buyer who reads \"we may use your data to improve our service\" in a vendor contract and walks away. Law firms working privileged matters. Hospitals handling PHI. Government agencies on air-gapped networks. Cross-border traders whose contract terms can't sit on an OpenAI server in Texas. The whole point is **no network call to anyone's API at inference time**.\n\nThe stack splits into four layers:\n\n- **Runtime layer** — the model engine that loads weights and serves tokens. Ollama, LM Studio, MLC-LLM live here.\n- **UI layer** — what a non-technical lawyer or doctor actually opens. Open WebUI, Jan, GPT4All, LibreChat, Text Generation WebUI are all variants of \"local ChatGPT.\"\n- **RAG layer** — connects the model to your own documents (case files, patient charts, customs declarations). Khoj indexes a folder and serves it back to any local model.\n- **Speech layer** — Faster Whisper transcribes meetings, depositions, and consultations locally so the audio never reaches a cloud STT vendor.\n\nThis is the **privacy-first sibling** of our `local-first-ai` pack (which optimizes for a developer's personal rig) and `local-llm-runners` (which compares engines side-by-side). This one is opinionated about the **regulated-industry deployment shape** — what to install when the buyer's compliance officer is the one signing off.\n\n## Install in this order\n\n1. **Ollama** — start here. 
Single binary, runs on macOS \u002F Linux \u002F Windows, pulls a model with `ollama pull llama3.1` and serves an OpenAI-compatible HTTP API on localhost. Everything downstream — UI, RAG, agents — can be pointed at Ollama as if it were OpenAI, which means you can swap any cloud setup to fully local by changing one base URL.\n2. **LM Studio** — the GUI alternative to Ollama for non-CLI users. Same job (load GGUF weights, serve a local OpenAI-compatible endpoint), but with a desktop app for model search, download, and chat. Use this on a lawyer's or doctor's workstation; use Ollama on the server.\n3. **MLC-LLM** — when Ollama is too slow on your hardware. Compiles models with TVM for native execution on Apple Silicon, NVIDIA, AMD, and even WebGPU. Worth the extra setup pain if your throughput target is real-time chat for 10+ concurrent users on a single workstation.\n4. **Open WebUI** — the ChatGPT clone. Docker-compose, points at Ollama's port, gives you multi-user accounts, chat history, model picker, document upload. This is what your end users actually open. The most-deployed local chat UI in regulated environments because of the per-user permission model.\n5. **Jan** — desktop app version of Open WebUI's idea. Cross-platform Electron client with built-in model downloader. Use this when each user gets their own laptop and you don't want to run a central server.\n6. **GPT4All** — same niche as Jan, different lineage (Nomic). Slightly smaller binary, opinionated default model picks, runs on CPU-only machines reasonably well. Worth installing on the ancient ThinkPads still on every loan officer's desk.\n7. **LibreChat** — the multi-provider chat UI for organizations that want one frontend across local + (carefully whitelisted) cloud. Useful in hybrid setups: privileged matters route to Ollama, general research routes to a contract-bound cloud provider. Per-user routing rules.\n8. **Text Generation WebUI** — the power-user fallback. 
Supports more model formats than anyone else (GGUF, GPTQ, AWQ, EXL2, transformers), exposes raw sampling parameters, and runs LoRA adapters. Install this when a researcher needs to fine-tune a model on internal data and still keep the workflow private.\n9. **Khoj** — the RAG layer. Watches a folder (case files, patient charts, customs PDFs), embeds them locally, serves them back via a chat UI or an API any other tool can call. Configure it to use Ollama for both embeddings and generation and your documents never leave the box.\n10. **Faster Whisper** — meeting \u002F deposition \u002F consultation transcription. 4x faster than vanilla Whisper at the same accuracy. Pipe its output into Khoj and you get a fully local \"AI that joined the meeting\" without sending audio to AssemblyAI, Deepgram, or OpenAI.\n\n## How they fit together\n\n```\n   ┌──────────────────────────────────────────────────────────┐\n   │  Your workstation \u002F on-prem server (no outbound network) │\n   └──────────────────────────────────────────────────────────┘\n\n   Laptop GPU \u002F server GPU \u002F Apple Silicon NPU\n             │\n             ▼\n   ┌──────────────────────────────┐\n   │  Runtime: Ollama \u002F LM Studio │  ◄── MLC-LLM if you need\n   │  (OpenAI-compatible API)     │      real-time multi-user\n   └──────────────────────────────┘\n             │\n   ┌─────────┴─────────────────────────┐\n   ▼                                   ▼\n  Chat UIs                          RAG layer\n   ├─ Open WebUI  (browser, multi-user)  Khoj  ◄── folder watch\n   ├─ Jan  (desktop client)               │       (case files,\n   ├─ GPT4All  (low-spec laptops)         │        patient docs,\n   ├─ LibreChat  (hybrid routing)         │        customs PDFs)\n   └─ Text Generation WebUI (power user)  ▼\n                                       Local vector DB\n                                       (SQLite \u002F Qdrant on disk)\n             │\n             ▼\n   Faster Whisper ─► transcripts ─► Khoj 
─► chat answers your\n   (audio in, text out, all local)            recorded meetings\n```\n\nThe pattern is rigid on purpose: **one runtime** at the bottom (so every downstream tool points to the same OpenAI-compatible URL), **one or more UIs** chosen by user audience, **one RAG service** if documents matter, **one speech engine** if meetings matter. Don't run two competing runtimes — that's how you end up with 30 GB of duplicate model weights on the same machine.\n\n## Tradeoffs you'll hit\n\n- **Hardware floor is real** — a 7B model in 4-bit quantization wants ~5 GB RAM, runs at 15-30 tok\u002Fs on M1\u002FM2. A 70B model wants 40 GB+ and runs at 2-5 tok\u002Fs on M3 Max — usable for batch, painful for interactive chat. Don't promise the compliance officer GPT-4 quality from a 16 GB MacBook.\n- **Quality vs frontier models** — a strong open model (Llama 3.1 70B, Qwen 2.5 72B) is roughly GPT-4-class for short Q&A and noticeably weaker on long-form reasoning. For privileged drafting, expect to do 1-2 more revision passes than you would with a frontier cloud model. Most regulated buyers accept this; the alternative is sending privileged data to a third party.\n- **Maintenance cost** — you now own the deployment. Model updates, GPU driver upgrades, vector DB compaction, log rotation — all yours. Budget 0.2-0.5 FTE of IT time per 50 active users, more in year one.\n- **Multi-user concurrency** — Ollama serializes requests by default. For more than ~5 concurrent users on a single GPU, switch to a real inference server (vLLM, TGI) behind Open WebUI. The pack tools are correct for the single-team scale; the same UIs work in front of vLLM when you grow.\n- **Air-gap vs partial isolation** — \"fully local\" splits into two postures. **Air-gapped**: machine has no network at all, models pre-loaded via USB, updates via approval workflow. **Network-isolated**: machine has network for OS updates only, all AI traffic stays on localhost. 
The second is much easier; the first is what some government \u002F defense buyers actually require.\n\n## Common pitfalls\n\n- **M1 \u002F 16 GB MacBook trying to run a 70B model** — won't fit. It will load via mmap then thrash the SSD. Stick to 7-13B models on 16 GB, 30-34B on 32 GB, 70B on 64 GB+. Apple's unified memory makes this easier than discrete GPUs, but the math is still the math.\n- **Wrong-language model picks** — most open base models are English-heavy. For Chinese legal text, pick Qwen 2.5 (Chinese-strong by design). For Japanese medical notes, pick a fine-tune like ELYZA. Choosing Llama 3.1 for a non-English corpus and then complaining about quality is the most common failure mode.\n- **RAG chunking strategy** — naive 1000-token chunks destroy contract clauses (each clause is its own logical unit) and medical chart entries (each entry is timestamped and atomic). Configure Khoj's chunker to split on document-specific boundaries before you embed 50,000 case files and discover retrieval is useless.\n- **Forgetting to disable telemetry** — Open WebUI, Jan, Text Generation WebUI all ship with optional anonymous telemetry. Turn it off explicitly in config and verify with a network sniffer. The whole point of this stack is no outbound calls; a single forgotten checkbox undermines the compliance story.\n- **Storing PHI \u002F privileged data in cleartext on disk** — your model weights might be local, but if Khoj's vector DB sits unencrypted on the same disk, a stolen laptop leaks everything anyway. Enable full-disk encryption (FileVault, LUKS, BitLocker) on every machine in the stack. 
This is the audit finding that gets caught last, after every other privacy control passes.",[100,103,106,109,112],{"q":101,"a":102},"What can a Mac M1 with 16 GB RAM actually run from this stack?","Comfortably: Ollama or LM Studio loading Llama 3.1 8B or Qwen 2.5 7B at 4-bit quantization (~5 GB resident), Open WebUI or Jan as the chat UI, Khoj indexing a modest document set (\u003C10,000 chunks), and Faster Whisper for short transcriptions. You will get 15-25 tok\u002Fs, perfectly usable for interactive chat and Q&A over your own files. What you cannot run: 70B models (they need 40 GB+), real-time multi-user concurrency (Ollama serializes), or large RAG corpora that need a dedicated vector DB process. Upgrade target if it matters: a Mac M3 Max with 64 GB lets you run 70B comfortably and serves a small team.",{"q":104,"a":105},"We're on Windows desktops with no GPU. Is this stack still viable?","Yes, but the model size collapses. Ollama, LM Studio, and GPT4All all run CPU-only on Windows. Stick to 3-7B models in heavy quantization (Q4_K_M or smaller) and expect 3-8 tok\u002Fs — usable for one user doing focused Q&A, painful for chat-bot-style back-and-forth. For document RAG on CPU, Khoj works but embedding 10,000 documents will take overnight. The realistic shape: a single mid-range workstation with one consumer GPU (RTX 4070 or better) becomes the local AI server, and everyone else's Windows laptop just opens Open WebUI in a browser pointed at that server. One GPU server beats ten CPU-only laptops every time.",{"q":107,"a":108},"How does a hospital plug this into electronic medical records (EMR \u002F EHR)?","You do not feed the EMR database directly to a model — you extract a curated subset (patient notes, discharge summaries, lab reports) on a regular cadence into a vector index that Khoj or a similar RAG service watches. 
The EMR remains the source of truth; the AI stack only sees what your data engineer chose to expose, with PHI tokens replaced where required by hospital policy. Faster Whisper handles dictated patient notes locally before they ever touch a cloud STT vendor. The integration work is mostly building a one-way ETL from the EMR (Epic \u002F Cerner \u002F etc.) to a folder Khoj indexes, and getting hospital IT to bless the OS image. Expect 4-8 weeks for the first deployment, mostly compliance review, not engineering.",{"q":110,"a":111},"How long can this stack actually run offline \u002F air-gapped?","Indefinitely, with caveats. Once Ollama or LM Studio has pulled the model weights you want, no further network is required for inference. Open WebUI, Khoj, Jan all run entirely on localhost. The maintenance cycle without network access is: model updates need to be sneakernet-imported (download the GGUF on a connected machine, USB to the air-gapped one), OS and dependency updates follow the same workflow, and your audit log says exactly which version of each tool ran. We have buyers in this pack's target audience running 6-12 months between any updates. The honest constraint isn't \"can it run offline\" — it's whether your team will tolerate falling behind the open-source release cadence. Pick a baseline (Llama 3.1, Ollama 0.3.x, Open WebUI 0.3.x) and only upgrade on a planned cadence.",{"q":113,"a":114},"When we upgrade to a newer model, do we lose chat history and indexed documents?","No, if you set up the storage layout deliberately. Open WebUI stores chats in its own SQLite DB; that is independent of which model Ollama is currently serving. Khoj stores embeddings in a separate vector store; switching the LLM doesn't invalidate them, but switching the embedding model does (because new vectors live in a different geometry). The practical rule: pin one embedding model (e.g. nomic-embed-text or BGE-M3) and never change it without re-indexing your whole corpus. 
The chat LLM you can upgrade freely — old chats remain readable, new chats use the new model. Document this in your runbook before the first user asks 'why did my notes disappear?'",{"@context":116,"@type":117,"name":13,"description":118,"numberOfItems":119,"inLanguage":25},"https:\u002F\u002Fschema.org","ItemList","Ten open-source tools to assemble a fully local AI workflow for regulated industries — runtime layer (Ollama, LM Studio, MLC-LLM), UI layer (Open WebUI, Jan, GPT4All, LibreChat, Text Generation WebUI), RAG layer (Khoj), and speech transcription (Faster Whisper). Nothing leaves your network boundary.",10,[121,125,129],{"url":122,"anchor":123,"reason":124},"\u002Fen\u002Flocal-first-ai","Local-First AI — developer's personal private rig","Sister pack from a developer-rig angle: chat \u002F code \u002F image gen \u002F transcription on one laptop",{"url":126,"anchor":127,"reason":128},"\u002Fen\u002Flocal-llm-runners","Compare local LLM runners side-by-side","Deep dive on the runtime layer alone — Ollama vs LM Studio vs MLC-LLM vs Jan vs more",{"url":130,"anchor":131,"reason":132},"\u002Fen\u002Fpersonal-knowledge-base-rag","Personal RAG over notes and PDFs","Complementary RAG-focused pack for personal knowledge bases rather than regulated documents",[134,138,142],{"claim":135,"source_name":136,"source_url":137},"Ollama exposes an OpenAI-compatible HTTP API on localhost","Ollama OpenAI compatibility docs","https:\u002F\u002Fgithub.com\u002Follama\u002Follama\u002Fblob\u002Fmain\u002Fdocs\u002Fopenai.md",{"claim":139,"source_name":140,"source_url":141},"Open WebUI is a self-hosted multi-user chat interface for local LLMs","Open WebUI project","https:\u002F\u002Fgithub.com\u002Fopen-webui\u002Fopen-webui",{"claim":143,"source_name":144,"source_url":145},"Faster Whisper is a reimplementation of OpenAI Whisper using CTranslate2 for faster local inference","Faster Whisper 
repository","https:\u002F\u002Fgithub.com\u002FSYSTRAN\u002Ffaster-whisper",920,"2026-05-23T00:00:00Z"]