Privacy-First Local AI — Pile IA 100% Locale
Dix picks open-source pour les secteurs où les données ne peuvent pas sortir du bâtiment — cabinets d'avocats, hôpitaux, administrations, commerce international. Runtime + UI + RAG + couche protocole, tout on-prem. Pas d'API vendor, pas de télémétrie.
What's in this pack
This pack is for the buyer who reads "we may use your data to improve our service" in a vendor contract and walks away. Law firms working privileged matters. Hospitals handling PHI. Government agencies on air-gapped networks. Cross-border traders whose contract terms can't sit on an OpenAI server in Texas. The whole point is no network call to anyone's API at inference time.
The stack splits into three layers plus the protocol glue:
- Runtime layer — the model engine that loads weights and serves tokens. Ollama, LM Studio, MLC-LLM live here.
- UI layer — what a non-technical lawyer or doctor actually opens. Open WebUI, Jan, GPT4All, LibreChat, Text Generation WebUI are all variants of "local ChatGPT."
- RAG layer — connects the model to your own documents (case files, patient charts, customs declarations). Khoj indexes a folder and serves it back to any local model.
- Speech layer — Faster Whisper transcribes meetings, depositions, and consultations locally so the audio never reaches a cloud STT vendor.
This is the privacy-first sibling of our local-first-ai pack (which optimizes for a developer's personal rig) and local-llm-runners (which compares engines side-by-side). This one is opinionated about the regulated-industry deployment shape — what to install when the buyer's compliance officer is the one signing off.
Install in this order
- Ollama — start here. Single binary, runs on macOS / Linux / Windows, pulls a model with
ollama pull llama3.1and serves an OpenAI-compatible HTTP API on localhost. Everything downstream — UI, RAG, agents — can be pointed at Ollama as if it were OpenAI, which means you can swap any cloud setup to fully local by changing one base URL. - LM Studio — the GUI alternative to Ollama for non-CLI users. Same job (load GGUF weights, serve a local OpenAI-compatible endpoint), but with a desktop app for model search, download, and chat. Use this on a lawyer's or doctor's workstation; use Ollama on the server.
- MLC-LLM — when Ollama is too slow on your hardware. Compiles models with TVM for native execution on Apple Silicon, NVIDIA, AMD, and even WebGPU. Worth the extra setup pain if your throughput target is real-time chat for 10+ concurrent users on a single workstation.
- Open WebUI — the ChatGPT clone. Docker-compose, points at Ollama's port, gives you multi-user accounts, chat history, model picker, document upload. This is what your end users actually open. The most-deployed local chat UI in regulated environments because of the per-user permission model.
- Jan — desktop app version of Open WebUI's idea. Cross-platform Electron client with built-in model downloader. Use this when each user gets their own laptop and you don't want to run a central server.
- GPT4All — same niche as Jan, different lineage (Nomic). Slightly smaller binary, opinionated default model picks, runs on CPU-only machines reasonably well. Worth installing on the ancient ThinkPads still on every loan officer's desk.
- LibreChat — the multi-provider chat UI for organizations that want one frontend across local + (carefully whitelisted) cloud. Useful in hybrid setups: privileged matters route to Ollama, general research routes to a contract-bound cloud provider. Per-user routing rules.
- Text Generation WebUI — the power-user fallback. Supports more model formats than anyone else (GGUF, GPTQ, AWQ, EXL2, transformers), exposes raw sampling parameters, and runs LoRA adapters. Install this when a researcher needs to fine-tune a model on internal data and still keep the workflow private.
- Khoj — the RAG layer. Watches a folder (case files, patient charts, customs PDFs), embeds them locally, serves them back via a chat UI or an API any other tool can call. Configure it to use Ollama for both embeddings and generation and your documents never leave the box.
- Faster Whisper — meeting / deposition / consultation transcription. 4x faster than vanilla Whisper at the same accuracy. Pipe its output into Khoj and you get a fully local "AI that joined the meeting" without sending audio to AssemblyAI, Deepgram, or OpenAI.
How they fit together
┌──────────────────────────────────────────────────────────┐
│ Your workstation / on-prem server (no outbound network) │
└──────────────────────────────────────────────────────────┘
Laptop GPU / server GPU / Apple Silicon NPU
│
▼
┌──────────────────────────────┐
│ Runtime: Ollama / LM Studio │ ◄── MLC-LLM if you need
│ (OpenAI-compatible API) │ real-time multi-user
└──────────────────────────────┘
│
┌─────────┴─────────────────────────┐
▼ ▼
Chat UIs RAG layer
├─ Open WebUI (browser, multi-user) Khoj ◄── folder watch
├─ Jan (desktop client) │ (case files,
├─ GPT4All (low-spec laptops) │ patient docs,
├─ LibreChat (hybrid routing) │ customs PDFs)
└─ Text Generation WebUI (power user) ▼
Local vector DB
(SQLite / Qdrant on disk)
│
▼
Faster Whisper ─► transcripts ─► Khoj ─► chat answers your
(audio in, text out, all local) recorded meetings
The pattern is rigid on purpose: one runtime at the bottom (so every downstream tool points to the same OpenAI-compatible URL), one or more UIs chosen by user audience, one RAG service if documents matter, one speech engine if meetings matter. Don't run two competing runtimes — that's how you end up with 30 GB of duplicate model weights on the same machine.
Tradeoffs you'll hit
- Hardware floor is real — a 7B model in 4-bit quantization wants ~5 GB RAM, runs at 15-30 tok/s on M1/M2. A 70B model wants 40 GB+ and runs at 2-5 tok/s on M3 Max — usable for batch, painful for interactive chat. Don't promise the compliance officer GPT-4 quality from a 16 GB MacBook.
- Quality vs frontier models — a strong open model (Llama 3.1 70B, Qwen 2.5 72B) is roughly GPT-4-class for short Q&A and noticeably weaker on long-form reasoning. For privileged drafting, expect to do 1-2 more revision passes than you would with a frontier cloud model. Most regulated buyers accept this; the alternative is sending privileged data to a third party.
- Maintenance cost — you now own the deployment. Model updates, GPU driver upgrades, vector DB compaction, log rotation — all yours. Budget 0.2-0.5 FTE of IT time per 50 active users, more in year one.
- Multi-user concurrency — Ollama serializes requests by default. For more than ~5 concurrent users on a single GPU, switch to a real inference server (vLLM, TGI) behind Open WebUI. The pack tools are correct for the single-team scale; the same UIs work in front of vLLM when you grow.
- Air-gap vs partial isolation — "fully local" splits into two postures. Air-gapped: machine has no network at all, models pre-loaded via USB, updates via approval workflow. Network-isolated: machine has network for OS updates only, all AI traffic stays on localhost. The second is much easier; the first is what some government / defense buyers actually require.
Common pitfalls
- M1 / 16 GB MacBook trying to run a 70B model — won't fit. It will load via mmap then thrash the SSD. Stick to 7-13B models on 16 GB, 30-34B on 32 GB, 70B on 64 GB+. Apple's unified memory makes this easier than discrete GPUs, but the math is still the math.
- Wrong-language model picks — most open base models are English-heavy. For Chinese legal text, pick Qwen 2.5 (Chinese-strong by design). For Japanese medical notes, pick a fine-tune like ELYZA. Choosing Llama 3.1 for a non-English corpus and then complaining about quality is the most common failure mode.
- RAG chunking strategy — naive 1000-token chunks destroy contract clauses (each clause is its own logical unit) and medical chart entries (each entry is timestamped and atomic). Configure Khoj's chunker to split on document-specific boundaries before you embed 50,000 case files and discover retrieval is useless.
- Forgetting to disable telemetry — Open WebUI, Jan, Text Generation WebUI all ship with optional anonymous telemetry. Turn it off explicitly in config and verify with a network sniffer. The whole point of this stack is no outbound calls; a single forgotten checkbox undermines the compliance story.
- Storing PHI / privileged data in cleartext on disk — your model weights might be local, but if Khoj's vector DB sits unencrypted on the same disk, a stolen laptop leaks everything anyway. Enable full-disk encryption (FileVault, LUKS, BitLocker) on every machine in the stack. This is the audit finding that gets caught last, after every other privacy control passes.
10 ressources prêtes à installer
Questions fréquentes
What can a Mac M1 with 16 GB RAM actually run from this stack?
Comfortably: Ollama or LM Studio loading Llama 3.1 8B or Qwen 2.5 7B at 4-bit quantization (~5 GB resident), Open WebUI or Jan as the chat UI, Khoj indexing a modest document set (<10,000 chunks), and Faster Whisper for short transcriptions. You will get 15-25 tok/s, perfectly usable for interactive chat and Q&A over your own files. What you cannot run: 70B models (they need 40 GB+), real-time multi-user concurrency (Ollama serializes), or large RAG corpora that need a dedicated vector DB process. Upgrade target if it matters: a Mac M3 Max with 64 GB lets you run 70B comfortably and serves a small team.
We're on Windows desktops with no GPU. Is this stack still viable?
Yes, but the model size collapses. Ollama, LM Studio, and GPT4All all run CPU-only on Windows. Stick to 3-7B models in heavy quantization (Q4_K_M or smaller) and expect 3-8 tok/s — usable for one user doing focused Q&A, painful for chat-bot-style back-and-forth. For document RAG on CPU, Khoj works but embedding 10,000 documents will take overnight. The realistic shape: a single mid-range workstation with one consumer GPU (RTX 4070 or better) becomes the local AI server, and everyone else's Windows laptop just opens Open WebUI in a browser pointed at that server. One GPU server beats ten CPU-only laptops every time.
How does a hospital plug this into electronic medical records (EMR / EHR)?
You do not feed the EMR database directly to a model — you extract a curated subset (patient notes, discharge summaries, lab reports) on a regular cadence into a vector index that Khoj or a similar RAG service watches. The EMR remains the source of truth; the AI stack only sees what your data engineer chose to expose, with PHI tokens replaced where required by hospital policy. Faster Whisper handles dictated patient notes locally before they ever touch a cloud STT vendor. The integration work is mostly building a one-way ETL from the EMR (Epic / Cerner / etc.) to a folder Khoj indexes, and getting hospital IT to bless the OS image. Expect 4-8 weeks for the first deployment, mostly compliance review, not engineering.
How long can this stack actually run offline / air-gapped?
Indefinitely, with caveats. Once Ollama or LM Studio has pulled the model weights you want, no further network is required for inference. Open WebUI, Khoj, Jan all run entirely on localhost. The maintenance cycle without network access is: model updates need to be sneakernet-imported (download the GGUF on a connected machine, USB to the air-gapped one), OS and dependency updates follow the same workflow, and your audit log says exactly which version of each tool ran. We have buyers in this pack's target audience running 6-12 months between any updates. The honest constraint isn't "can it run offline" — it's whether your team will tolerate falling behind the open-source release cadence. Pick a baseline (Llama 3.1, Ollama 0.3.x, Open WebUI 0.3.x) and only upgrade on a planned cadence.
When we upgrade to a newer model, do we lose chat history and indexed documents?
No, if you set up the storage layout deliberately. Open WebUI stores chats in its own SQLite DB; that is independent of which model Ollama is currently serving. Khoj stores embeddings in a separate vector store; switching the LLM doesn't invalidate them, but switching the embedding model does (because new vectors live in a different geometry). The practical rule: pin one embedding model (e.g. nomic-embed-text or BGE-M3) and never change it without re-indexing your whole corpus. The chat LLM you can upgrade freely — old chats remain readable, new chats use the new model. Document this in your runbook before the first user asks 'why did my notes disappear?'
12 packs · 80+ ressources sélectionnées
Découvrez tous les packs curatés sur la page d'accueil
Retour à tous les packs