TOKREPO · Arsenal IA

Stable

Privacy-First Local AI — Pile IA 100% Locale

Dix picks open-source pour les secteurs où les données ne peuvent pas sortir du bâtiment — cabinets d'avocats, hôpitaux, administrations, commerce international. Runtime + UI + RAG + couche protocole, tout on-prem. Pas d'API vendor, pas de télémétrie.

10 ressources

À propos de ce pack

What's in this pack

This pack is for the buyer who reads "we may use your data to improve our service" in a vendor contract and walks away. Law firms working privileged matters. Hospitals handling PHI. Government agencies on air-gapped networks. Cross-border traders whose contract terms can't sit on an OpenAI server in Texas. The whole point is no network call to anyone's API at inference time.

The stack splits into three layers plus the protocol glue:

Runtime layer — the model engine that loads weights and serves tokens. Ollama, LM Studio, MLC-LLM live here.
UI layer — what a non-technical lawyer or doctor actually opens. Open WebUI, Jan, GPT4All, LibreChat, Text Generation WebUI are all variants of "local ChatGPT."
RAG layer — connects the model to your own documents (case files, patient charts, customs declarations). Khoj indexes a folder and serves it back to any local model.
Speech layer — Faster Whisper transcribes meetings, depositions, and consultations locally so the audio never reaches a cloud STT vendor.

This is the privacy-first sibling of our local-first-ai pack (which optimizes for a developer's personal rig) and local-llm-runners (which compares engines side-by-side). This one is opinionated about the regulated-industry deployment shape — what to install when the buyer's compliance officer is the one signing off.

Install in this order

Ollama — start here. Single binary, runs on macOS / Linux / Windows, pulls a model with ollama pull llama3.1 and serves an OpenAI-compatible HTTP API on localhost. Everything downstream — UI, RAG, agents — can be pointed at Ollama as if it were OpenAI, which means you can swap any cloud setup to fully local by changing one base URL.
LM Studio — the GUI alternative to Ollama for non-CLI users. Same job (load GGUF weights, serve a local OpenAI-compatible endpoint), but with a desktop app for model search, download, and chat. Use this on a lawyer's or doctor's workstation; use Ollama on the server.
MLC-LLM — when Ollama is too slow on your hardware. Compiles models with TVM for native execution on Apple Silicon, NVIDIA, AMD, and even WebGPU. Worth the extra setup pain if your throughput target is real-time chat for 10+ concurrent users on a single workstation.
Open WebUI — the ChatGPT clone. Docker-compose, points at Ollama's port, gives you multi-user accounts, chat history, model picker, document upload. This is what your end users actually open. The most-deployed local chat UI in regulated environments because of the per-user permission model.
Jan — desktop app version of Open WebUI's idea. Cross-platform Electron client with built-in model downloader. Use this when each user gets their own laptop and you don't want to run a central server.
GPT4All — same niche as Jan, different lineage (Nomic). Slightly smaller binary, opinionated default model picks, runs on CPU-only machines reasonably well. Worth installing on the ancient ThinkPads still on every loan officer's desk.
LibreChat — the multi-provider chat UI for organizations that want one frontend across local + (carefully whitelisted) cloud. Useful in hybrid setups: privileged matters route to Ollama, general research routes to a contract-bound cloud provider. Per-user routing rules.
Text Generation WebUI — the power-user fallback. Supports more model formats than anyone else (GGUF, GPTQ, AWQ, EXL2, transformers), exposes raw sampling parameters, and runs LoRA adapters. Install this when a researcher needs to fine-tune a model on internal data and still keep the workflow private.
Khoj — the RAG layer. Watches a folder (case files, patient charts, customs PDFs), embeds them locally, serves them back via a chat UI or an API any other tool can call. Configure it to use Ollama for both embeddings and generation and your documents never leave the box.
Faster Whisper — meeting / deposition / consultation transcription. 4x faster than vanilla Whisper at the same accuracy. Pipe its output into Khoj and you get a fully local "AI that joined the meeting" without sending audio to AssemblyAI, Deepgram, or OpenAI.

How they fit together

   ┌──────────────────────────────────────────────────────────┐
   │  Your workstation / on-prem server (no outbound network) │
   └──────────────────────────────────────────────────────────┘

   Laptop GPU / server GPU / Apple Silicon NPU
             │
             ▼
   ┌──────────────────────────────┐
   │  Runtime: Ollama / LM Studio │  ◄── MLC-LLM if you need
   │  (OpenAI-compatible API)     │      real-time multi-user
   └──────────────────────────────┘
             │
   ┌─────────┴─────────────────────────┐
   ▼                                   ▼
  Chat UIs                          RAG layer
   ├─ Open WebUI  (browser, multi-user)  Khoj  ◄── folder watch
   ├─ Jan  (desktop client)               │       (case files,
   ├─ GPT4All  (low-spec laptops)         │        patient docs,
   ├─ LibreChat  (hybrid routing)         │        customs PDFs)
   └─ Text Generation WebUI (power user)  ▼
                                       Local vector DB
                                       (SQLite / Qdrant on disk)
             │
             ▼
   Faster Whisper ─► transcripts ─► Khoj ─► chat answers your
   (audio in, text out, all local)            recorded meetings

The pattern is rigid on purpose: one runtime at the bottom (so every downstream tool points to the same OpenAI-compatible URL), one or more UIs chosen by user audience, one RAG service if documents matter, one speech engine if meetings matter. Don't run two competing runtimes — that's how you end up with 30 GB of duplicate model weights on the same machine.

Tradeoffs you'll hit

Hardware floor is real — a 7B model in 4-bit quantization wants ~5 GB RAM, runs at 15-30 tok/s on M1/M2. A 70B model wants 40 GB+ and runs at 2-5 tok/s on M3 Max — usable for batch, painful for interactive chat. Don't promise the compliance officer GPT-4 quality from a 16 GB MacBook.
Quality vs frontier models — a strong open model (Llama 3.1 70B, Qwen 2.5 72B) is roughly GPT-4-class for short Q&A and noticeably weaker on long-form reasoning. For privileged drafting, expect to do 1-2 more revision passes than you would with a frontier cloud model. Most regulated buyers accept this; the alternative is sending privileged data to a third party.
Maintenance cost — you now own the deployment. Model updates, GPU driver upgrades, vector DB compaction, log rotation — all yours. Budget 0.2-0.5 FTE of IT time per 50 active users, more in year one.
Multi-user concurrency — Ollama serializes requests by default. For more than ~5 concurrent users on a single GPU, switch to a real inference server (vLLM, TGI) behind Open WebUI. The pack tools are correct for the single-team scale; the same UIs work in front of vLLM when you grow.
Air-gap vs partial isolation — "fully local" splits into two postures. Air-gapped: machine has no network at all, models pre-loaded via USB, updates via approval workflow. Network-isolated: machine has network for OS updates only, all AI traffic stays on localhost. The second is much easier; the first is what some government / defense buyers actually require.

Common pitfalls

M1 / 16 GB MacBook trying to run a 70B model — won't fit. It will load via mmap then thrash the SSD. Stick to 7-13B models on 16 GB, 30-34B on 32 GB, 70B on 64 GB+. Apple's unified memory makes this easier than discrete GPUs, but the math is still the math.
Wrong-language model picks — most open base models are English-heavy. For Chinese legal text, pick Qwen 2.5 (Chinese-strong by design). For Japanese medical notes, pick a fine-tune like ELYZA. Choosing Llama 3.1 for a non-English corpus and then complaining about quality is the most common failure mode.
RAG chunking strategy — naive 1000-token chunks destroy contract clauses (each clause is its own logical unit) and medical chart entries (each entry is timestamped and atomic). Configure Khoj's chunker to split on document-specific boundaries before you embed 50,000 case files and discover retrieval is useless.
Forgetting to disable telemetry — Open WebUI, Jan, Text Generation WebUI all ship with optional anonymous telemetry. Turn it off explicitly in config and verify with a network sniffer. The whole point of this stack is no outbound calls; a single forgotten checkbox undermines the compliance story.
Storing PHI / privileged data in cleartext on disk — your model weights might be local, but if Khoj's vector DB sits unencrypted on the same disk, a stolen laptop leaks everything anyway. Enable full-disk encryption (FileVault, LUKS, BitLocker) on every machine in the stack. This is the audit finding that gets caught last, after every other privacy control passes.

INSTALLER · UNE COMMANDE

$ tokrepo install pack/privacy-first-local-ai

passez-la à votre agent — ou collez-la dans votre terminal

Ce qu'il contient

10 ressources prêtes à installer

Skill#01

Ollama — Run LLMs Locally

Run large language models locally on your machine. Supports Llama 3, Mistral, Gemma, Phi, and dozens more. One-command install, OpenAI-compatible API.

by Script Depot·377 views

$ tokrepo install ollama-run-llms-locally-0eefb7ad

Skill#02

Ollama Model Library — Best AI Models for Local Use

Curated guide to the best models available on Ollama for coding, chat, and reasoning. Compare Llama, Mistral, Gemma, Phi, and Qwen models for local AI development.

by Skill Factory·585 views

$ tokrepo install ollama-model-library-best-ai-models-local-use-4cecf968

Skill#03

Open WebUI — Self-Hosted AI Chat Interface

User-friendly, self-hosted AI chat interface. Supports Ollama, OpenAI, Anthropic, and any OpenAI-compatible API. RAG, web search, voice, image gen, and plugins. 129K+ stars.

by Script Depot·414 views

$ tokrepo install open-webui-self-hosted-ai-chat-interface-5d37ffb8

Skill#04

Text Generation WebUI — Local LLM Chat Interface

Text Generation WebUI is a Gradio interface for running LLMs locally. 46.4K+ GitHub stars. Multiple backends, vision, training, image gen, OpenAI-compatible API. 100% offline.

by AI Open Source·471 views

$ tokrepo install text-generation-webui-local-llm-chat-interface-11107806

Skill#05

GPT4All — Run LLMs Privately on Your Desktop

GPT4All runs large language models privately on everyday desktops and laptops without GPUs or API calls. 77.2K+ GitHub stars. Desktop app + Python SDK, LocalDocs for private data. MIT licensed.

by AI Open Source·353 views

$ tokrepo install gpt4all-run-llms-privately-your-desktop-f493abd9

Skill#06

Onyx — Self-Hosted AI Chat with 40+ Connectors

Onyx (formerly Danswer) is a self-hosted AI chat with RAG, custom agents, and 40+ knowledge connectors. 20.4K+ stars. Enterprise search. MIT.

by AI Open Source·458 views

$ tokrepo install onyx-self-hosted-ai-chat-40-connectors-210679a0

Skill#07

MLC-LLM — Universal LLM Deployment Engine

Deploy any LLM on any hardware — phones, browsers, GPUs, CPUs. Compiles models for native performance on iOS, Android, WebGPU, CUDA, Metal, and Vulkan. 22K+ stars.

by Script Depot·423 views

$ tokrepo install mlc-llm-universal-llm-deployment-engine-735f5a27

Skill#08

Jan — Offline AI Desktop App with Full Privacy

Jan is an open-source ChatGPT alternative that runs LLMs locally with full privacy. 41.4K+ GitHub stars. Desktop app for Windows/macOS/Linux, OpenAI-compatible API, MCP support. Apache 2.0.

by AI Open Source·403 views

$ tokrepo install jan-offline-ai-desktop-app-full-privacy-7b703194

Skill#09

Khoj — Your AI Second Brain

Khoj is a personal AI app for chat, search, and knowledge management. 33.8K+ stars. Multi-LLM, docs, Obsidian, WhatsApp, custom agents. AGPL-3.0.

by AI Open Source·258 views

$ tokrepo install khoj-your-ai-second-brain-4cbd3b7b

Skill#10

Faster Whisper — 4x Faster Speech-to-Text

Faster Whisper is a reimplementation of OpenAI Whisper using CTranslate2, up to 4x faster with less memory. 21.8K+ GitHub stars. GPU/CPU, 8-bit quantization, word timestamps, VAD. MIT licensed.

by Script Depot·372 views

$ tokrepo install faster-whisper-4x-faster-speech-text-24576b2c

Questions fréquentes

What can a Mac M1 with 16 GB RAM actually run from this stack?

Comfortably: Ollama or LM Studio loading Llama 3.1 8B or Qwen 2.5 7B at 4-bit quantization (~5 GB resident), Open WebUI or Jan as the chat UI, Khoj indexing a modest document set (<10,000 chunks), and Faster Whisper for short transcriptions. You will get 15-25 tok/s, perfectly usable for interactive chat and Q&A over your own files. What you cannot run: 70B models (they need 40 GB+), real-time multi-user concurrency (Ollama serializes), or large RAG corpora that need a dedicated vector DB process. Upgrade target if it matters: a Mac M3 Max with 64 GB lets you run 70B comfortably and serves a small team.

We're on Windows desktops with no GPU. Is this stack still viable?

Yes, but the model size collapses. Ollama, LM Studio, and GPT4All all run CPU-only on Windows. Stick to 3-7B models in heavy quantization (Q4_K_M or smaller) and expect 3-8 tok/s — usable for one user doing focused Q&A, painful for chat-bot-style back-and-forth. For document RAG on CPU, Khoj works but embedding 10,000 documents will take overnight. The realistic shape: a single mid-range workstation with one consumer GPU (RTX 4070 or better) becomes the local AI server, and everyone else's Windows laptop just opens Open WebUI in a browser pointed at that server. One GPU server beats ten CPU-only laptops every time.

How does a hospital plug this into electronic medical records (EMR / EHR)?

You do not feed the EMR database directly to a model — you extract a curated subset (patient notes, discharge summaries, lab reports) on a regular cadence into a vector index that Khoj or a similar RAG service watches. The EMR remains the source of truth; the AI stack only sees what your data engineer chose to expose, with PHI tokens replaced where required by hospital policy. Faster Whisper handles dictated patient notes locally before they ever touch a cloud STT vendor. The integration work is mostly building a one-way ETL from the EMR (Epic / Cerner / etc.) to a folder Khoj indexes, and getting hospital IT to bless the OS image. Expect 4-8 weeks for the first deployment, mostly compliance review, not engineering.

How long can this stack actually run offline / air-gapped?

Indefinitely, with caveats. Once Ollama or LM Studio has pulled the model weights you want, no further network is required for inference. Open WebUI, Khoj, Jan all run entirely on localhost. The maintenance cycle without network access is: model updates need to be sneakernet-imported (download the GGUF on a connected machine, USB to the air-gapped one), OS and dependency updates follow the same workflow, and your audit log says exactly which version of each tool ran. We have buyers in this pack's target audience running 6-12 months between any updates. The honest constraint isn't "can it run offline" — it's whether your team will tolerate falling behind the open-source release cadence. Pick a baseline (Llama 3.1, Ollama 0.3.x, Open WebUI 0.3.x) and only upgrade on a planned cadence.

When we upgrade to a newer model, do we lose chat history and indexed documents?

No, if you set up the storage layout deliberately. Open WebUI stores chats in its own SQLite DB; that is independent of which model Ollama is currently serving. Khoj stores embeddings in a separate vector store; switching the LLM doesn't invalidate them, but switching the embedding model does (because new vectors live in a different geometry). The practical rule: pin one embedding model (e.g. nomic-embed-text or BGE-M3) and never change it without re-indexing your whole corpus. The chat LLM you can upgrade freely — old chats remain readable, new chats use the new model. Document this in your runbook before the first user asks 'why did my notes disappear?'

PLUS DANS L'ARSENAL

12 packs · 80+ ressources sélectionnées

Découvrez tous les packs curatés sur la page d'accueil

Retour à tous les packs