IA Local-First — Tus Datos Nunca Salen del Portátil
Nueve picks open-source para un flujo de IA completo — chat, RAG sobre tus documentos, código, transcripción, generación de imágenes — todo corriendo en tu máquina. Sin claves OpenAI, sin facturas de tokens.
What's in this pack
This is the rig you build when you've decided your journal, your client recordings, and your half-written code are not going into someone else's training set. Every tool here is open-source, actively maintained, and runs with no outbound network call required once the models are downloaded.
The motivation is rarely just privacy in the abstract. It's three concrete things stacked: (1) the monthly token bill that scales with how curious you are, (2) terms of service that change, and (3) the dawning realization that you've been pasting your entire inbox into a chat window owned by a company that openly indexes it. A local stack fixes all three permanently.
This pack is not the same as our self-hosted-ai pack — that one is for shipping a SaaS on your own metal (Tabby, Onyx, LibreChat, n8n). This one is for individuals who want a private AI on a personal machine, including non-developer tools like meeting transcription and a notes app.
Install in this order
- Ollama — model runner. Start here. Single command (
curl -fsSL ollama.com/install.sh | sh), pulls models withollama pull llama3.1, exposes an OpenAI-compatible API onlocalhost:11434. Everything downstream points at this. - GPT4All — alternative model runner with a GUI. If you don't live in a terminal, install this instead of (or alongside) Ollama. Same job, friendlier surface for non-devs.
- Open WebUI — the local ChatGPT replacement. Talks to Ollama out of the box, supports multi-turn chat, RAG over uploaded files, web search plugins. This is where 80% of "I just want to ask the AI something" happens.
- Continue — local coding assistant for VS Code and JetBrains. Configure it to call your local Ollama model instead of Copilot's servers. Inline edits, chat, refactor — all on-device. Slower than Copilot, but your private repo never leaves the machine.
- Khoj — AI second brain. Indexes your Markdown notes, PDFs, org-mode, even Notion exports, then lets you chat with them via local LLM. This is the RAG layer for your life, not your codebase.
- Faster Whisper — speech-to-text. 4x faster than vanilla Whisper, runs on CPU or GPU, OpenAI Whisper accuracy. Drop audio in, get a transcript out. Foundation for the next tool.
- Meetily — privacy-first meeting assistant. Records, transcribes via Whisper locally, summarizes via your local LLM. Zoom/Meet recordings never touch a cloud.
- ComfyUI — local image generation via Stable Diffusion. Node-based, fast on Apple Silicon and CUDA, runs SDXL / Flux / SD3 models pulled from Hugging Face. No prompt logging, no content policy, no usage cap.
- Joplin — privacy-focused note app with optional end-to-end encryption. Where you keep the source material your local AI reads. Markdown, plugins, syncs between devices via your own storage.
How they fit together
┌─────────────────────────────────────┐
│ Your laptop (no outbound calls) │
└─────────────────────────────────────┘
│
┌────────────────────┴────────────────────┐
│ │
Ollama / GPT4All ◄──── OpenAI-compatible API ────┐
(model runner) │
│ │
├─► Open WebUI ─── chat in browser │
│ │
├─► Continue ─── code in VS Code │
│ │
├─► Khoj ─── chat with your notes ◄── Joplin
│ │
└─► Meetily ─── meeting summary ◄── Faster Whisper
│
ComfyUI ── standalone (its own model runtime) ─────┘
The trick is that all six client tools (Open WebUI, Continue, Khoj, Meetily, plus anything else you wire up) point at the single Ollama endpoint. You download a model once. Every app reuses it. Disk and RAM are the budgets to watch, not API quota.
Tradeoffs you'll hit
- Cloud quality vs local quality — Be honest: GPT-5 / Claude 4.5 still beat any 8B-quant local model at frontier reasoning, long-context, and code generation on unfamiliar codebases. Local wins on privacy, latency for short prompts, cost at volume, and offline use. The right mental model is "local for 80% of daily work, cloud for the hard 20%" — not "local replaces cloud".
- Apple Silicon vs NVIDIA — Apple Silicon M2/M3/M4 with 32 GB+ RAM runs 13B models comfortably via Metal/MPS. NVIDIA with 16 GB+ VRAM is faster on bigger models but louder, hotter, more expensive. Most of this pack runs well on a $2K Mac; ComfyUI and 70B models start asking for a real GPU.
- Quantized vs full precision — Most Ollama models default to Q4_K_M (4-bit quantization). You lose maybe 2-3% accuracy for 4x less RAM. Always start quantized. Only go full precision if you can measure a quality gap that matters to you.
Common pitfalls
- RAM blow-ups — running Open WebUI + Continue + Khoj simultaneously, each holding a model in memory, will OOM a 16 GB machine. Configure Ollama with
OLLAMA_MAX_LOADED_MODELS=1and let it page models in and out. - Model files are huge — Llama 3.1 70B is 40 GB on disk. Plan storage before you
ollama pulleverything that looks interesting. Keep a kill list. - MPS vs CUDA confusion — most install guides assume NVIDIA. On Apple Silicon, check for the
-metalormpsvariant of each tool. ComfyUI in particular needs the right Python wheel. - "Actually I do need cloud for X" — be at peace with it. Routing your frontier-difficulty queries to Claude/GPT through a privacy-aware client (LibreChat with logging off, or just the API with
Bearerand no organization ID) is a sane hybrid. - Voice assistant ambition — Meetily + Faster Whisper handle batch transcription beautifully. Real-time conversational voice (sub-500ms latency, interruption) is still genuinely hard locally. Don't promise that to yourself in week one.
9 recursos listos para instalar
Preguntas frecuentes
Is local AI really private if I'm pulling models from Hugging Face / Ollama?
Yes — the model download is a one-time fetch of weights. Once the file is on disk, the model runs entirely offline. No prompt, no document, no transcript is ever sent to Hugging Face or Ollama servers. Verify with Little Snitch or lsof -i if you want proof. The trust boundary is the open-source model itself, not the distribution channel.
What hardware do I actually need for this stack?
Comfortable: Apple Silicon Mac with 32 GB unified RAM, or a Windows/Linux box with an NVIDIA GPU with 16 GB+ VRAM. Minimum viable: 16 GB RAM Mac runs 7-8B models and Faster Whisper fine but you'll juggle one model at a time. ComfyUI (image gen) is the most demanding piece; everything else is reasonable on a 4-year-old laptop.
How does this differ from the existing self-hosted-ai pack on TokRepo?
self-hosted-ai is dev-infra-focused: Tabby (coding server), Onyx (RAG-as-a-service), LibreChat (multi-user chat), n8n (workflow automation). It's what you deploy to a server when you want to give your team a private ChatGPT. This pack is the individual angle: Open WebUI for personal chat, Khoj for personal notes RAG, Meetily for your own meetings, ComfyUI for your image gen. Different problem, no overlapping picks.
What about Llama 3 / Mistral / Qwen — which model should I actually pull first?
For chat and general use: llama3.1:8b-instruct-q4_K_M (4.7 GB, fast, surprisingly good). For code in Continue: qwen2.5-coder:7b (4.7 GB, better at code than Llama for size). For RAG via Khoj: same Llama 3.1 8B works. Skip 70B until you've measured that 8B is actually failing you on real tasks — most people don't need it.
Can I still use Claude or GPT for the hard problems?
Absolutely, and you should. The point of this stack isn't fundamentalism — it's that the default should be local. When you hit a problem where 70B-quant clearly fails (deep code refactor across a strange repo, frontier-level reasoning, exotic language), route that one query to a frontier model. Hybrid is the realistic endpoint; pure-local for everything is a hobbyist trap.
12 packs · 80+ recursos seleccionados
Explora todos los packs curados en la página principal
Volver a todos los packs