TOKREPO · Arsenal de IA

Estable

IA Local-First — Tus Datos Nunca Salen del Portátil

Nueve picks open-source para un flujo de IA completo — chat, RAG sobre tus documentos, código, transcripción, generación de imágenes — todo corriendo en tu máquina. Sin claves OpenAI, sin facturas de tokens.

9 recursos

Sobre este pack

What's in this pack

This is the rig you build when you've decided your journal, your client recordings, and your half-written code are not going into someone else's training set. Every tool here is open-source, actively maintained, and runs with no outbound network call required once the models are downloaded.

The motivation is rarely just privacy in the abstract. It's three concrete things stacked: (1) the monthly token bill that scales with how curious you are, (2) terms of service that change, and (3) the dawning realization that you've been pasting your entire inbox into a chat window owned by a company that openly indexes it. A local stack fixes all three permanently.

This pack is not the same as our self-hosted-ai pack — that one is for shipping a SaaS on your own metal (Tabby, Onyx, LibreChat, n8n). This one is for individuals who want a private AI on a personal machine, including non-developer tools like meeting transcription and a notes app.

Install in this order

Ollama — model runner. Start here. Single command (curl -fsSL ollama.com/install.sh | sh), pulls models with ollama pull llama3.1, exposes an OpenAI-compatible API on localhost:11434. Everything downstream points at this.
GPT4All — alternative model runner with a GUI. If you don't live in a terminal, install this instead of (or alongside) Ollama. Same job, friendlier surface for non-devs.
Open WebUI — the local ChatGPT replacement. Talks to Ollama out of the box, supports multi-turn chat, RAG over uploaded files, web search plugins. This is where 80% of "I just want to ask the AI something" happens.
Continue — local coding assistant for VS Code and JetBrains. Configure it to call your local Ollama model instead of Copilot's servers. Inline edits, chat, refactor — all on-device. Slower than Copilot, but your private repo never leaves the machine.
Khoj — AI second brain. Indexes your Markdown notes, PDFs, org-mode, even Notion exports, then lets you chat with them via local LLM. This is the RAG layer for your life, not your codebase.
Faster Whisper — speech-to-text. 4x faster than vanilla Whisper, runs on CPU or GPU, OpenAI Whisper accuracy. Drop audio in, get a transcript out. Foundation for the next tool.
Meetily — privacy-first meeting assistant. Records, transcribes via Whisper locally, summarizes via your local LLM. Zoom/Meet recordings never touch a cloud.
ComfyUI — local image generation via Stable Diffusion. Node-based, fast on Apple Silicon and CUDA, runs SDXL / Flux / SD3 models pulled from Hugging Face. No prompt logging, no content policy, no usage cap.
Joplin — privacy-focused note app with optional end-to-end encryption. Where you keep the source material your local AI reads. Markdown, plugins, syncs between devices via your own storage.

How they fit together

        ┌─────────────────────────────────────┐
        │   Your laptop (no outbound calls)   │
        └─────────────────────────────────────┘
                       │
  ┌────────────────────┴────────────────────┐
  │                                          │
Ollama / GPT4All  ◄──── OpenAI-compatible API ────┐
  (model runner)                                  │
  │                                                │
  ├─► Open WebUI  ─── chat in browser              │
  │                                                │
  ├─► Continue    ─── code in VS Code              │
  │                                                │
  ├─► Khoj        ─── chat with your notes ◄── Joplin
  │                                                │
  └─► Meetily     ─── meeting summary ◄── Faster Whisper
                                                   │
ComfyUI ── standalone (its own model runtime) ─────┘

The trick is that all six client tools (Open WebUI, Continue, Khoj, Meetily, plus anything else you wire up) point at the single Ollama endpoint. You download a model once. Every app reuses it. Disk and RAM are the budgets to watch, not API quota.

Tradeoffs you'll hit

Cloud quality vs local quality — Be honest: GPT-5 / Claude 4.5 still beat any 8B-quant local model at frontier reasoning, long-context, and code generation on unfamiliar codebases. Local wins on privacy, latency for short prompts, cost at volume, and offline use. The right mental model is "local for 80% of daily work, cloud for the hard 20%" — not "local replaces cloud".
Apple Silicon vs NVIDIA — Apple Silicon M2/M3/M4 with 32 GB+ RAM runs 13B models comfortably via Metal/MPS. NVIDIA with 16 GB+ VRAM is faster on bigger models but louder, hotter, more expensive. Most of this pack runs well on a $2K Mac; ComfyUI and 70B models start asking for a real GPU.
Quantized vs full precision — Most Ollama models default to Q4_K_M (4-bit quantization). You lose maybe 2-3% accuracy for 4x less RAM. Always start quantized. Only go full precision if you can measure a quality gap that matters to you.

Common pitfalls

RAM blow-ups — running Open WebUI + Continue + Khoj simultaneously, each holding a model in memory, will OOM a 16 GB machine. Configure Ollama with OLLAMA_MAX_LOADED_MODELS=1 and let it page models in and out.
Model files are huge — Llama 3.1 70B is 40 GB on disk. Plan storage before you ollama pull everything that looks interesting. Keep a kill list.
MPS vs CUDA confusion — most install guides assume NVIDIA. On Apple Silicon, check for the -metal or mps variant of each tool. ComfyUI in particular needs the right Python wheel.
"Actually I do need cloud for X" — be at peace with it. Routing your frontier-difficulty queries to Claude/GPT through a privacy-aware client (LibreChat with logging off, or just the API with Bearer and no organization ID) is a sane hybrid.
Voice assistant ambition — Meetily + Faster Whisper handle batch transcription beautifully. Real-time conversational voice (sub-500ms latency, interruption) is still genuinely hard locally. Don't promise that to yourself in week one.

INSTALAR · UN COMANDO

$ tokrepo install pack/local-first-ai

pásalo a tu agente — o pégalo en tu terminal

Qué incluye

9 recursos listos para instalar

Skill#01

Ollama — Run LLMs Locally

Run large language models locally on your machine. Supports Llama 3, Mistral, Gemma, Phi, and dozens more. One-command install, OpenAI-compatible API.

by Script Depot·384 views

$ tokrepo install ollama-run-llms-locally-0eefb7ad

Skill#02

GPT4All — Run LLMs Privately on Your Desktop

GPT4All runs large language models privately on everyday desktops and laptops without GPUs or API calls. 77.2K+ GitHub stars. Desktop app + Python SDK, LocalDocs for private data. MIT licensed.

by AI Open Source·357 views

$ tokrepo install gpt4all-run-llms-privately-your-desktop-f493abd9

Skill#03

Open WebUI — Self-Hosted AI Chat Interface

User-friendly, self-hosted AI chat interface. Supports Ollama, OpenAI, Anthropic, and any OpenAI-compatible API. RAG, web search, voice, image gen, and plugins. 129K+ stars.

by Script Depot·417 views

$ tokrepo install open-webui-self-hosted-ai-chat-interface-5d37ffb8

Skill#04

Continue — Open-Source AI Code Assistant

Open-source AI code assistant for VS Code and JetBrains. Tab autocomplete, chat, inline editing with any model — OpenAI, Anthropic, Ollama, or self-hosted.

by Continue·403 views

$ tokrepo install continue-open-source-ai-code-assistant-8040c0e5

Skill#05

Khoj — Your AI Second Brain

Khoj is a personal AI app for chat, search, and knowledge management. 33.8K+ stars. Multi-LLM, docs, Obsidian, WhatsApp, custom agents. AGPL-3.0.

by AI Open Source·260 views

$ tokrepo install khoj-your-ai-second-brain-4cbd3b7b

Skill#06

Faster Whisper — 4x Faster Speech-to-Text

Faster Whisper is a reimplementation of OpenAI Whisper using CTranslate2, up to 4x faster with less memory. 21.8K+ GitHub stars. GPU/CPU, 8-bit quantization, word timestamps, VAD. MIT licensed.

by Script Depot·375 views

$ tokrepo install faster-whisper-4x-faster-speech-text-24576b2c

Skill#07

Meetily — Privacy-First AI Meeting Assistant with Local Transcription

An open-source, self-hosted AI meeting assistant that provides real-time transcription, speaker diarization, and local summarization using Whisper and Ollama, with no cloud dependency.

by AI Open Source·294 views

$ tokrepo install meetily-privacy-first-ai-meeting-assistant-local-3270e558

Skill#08

ComfyUI — Node-Based AI Image Generation

The most powerful modular AI image generation GUI with a node/graph editor. Supports Stable Diffusion, Flux, SDXL, ControlNet, and 1000+ custom nodes. 107K+ stars.

by AI Open Source·424 views

$ tokrepo install comfyui-node-based-ai-image-generation-02888d06

Skill#09

Joplin — Privacy-Focused Open-Source Note Taking App

Joplin is a privacy-focused note taking app with sync capabilities for Windows, macOS, Linux, Android, and iOS. Markdown-based, end-to-end encrypted sync, supports Nextcloud, Dropbox, OneDrive, S3, and WebDAV. The open-source alternative to Evernote.

by Script Depot·323 views

$ tokrepo install joplin-privacy-focused-open-source-note-taking-app-42403801

Preguntas frecuentes

Is local AI really private if I'm pulling models from Hugging Face / Ollama?

Yes — the model download is a one-time fetch of weights. Once the file is on disk, the model runs entirely offline. No prompt, no document, no transcript is ever sent to Hugging Face or Ollama servers. Verify with Little Snitch or lsof -i if you want proof. The trust boundary is the open-source model itself, not the distribution channel.

What hardware do I actually need for this stack?

Comfortable: Apple Silicon Mac with 32 GB unified RAM, or a Windows/Linux box with an NVIDIA GPU with 16 GB+ VRAM. Minimum viable: 16 GB RAM Mac runs 7-8B models and Faster Whisper fine but you'll juggle one model at a time. ComfyUI (image gen) is the most demanding piece; everything else is reasonable on a 4-year-old laptop.

How does this differ from the existing self-hosted-ai pack on TokRepo?

self-hosted-ai is dev-infra-focused: Tabby (coding server), Onyx (RAG-as-a-service), LibreChat (multi-user chat), n8n (workflow automation). It's what you deploy to a server when you want to give your team a private ChatGPT. This pack is the individual angle: Open WebUI for personal chat, Khoj for personal notes RAG, Meetily for your own meetings, ComfyUI for your image gen. Different problem, no overlapping picks.

What about Llama 3 / Mistral / Qwen — which model should I actually pull first?

For chat and general use: llama3.1:8b-instruct-q4_K_M (4.7 GB, fast, surprisingly good). For code in Continue: qwen2.5-coder:7b (4.7 GB, better at code than Llama for size). For RAG via Khoj: same Llama 3.1 8B works. Skip 70B until you've measured that 8B is actually failing you on real tasks — most people don't need it.

Can I still use Claude or GPT for the hard problems?

Absolutely, and you should. The point of this stack isn't fundamentalism — it's that the default should be local. When you hit a problem where 70B-quant clearly fails (deep code refactor across a strange repo, frontier-level reasoning, exotic language), route that one query to a frontier model. Hybrid is the realistic endpoint; pure-local for everything is a hobbyist trap.

MÁS DEL ARSENAL

12 packs · 80+ recursos seleccionados

Explora todos los packs curados en la página principal

Volver a todos los packs