Self-Hosted AI
Tabby, Onyx, LibreChat, and an n8n starter kit — keep your data on your own metal.
What's in this pack
This pack collects the six self-hosted AI assets that consistently show up when teams move off SaaS for compliance, cost, or sovereignty reasons. Three are coding/chat replacements (Tabby, LibreChat, Onyx). Three are infrastructure pieces (n8n AI starter kit, local STT, model gateway).
| # | Asset | Type | What it replaces |
|---|---|---|---|
| 1 | Tabby | self-hosted service | GitHub Copilot |
| 2 | Onyx | self-hosted service | Glean / enterprise ChatGPT |
| 3 | LibreChat | self-hosted UI | ChatGPT for the team |
| 4 | n8n AI starter kit | docker-compose | Zapier with AI nodes |
| 5 | Whisper STT (local) | service | Otter / Rev / cloud STT |
| 6 | Local model gateway | service | Hard-wired cloud API calls (it's LiteLLM configured for local-first routing) |
Why this matters
The default 2026 AI stack assumes you're comfortable sending your code, chats, and customer data to OpenAI / Anthropic / Google. For most consumer apps that's fine. For regulated industries (health, finance, legal), government work, or any team whose IP is the product, it's a non-starter. This pack is the assembled answer: a stack you can run on a single workstation or a small Kubernetes cluster that gives you Copilot-equivalent dev tools, ChatGPT-equivalent chat, and enterprise-search-equivalent retrieval, entirely on your own hardware.
The three headline replacements:
- Tabby is the Copilot stand-in. Self-host it, point your IDE at it, and you get inline code completion backed by whatever local model you load (DeepSeek-Coder, Qwen-Coder, etc.). On a single RTX 3090 you can get close to Copilot quality on most mainstream languages; a minimal run sketch follows this list.
- Onyx (formerly Danswer) is the enterprise-search stand-in. Connect it to Confluence, Notion, GitHub, and Slack, and it builds an internal ChatGPT that answers questions from your own docs, with hybrid vector + keyword search and citations.
- LibreChat is the team-ChatGPT stand-in. Multi-user, multi-model (works with local Ollama or cloud APIs as a fallback), conversation history, prompt library. The default UI when you want to give your team "a ChatGPT" without paying per seat.
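To make the Tabby claim concrete, here's a minimal sketch of standing it up under Docker, assuming an NVIDIA GPU and Tabby's default port. The model identifier is an assumption; check Tabby's model registry for current names.

# Run Tabby with GPU support; ~/.tabby caches downloaded models across restarts
docker run -d --gpus all -p 8080:8080 -v $HOME/.tabby:/data \
  tabbyml/tabby serve --model DeepseekCoder-6.7B --device cuda
# Then point the Tabby IDE extension at http://localhost:8080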
The three infrastructure pieces fill in the gaps. The n8n starter kit gives you a Docker Compose stack for n8n + Postgres + Qdrant + a local model: workflow automation on your own metal. Local Whisper means meeting transcripts and voice notes never leave your network (one way to run it is sketched below). The model gateway routes between local and cloud models so you fall back to Claude only when local can't answer.
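For the STT piece, a minimal sketch using the reference openai-whisper CLI; the pack may ship a different server, so treat this as one local option. It needs ffmpeg on the PATH.

# Transcribe a recording entirely on-box
pip install -U openai-whisper
whisper meeting.wav --model medium --output_format txt
# The transcript is written next to the audio file; nothing leaves your network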
Install in one command
# Install the entire pack
tokrepo install pack/self-hosted-ai
# Or pick the piece you actually need
tokrepo install tabby
tokrepo install onyx
tokrepo install librechat
tokrepo install n8n-ai-starter-kit
The TokRepo CLI installs the docker-compose files, environment templates, and the rule files / subagents for your AI tool that explain when to invoke the local stack vs. the cloud. Run docker compose up -d after install and the services are reachable on localhost.
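A quick post-install smoke test, assuming the upstream default ports (8080 for Tabby, 3080 for LibreChat, 5678 for n8n); the pack's compose files may remap them, and the health path is an assumption.

# Bring the stack up, then spot-check each service
docker compose up -d
curl -s http://localhost:8080/v1/health    # Tabby (assumed default port)
curl -sI http://localhost:3080 | head -1   # LibreChat web UI
curl -sI http://localhost:5678 | head -1   # n8n editor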
Common pitfalls
- Don't run a 70B model on 16GB of VRAM. Match model size to your GPU: at 4-bit quantization, weights alone need roughly half a gigabyte per billion parameters, so a 70B model wants ~35GB before you count the KV cache. A ~7B DeepSeek-Coder fits on a 12GB card and is plenty for completion; for chat, Qwen2.5-32B in 4-bit is a sweet spot if you have 24GB.
- Onyx connectors silently rate-limit. When you point Onyx at a 50k-page Confluence, the initial sync takes hours and some connectors will pause. Watch the logs; don't trust the UI's progress bar in the first 24 hours.
- n8n + AI workflows leak credentials. The starter kit ships with default Postgres credentials in plaintext. Change them, and bind n8n behind Cloudflare Tunnel or a reverse proxy with auth before exposing it.
- LibreChat permissions are flat by default. Out of the box every user can see every conversation. Configure RBAC and per-user model whitelisting before you onboard a team.
- Backups aren't automatic. Self-hosted means self-backup. Schedule database dumps (mongodump for LibreChat's Mongo store, pg_dump for Onyx's Postgres) and snapshot the Tabby model cache; budget storage at 3× your active dataset for restore points (a minimal backup sketch follows this list).
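In the sketch below, the container and database names are placeholders; match them to whatever the pack's compose files actually create.

# Nightly dumps: Onyx's Postgres and LibreChat's Mongo store
docker exec onyx-postgres pg_dump -U postgres postgres | gzip > /backups/onyx-$(date +%F).sql.gz
docker exec librechat-mongodb mongodump --archive --gzip > /backups/librechat-$(date +%F).archive.gz
# Snapshot the Tabby model cache so a restore skips multi-GB re-downloads
tar czf /backups/tabby-models-$(date +%F).tar.gz -C $HOME .tabby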
Relationship to other packs
This pack pairs naturally with two others. MCP Server Stack gives you the protocol-level connectors (filesystem, browser, database MCP servers) that route through your local model gateway — so even Claude Code can call your local services. LLM Observability matters more here than on cloud APIs because you own the failure surface; Langfuse self-hosted is in that pack and integrates cleanly with Onyx and LibreChat.
If you're starting from zero, install order: 1) LibreChat (immediate user value), 2) Tabby (developer value), 3) Onyx (org-wide search), 4) n8n + gateway when you start building automations on top.
Frequently asked questions
Is Tabby free?
Yes, Tabby is open-source under Apache 2.0 with a free self-hosted Community edition. There's a paid Enterprise tier for SSO, audit logs, and SLAs, but the Community edition is fully featured for individual and small-team use. You only pay for the GPU you run it on. Same model for Onyx, LibreChat, and n8n — all OSS with optional paid tiers.
Will this work with Cursor or Codex CLI instead of Claude Code?
The self-hosted services are tool-agnostic. Tabby ships its own IDE extensions (VS Code, JetBrains, Vim/Neovim), so completion works no matter which agent you run beside it. LibreChat is a web UI, so it's tool-independent. The TokRepo CLI installs the AI-tool-specific config (Cursor rules, AGENTS.md, Claude Code subagents) that tells your agent the local services exist.
How does Tabby compare to Cursor with a local model?
Cursor's local-model support is limited to specific endpoints; Tabby is purpose-built for self-hosted code completion, with telemetry, model warmup, and a dedicated serving backend. If you want an IDE-agnostic, multi-team self-hosted Copilot, Tabby wins. If you specifically want Cursor's UX with a local model behind it, see the local model gateway in this pack: it can present an OpenAI-compatible endpoint that Cursor can point at.
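If you go the gateway route, the endpoint speaks the OpenAI wire format, so anything that accepts a custom OpenAI base URL works against it. A hedged sketch, assuming LiteLLM's default proxy port and a made-up model alias:

# Any OpenAI-compatible client can hit the local gateway
curl -s http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "local-qwen", "messages": [{"role": "user", "content": "Hello"}]}'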
What's the difference vs the MCP Server Stack pack?
MCP Server Stack is about protocol-level connectors so AI tools can read your filesystem, browser, database. Self-Hosted AI is about replacing the cloud LLM/UI/IDE assistant entirely with services on your own hardware. They're complementary: the MCP servers can be configured to route through your local model gateway, giving you a fully on-prem agent stack.
When should I NOT self-host?
When latency matters more than sovereignty (real-time voice is punishing to serve locally, and sub-300ms code completion is hard to hit even with a small model), when your usage is too low to justify a GPU ($100/mo of API calls is cheaper than a 4090 amortized over 3 years), or when you don't have ops support to handle backups, model upgrades, and the inevitable 2 a.m. OOM. Self-hosting is real ops work; budget for it.