Pal MCP Server — Multi-Model AI Gateway for Claude Code
MCP server that lets Claude Code use Gemini, OpenAI, Grok, and Ollama as a unified AI dev team. Features model routing, CLI-to-CLI bridge, and conversation continuity across 7+ providers.
Safe staging for this asset
This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.
npx -y tokrepo@latest install 09c904b2-4bf7-4f1e-acf5-55cd465b6227 --target codex
Stages files first; activation requires review of the staged README and plan.
Why multi-model matters inside one agent
Claude Code is excellent at reasoning. Gemini 2.5 Pro has a 1M-token context window. GPT-4o is fast. Grok has live web access. Ollama runs offline. A real dev team would use all of them. From Claude Code's perspective, Pal MCP collapses that into one tool call: ask it to "call Gemini on this 800K-token codebase" and Pal routes the request, returns the result, and maintains conversation continuity.
Single-config setup
Add to .mcp.json:
{
  "mcpServers": {
    "pal": {
      "command": "uvx",
      "args": ["--from", "git+https://github.com/BeehiveInnovations/pal-mcp-server.git", "pal-mcp-server"],
      "env": {
        "GEMINI_API_KEY": "your-gemini-key",
        "OPENAI_API_KEY": "your-openai-key",
        "DEFAULT_MODEL": "auto"
      }
    }
  }
}
Restart Claude Code. Now pal_chat, pal_route, and pal_continue are callable.
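If a project already has an .mcp.json with other servers in it, the "pal" entry can be merged in rather than pasted by hand. The following is a minimal sketch of that merge using only the Python standard library. The helper name add_pal_server is hypothetical, and the entry is copied from the configuration above with placeholder keys.

```python
import json
from pathlib import Path

# Server entry copied from the configuration above; the API keys are placeholders.
PAL_ENTRY = {
    "command": "uvx",
    "args": [
        "--from",
        "git+https://github.com/BeehiveInnovations/pal-mcp-server.git",
        "pal-mcp-server",
    ],
    "env": {
        "GEMINI_API_KEY": "your-gemini-key",
        "OPENAI_API_KEY": "your-openai-key",
        "DEFAULT_MODEL": "auto",
    },
}


def add_pal_server(config_path: Path) -> dict:
    """Add the "pal" entry to an .mcp.json file, keeping any other servers.

    Hypothetical helper for illustration; it is not part of Pal itself.
    """
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    # Create the "mcpServers" object if it is missing, then set or overwrite "pal".
    config.setdefault("mcpServers", {})["pal"] = PAL_ENTRY
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Calling add_pal_server(Path(".mcp.json")) from the project root would leave existing servers untouched and overwrite only a previous "pal" entry.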
The routing logic
Set DEFAULT_MODEL=auto and Pal picks a model based on task heuristics:
| Task signal | Routed model | Why |
|---|---|---|
| Context > 200K tokens | Gemini 2.5 Pro | 1M-token context window |
| Needs live web facts | Grok | Twitter/X integration |
| Code completion loops | Ollama (Code Llama) | Free, fast, local |
| Long reasoning chains | o3 | Best deliberation |
| Default | Claude Sonnet | Quality baseline |
Override per-call with pal_chat(model="gpt-4o").
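The routing table can be read as a chain of checks with a fallback. The sketch below is one way to express it, not Pal's actual implementation: the Task fields, the order in which signals are checked, and the model identifier strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    """Hypothetical task signals; the field names are illustrative only."""
    context_tokens: int = 0
    needs_live_web: bool = False
    is_completion_loop: bool = False
    needs_long_reasoning: bool = False


def route_model(task: Task, override: Optional[str] = None) -> str:
    """Pick a model following the table above; an explicit override wins."""
    if override:
        return override
    # The precedence of the checks below is an assumption; the table does not state one.
    if task.context_tokens > 200_000:
        return "gemini-2.5-pro"
    if task.needs_live_web:
        return "grok"
    if task.is_completion_loop:
        return "ollama/codellama"
    if task.needs_long_reasoning:
        return "o3"
    return "claude-sonnet"
```

For example, route_model(Task(context_tokens=500_000)) selects the large-context model, while passing override="gpt-4o" mirrors the per-call override and bypasses the heuristics entirely.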
CLI-to-CLI bridge
Pal exposes a raw CLI bridge: call Aider, Continue, or any CLI-based agent from within Claude Code. Useful for chaining specialized agents in a single workflow.
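Conceptually, a CLI bridge starts another command-line program, hands it a prompt, and returns what it prints. The sketch below shows that pattern with Python's subprocess module. It is not Pal's bridge code: run_cli_agent is a hypothetical helper, and the stand-in "agent" is a Python one-liner, because real tools such as Aider each have their own way of receiving a prompt.

```python
import subprocess
import sys
from typing import List


def run_cli_agent(argv: List[str], prompt: str, timeout_s: float = 120.0) -> str:
    """Run an external CLI, send the prompt on stdin, and return its stdout.

    Raises subprocess.CalledProcessError on a non-zero exit status and
    subprocess.TimeoutExpired if the process runs longer than timeout_s.
    """
    result = subprocess.run(
        argv,
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout


# Stand-in agent for demonstration: upper-cases whatever it reads from stdin.
echo_agent = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper())"]
```

Passing the command as an argument list rather than a shell string avoids shell quoting problems when the prompt contains special characters.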
Conversation continuity
Every Pal call can continue an existing thread:
pal_continue(thread_id="xyz", prompt="refactor based on Gemini's suggestions")
Thread state is persisted in SQLite under ~/.pal/threads.db. Survives restarts.
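To make the persistence idea concrete, here is a minimal sketch of a thread store backed by SQLite. The table layout, the ThreadStore class, and its method names are assumptions for illustration; the actual schema of ~/.pal/threads.db is not described on this page. The demo defaults to an in-memory database so it leaves no file behind.

```python
import sqlite3
import uuid
from typing import List, Optional, Tuple

# Hypothetical schema: one row per message, grouped by thread_id.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    thread_id TEXT NOT NULL,
    role TEXT NOT NULL,
    model TEXT,
    content TEXT NOT NULL
)
"""


class ThreadStore:
    def __init__(self, db_path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" to keep threads across restarts.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(SCHEMA)

    def start(self) -> str:
        """Return a fresh thread id."""
        return uuid.uuid4().hex

    def append(self, thread_id: str, role: str, content: str,
               model: Optional[str] = None) -> None:
        """Store one message; the with-block commits the insert."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO messages (thread_id, role, model, content) "
                "VALUES (?, ?, ?, ?)",
                (thread_id, role, model, content),
            )

    def history(self, thread_id: str) -> List[Tuple[str, Optional[str], str]]:
        """Return (role, model, content) rows for a thread in insertion order."""
        cursor = self.conn.execute(
            "SELECT role, model, content FROM messages "
            "WHERE thread_id = ? ORDER BY id",
            (thread_id,),
        )
        return cursor.fetchall()
```

Continuing a thread then amounts to loading history(thread_id), sending it along with the new prompt to whichever model is selected, and appending the reply.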
Supported providers in 2026
- Anthropic (Claude Opus, Sonnet, Haiku)
- OpenAI (GPT-4o, o3, o3-mini)
- Google (Gemini 2.5 Pro, Flash)
- xAI (Grok-3)
- DeepSeek (R1, V3)
- Ollama (local, 50+ models)
- LiteLLM (proxy for 100+ more)
Cost control
Pal emits a cost summary per session: total tokens, a per-model breakdown, and a dollar estimate. Set the MAX_COST_PER_SESSION=5 environment variable to hard-stop runaway loops.
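A per-session budget of this kind can be sketched as a small accumulator that raises once the limit is crossed. The class below is illustrative rather than Pal's code: SessionCost, BudgetExceeded, and the per-1K-token prices passed in are all hypothetical, and real provider pricing differs by model and by input versus output tokens.

```python
from collections import defaultdict
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Raised when the session's estimated cost goes over the limit."""


class SessionCost:
    def __init__(self, max_cost_usd: float, price_per_1k_tokens: Dict[str, float]) -> None:
        self.max_cost_usd = max_cost_usd
        self.prices = price_per_1k_tokens  # hypothetical flat price per model
        self.tokens: Dict[str, int] = defaultdict(int)
        self.cost: Dict[str, float] = defaultdict(float)

    @property
    def total_cost(self) -> float:
        return sum(self.cost.values())

    def record(self, model: str, tokens: int) -> None:
        """Add one call's usage, then stop the session if it is over budget."""
        self.tokens[model] += tokens
        self.cost[model] += tokens / 1000 * self.prices[model]
        if self.total_cost > self.max_cost_usd:
            raise BudgetExceeded(
                f"estimated ${self.total_cost:.2f} exceeds limit ${self.max_cost_usd:.2f}"
            )

    def summary(self) -> dict:
        """Total tokens, per-model breakdown, and a dollar estimate."""
        return {
            "total_tokens": sum(self.tokens.values()),
            "per_model": {
                m: {"tokens": self.tokens[m], "usd": self.cost[m]} for m in self.tokens
            },
            "total_usd": self.total_cost,
        }
```

In a setup like the one described above, the limit would come from the environment, for example float(os.environ.get("MAX_COST_PER_SESSION", "5")), and the calling loop would catch BudgetExceeded to end the session cleanly.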
When Pal is not the right choice
- Single-model workflows — overhead not worth it, use the provider SDK directly.
- Production agents — MCP is still evolving; use LiteLLM Proxy for production-grade routing.
- Compliance-regulated environments — each upstream provider has different data policies; Pal doesn't unify compliance.
Frequently Asked Questions
How is Pal different from LiteLLM?
LiteLLM is a Python proxy library designed for production backends. Pal is an MCP server designed for interactive use inside agents like Claude Code. Pal adds thread continuity and CLI bridging that LiteLLM does not provide, but LiteLLM has stronger production-grade features like retries and load balancing.
Can Pal use local models through Ollama?
Yes. Ollama is a first-class provider. Point Pal at your local Ollama instance with OLLAMA_BASE_URL and it will route appropriate tasks to your local models. This is useful for offline work or privacy-sensitive data.
Does Pal work with clients other than Claude Code?
Yes. Any MCP-compatible client works: Cursor, Codex CLI, Zed, Cline, and others. Because MCP is a standardized protocol, Pal exposes the same tools in each client.
Can I cap how much a session spends?
Yes. Set the MAX_COST_PER_SESSION environment variable to hard-stop sessions that exceed the limit. Pal also emits a per-call cost summary so you can track spending in real time.
How does Pal choose a model?
With DEFAULT_MODEL=auto, Pal picks based on task heuristics: Gemini for huge context, Grok for live web facts, Ollama for local code completion, o3 for long reasoning, and Claude Sonnet as the quality baseline.
Source & Thanks
Created by BeehiveInnovations. Licensed under a custom license.
pal-mcp-server — ⭐ 11,300+
Thank you for building a powerful multi-model gateway for the AI developer community.