Ollama — Run LLMs Locally with One Command (2026 Guide)
Ollama is the most popular way to run large language models locally. A single CLI downloads quantized models and serves them behind an OpenAI-compatible API — the easiest on-ramp to self-hosted AI in 2026.
Why Ollama
Ollama won on simplicity. ollama run llama3.2 downloads a quantized model, starts a local server, and drops you into a chat — all in one command. Under the hood it’s llama.cpp with polished model management, an OpenAI-compatible HTTP API, and first-class support on macOS, Linux, and Windows. The experience is noticeably smoother than rolling your own llama.cpp setup.
The bet worked. In 2026 Ollama is the default choice for "I want a local LLM on my laptop". Every popular dev tool (Cursor, Claude Code, Zed, Obsidian plugins, many VS Code extensions) supports Ollama as a provider out of the box because the HTTP API is identical to OpenAI’s. You install Ollama, pull a model, point your tool at http://localhost:11434, done.
Where Ollama is not the answer: serving many concurrent users (use vLLM), maximum Apple Silicon throughput (use MLX), or research tooling like LoRA training (use text-generation-webui). For personal and small-team inference, Ollama is almost always the right first pick.
Quick Start — Install, Pull, Use
ollama run will pull the model on first use and drop you into an interactive chat. ollama serve exposes the HTTP API (port 11434 by default). Every major Ollama-compatible client uses the /v1/chat/completions path under that base URL.
# 1. Install (macOS / Linux / Windows)
curl -fsSL https://ollama.com/install.sh | sh
# or: brew install ollama # macOS Homebrew
# Windows: download installer from ollama.com
# 2. Run a model — downloads ~2-5GB the first time
ollama run llama3.2 # Meta Llama 3.2 3B, quantized
ollama run qwen2.5:14b # Alibaba Qwen 2.5 14B
ollama run deepseek-r1 # DeepSeek R1 reasoning model
# 3. Use the OpenAI-compatible API from any client
# The server listens on localhost:11434 after 'ollama serve' (auto on install)
# Python with the OpenAI SDK:
python - <<'PY'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Name one Go stdlib package you underrate."}],
)
print(r.choices[0].message.content)
PY
# Use the same endpoint with Cursor, Claude Code, Zed — configure as an
# OpenAI-compatible provider with base URL http://localhost:11434/v1.
Key Features
One-command install + run
Single binary, single command to try a model. No Python environment, no CUDA fiddling on Linux, no model-conversion scripts. The lowest-friction local LLM experience.
OpenAI-compatible API
Chat completions, streaming, tool calling, and embedding endpoints all match the OpenAI shape. Any OpenAI SDK or tool that accepts a base_url override works with Ollama unchanged.
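Because the endpoint shapes match OpenAI's, any client sends the same JSON body it would send to api.openai.com. A minimal sketch of that request body — no server required to follow along; the model name and base URL are the defaults from the quick start above:

```python
import json

# The OpenAI-style body that Ollama's /v1/chat/completions endpoint accepts.
base_url = "http://localhost:11434/v1"
body = {
    "model": "llama3.2",
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Summarize HTTP/2 in one sentence."},
    ],
    "stream": True,       # Ollama streams OpenAI-style SSE chunks
    "temperature": 0.2,
}
payload = json.dumps(body)
print(f"POST {base_url}/chat/completions")
print(payload[:60] + "...")
```

Swapping the base URL is the only change versus a hosted OpenAI client — the payload itself is untouched.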
Model library
ollama.com/library curates popular models with ready-made quantizations. Llama 3.x, Qwen 2.5, Mistral, Gemma, Phi, DeepSeek, and more — all one command away.
Modelfile system
Create custom models by writing a Modelfile (system prompt, temperature, base model). ollama create mybot -f Modelfile. Makes it easy to share fine-tuned personalities across a team.
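A minimal Modelfile sketch — FROM, PARAMETER, and SYSTEM are standard Modelfile instructions; the reviewer persona here is a hypothetical example:

```
# Modelfile — hypothetical code-review assistant built on llama3.2
FROM llama3.2
PARAMETER temperature 0.3
SYSTEM You are a code-review assistant for the backend team. Be terse.
```

Build and run it with ollama create reviewbot -f Modelfile, then ollama run reviewbot. The resulting model name works anywhere a library model does, including the OpenAI-compatible API.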
Native Apple / CUDA / ROCm
Uses Metal on macOS, CUDA on NVIDIA, ROCm on AMD, CPU everywhere. Chooses the best backend automatically.
Embeddings + multimodal
Also serves embedding models and vision-language models (LLaVA, Qwen-VL, Gemma 3 vision). Unified API, unified model management.
Comparison
| Tool | Install Complexity | API Compatibility | Throughput | Best For |
|---|---|---|---|---|
| Ollama | Very low | OpenAI-compatible (native) | Good (llama.cpp backend) | Desktop + small-team servers |
| LM Studio | Low (GUI) | OpenAI-compatible | Good | Windows/Mac GUI users |
| llama.cpp (server) | Medium | OpenAI-compatible | Good | Full control, portability |
| vLLM | High | OpenAI-compatible | Excellent (GPU) | Production multi-user GPU |
Use Cases
01. Personal developer assistant
Point Cursor/Claude Code/Zed at Ollama for offline code suggestions on a laptop. Privacy, no API bills, good-enough quality for routine tasks.
02. Internal team LLM
Deploy Ollama on a shared GPU server and expose http://server:11434 internally. Small teams (<20 people) can share a single instance with acceptable latency.
03. Dev/staging environments
Same API as OpenAI means you can swap base_url in config to point at Ollama for dev and OpenAI for production — useful for testing without burning API budget.
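Since only the base URL, key, and model name differ between environments, the swap can live in one small config helper. A sketch, assuming a hypothetical `llm_client_config` helper with illustrative model names:

```python
import os

def llm_client_config(env: str) -> dict:
    """Pick an OpenAI-compatible endpoint per environment.
    Hypothetical helper; model names and URLs are illustrative."""
    if env == "production":
        return {
            "base_url": "https://api.openai.com/v1",
            "api_key": os.environ.get("OPENAI_API_KEY", ""),
            "model": "gpt-4o-mini",
        }
    # dev/staging: local Ollama — the api_key is ignored but must be non-empty
    return {
        "base_url": "http://localhost:11434/v1",
        "api_key": "ollama",
        "model": "llama3.2",
    }

print(llm_client_config("dev")["base_url"])
```

Application code constructs its OpenAI SDK client from this dict and never knows which backend it is talking to.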
Pricing & License
Ollama: MIT open source. Free to use commercially. No telemetry by default; explicitly opt-in for usage stats.
Hardware cost: Ollama itself is free. Model quality scales with RAM/VRAM: 7B models run on 8GB machines (4-bit quant), 70B models need roughly 64GB+ RAM or 48GB VRAM. See individual model pages for requirements.
Time cost: first-run downloads are 2-50GB depending on model size. After that, local use is free (unless you count electricity).
Related Assets on TokRepo
Pal MCP Server — Multi-Model AI Gateway for Claude Code
MCP server that lets Claude Code use Gemini, OpenAI, Grok, and Ollama as a unified AI dev team. Features model routing, CLI-to-CLI bridge, and conversation continuity across 7+ providers.
Ollama Model Library — Best AI Models for Local Use
Curated guide to the best models available on Ollama for coding, chat, and reasoning. Compare Llama, Mistral, Gemma, Phi, and Qwen models for local AI development.
Open WebUI — Self-Hosted AI Chat Platform
Feature-rich, offline-capable AI interface for Ollama, OpenAI, and local LLMs. Built-in RAG, voice, model builder. 130K+ stars.
Self-Hosted AI Starter Kit — Local AI with n8n
Docker Compose template by n8n that bootstraps a complete local AI environment with n8n workflow automation, Ollama LLMs, Qdrant vector database, and PostgreSQL. 14,500+ stars.
Frequently Asked Questions
Does Ollama work offline?
Yes — after the initial model download, everything runs locally. No internet needed for inference. Useful for flights, secure environments, and data-sensitive work.
Ollama vs LM Studio?
Both wrap llama.cpp with excellent DX. Ollama is CLI-first with a strong Docker/server story. LM Studio is GUI-first with a built-in model browser. Many users install both. For scripted / automated / team scenarios, Ollama wins. For "my colleague who doesn’t use a terminal", LM Studio wins.
Can Ollama do tool calls / function calling?
Yes — since mid-2024. Tool support varies by model: Llama 3.1/3.2, Qwen 2.5, and Mistral v0.3+ ship fine-tuned tool-call weights. Use the standard OpenAI tools= parameter via the chat completions endpoint.
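The tool schema is the standard OpenAI function-calling shape. A sketch of one tool definition plus a local dispatcher — the function name, arguments, and stubbed result are all illustrative:

```python
import json

# OpenAI-style tool schema, as accepted via the tools= parameter on the
# chat completions endpoint. Function name and fields are hypothetical.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(tool_call: dict) -> str:
    """Route a model-issued tool call to a local function (stubbed here)."""
    args = json.loads(tool_call["function"]["arguments"])
    if tool_call["function"]["name"] == "get_weather":
        return f"22C and clear in {args['city']}"   # stubbed result
    raise ValueError("unknown tool")

# Shape of a tool call as it appears in the assistant's response message:
fake_call = {"function": {"name": "get_weather", "arguments": '{"city": "Oslo"}'}}
print(dispatch(fake_call))
```

Your loop sends the dispatcher's return value back as a "tool" role message so the model can compose the final answer.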
How do I run Ollama in production?
Docker image is official and well-maintained. Expose port 11434 behind a reverse proxy with auth. Use environment variables OLLAMA_HOST and OLLAMA_MODELS for bind address and model cache dir. For multi-user concurrency, limit OLLAMA_NUM_PARALLEL and consider switching to vLLM if you exceed 5-10 concurrent requests.
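A deployment sketch using the official ollama/ollama image — the image name, port, and environment variables are the documented defaults, while the host path and parallelism cap are illustrative choices:

```shell
# Single-GPU Docker deployment sketch. Bind to localhost and put an
# authenticating reverse proxy in front for team access.
docker run -d --gpus=all \
  --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v /srv/ollama:/root/.ollama \
  -e OLLAMA_NUM_PARALLEL=2 \
  ollama/ollama

# Pull models inside the running container:
# docker exec ollama ollama pull llama3.2
```

The -v mount keeps the model cache on the host so upgrading the container does not re-download models.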
Which models are best for coding?
In 2026, Qwen 2.5 Coder 32B and DeepSeek Coder V2 are the top open options; both run comfortably on 24GB VRAM or 32GB Apple Silicon unified memory with 4-bit quantization. For smaller hardware, try Qwen 2.5 Coder 7B or deepseek-r1-distill-qwen-14b.
Can Ollama serve embedding models?
Yes — ollama pull nomic-embed-text or mxbai-embed-large then POST to /api/embed. Same HTTP server, same Modelfile concept, different endpoint.
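Embedding vectors from /api/embed are typically compared with cosine similarity. A self-contained sketch — the vectors below are stand-ins for what the endpoint would return:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity — the usual metric for comparing embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Stand-in vectors; in practice these come back from POST /api/embed
doc_vec = [0.1, 0.9, 0.0]
query_vec = [0.2, 0.8, 0.1]
print(round(cosine(doc_vec, query_vec), 3))
```

Real embedding vectors are hundreds of dimensions long, but the ranking logic is the same: embed the query, embed the documents, sort by similarity.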