
Ollama — Run LLMs Locally with One Command (2026 Guide)

Ollama is the most popular way to run large language models locally. A single CLI downloads pre-quantized models and serves them behind an OpenAI-compatible API — the easiest on-ramp to self-hosted AI in 2026.

Why Ollama

Ollama won on simplicity. ollama run llama3.2 downloads a quantized model, starts a local server, and drops you into a chat — all in one command. Under the hood it’s llama.cpp with polished model management, an OpenAI-compatible HTTP API, and first-class support on macOS, Linux, and Windows. The experience is noticeably smoother than rolling your own llama.cpp setup.

The bet worked. In 2026 Ollama is the default choice for "I want a local LLM on my laptop". Every popular dev tool (Cursor, Claude Code, Zed, Obsidian plugins, many VS Code extensions) supports Ollama as a provider out of the box because the HTTP API is identical to OpenAI’s. You install Ollama, pull a model, point your tool at http://localhost:11434, done.

Where Ollama is not the answer: serving many concurrent users (use vLLM), maximum Apple Silicon throughput (use MLX), or research tooling like LoRA training (use text-generation-webui). For personal and small-team inference, Ollama is almost always the right first pick.

Quick Start — Install, Pull, Use

ollama run will pull the model on first use and drop you into an interactive chat. ollama serve exposes the HTTP API (port 11434 by default). Every major Ollama-compatible client uses the /v1/chat/completions path under that base URL.

# 1. Install (macOS / Linux / Windows)
curl -fsSL https://ollama.com/install.sh | sh
# or: brew install ollama   # macOS homebrew
# Windows: download installer from ollama.com

# 2. Run a model — downloads ~2-5GB the first time
ollama run llama3.2        # Meta Llama 3.2 3B, quantized
ollama run qwen2.5:14b     # Alibaba Qwen 2.5 14B
ollama run deepseek-r1     # DeepSeek R1 reasoning model

# 3. Use the OpenAI-compatible API from any client
# The server listens on localhost:11434 after 'ollama serve' (auto on install)

# Python with the OpenAI SDK:
python - <<'PY'
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
r = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role":"user","content":"Name one Go stdlib package you underrate."}],
)
print(r.choices[0].message.content)
PY

# Use the same endpoint with Cursor, Claude Code, Zed — configure as an
# OpenAI-compatible provider with base URL http://localhost:11434/v1.
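Besides the /v1 OpenAI-compatible routes, the same server exposes Ollama's native API on the same port. A minimal sketch of a request body for the native /api/chat endpoint (the prompt text is illustrative; the curl call assumes a running server, so it is shown commented):

```shell
# Build a request body for Ollama's native /api/chat endpoint.
# With "stream": false the server returns one JSON object instead of chunks.
cat > chat.json <<'EOF'
{
  "model": "llama3.2",
  "messages": [{"role": "user", "content": "Say hi in five words."}],
  "stream": false
}
EOF

# Send it while `ollama serve` is running:
# curl -s http://localhost:11434/api/chat -d @chat.json
cat chat.json
```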

Key Features

One-command install + run

Single binary, single command to try a model. No Python environment, no CUDA fiddling on Linux, no model-conversion scripts. The lowest-friction local LLM experience.

OpenAI-compatible API

Chat completions, streaming, tool calling, and embedding endpoints all match the OpenAI shape. Any OpenAI SDK or tool that accepts a base_url override works with Ollama unchanged.
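Because the surface matches OpenAI's, you don't even need an SDK. A stdlib-only sketch against /v1/chat/completions — the helper names here are our own, and the actual request only succeeds when `ollama serve` is up:

```python
import json
from urllib import request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible root

def chat_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Build an OpenAI-shape chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def chat(model: str, prompt: str) -> str:
    """POST to /v1/chat/completions; requires a running `ollama serve`."""
    req = request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(chat_payload(model, prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer ollama",  # any token works locally
        },
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    try:
        print(chat("llama3.2", "One-line haiku about ports."))
    except OSError:
        print("Ollama server not running on :11434")
```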

Model library

ollama.com/library curates popular models with ready-made quantizations. Llama 3.x, Qwen 2.5, Mistral, Gemma, Phi, DeepSeek, and more — all one command away.

Modelfile system

Create custom model variants by writing a Modelfile (base model, system prompt, sampling parameters), then build them with ollama create mybot -f Modelfile. A Modelfile changes prompts and parameters, not weights — an easy way to share consistent assistant personas across a team.
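A minimal sketch of such a Modelfile — FROM, PARAMETER, and SYSTEM are real Modelfile instructions; the bot name and prompt text are illustrative:

```shell
# Write a minimal Modelfile: base model, sampling parameter, system prompt.
cat > Modelfile <<'EOF'
FROM llama3.2
PARAMETER temperature 0.3
SYSTEM """You are a terse code-review assistant. Answer in bullet points."""
EOF

# Build and run it (requires Ollama installed):
# ollama create mybot -f Modelfile
# ollama run mybot
cat Modelfile
```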

Native Apple / CUDA / ROCm

Uses Metal on macOS, CUDA on NVIDIA, ROCm on AMD, CPU everywhere. Chooses the best backend automatically.

Embeddings + multimodal

Also serves embedding models and vision-language models (LLaVA, Qwen-VL, Gemma 3 vision). Unified API, unified model management.
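A sketch of the embedding flow: build the /api/embed request body and compare results with cosine similarity. The helper names are our own, and the network call assumes a running server with an embedding model pulled:

```python
import json
import math
from urllib import request

def embed_payload(model: str, texts: list[str]) -> dict:
    """Request body for Ollama's /api/embed endpoint (batched input)."""
    return {"model": model, "input": texts}

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def embed(model: str, texts: list[str]) -> list[list[float]]:
    """Requires `ollama serve` and e.g. `ollama pull nomic-embed-text`."""
    req = request.Request(
        "http://localhost:11434/api/embed",
        data=json.dumps(embed_payload(model, texts)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]
```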

Comparison

| Tool | Install Complexity | API Compatibility | Throughput | Best For |
|---|---|---|---|---|
| Ollama (this page) | Very low | OpenAI-compatible (native) | Good (llama.cpp backend) | Desktop + small-team servers |
| LM Studio | Low (GUI) | OpenAI-compatible | Good | Windows/Mac GUI users |
| llama.cpp (server) | Medium | OpenAI-compatible | Good | Full control, portability |
| vLLM | High | OpenAI-compatible | Excellent (GPU) | Production multi-user GPU |

Use Cases

01. Personal developer assistant

Point Cursor/Claude Code/Zed at Ollama for offline code suggestions on a laptop. Privacy, no API bills, good-enough quality for routine tasks.

02. Internal team LLM

Deploy Ollama on a shared GPU server and expose http://server:11434 internally. Small teams (<20 people) can share a single instance with acceptable latency.

03. Dev/staging environments

Same API as OpenAI means you can swap base_url in config to point at Ollama for dev and OpenAI for production — useful for testing without burning API budget.
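One way to sketch that swap is an environment-driven config picker — the LLM_PROVIDER variable name and the model choices below are illustrative, not Ollama conventions:

```python
import os

def llm_config() -> dict:
    """Pick provider settings from the environment; client code stays the same."""
    if os.environ.get("LLM_PROVIDER", "ollama") == "ollama":
        return {
            "base_url": "http://localhost:11434/v1",
            "api_key": "ollama",  # Ollama ignores the key's value
            "model": "llama3.2",
        }
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": os.environ["OPENAI_API_KEY"],
        "model": "gpt-4o-mini",
    }
```

Then construct your client from `llm_config()` in one place; dev boxes default to Ollama, and production sets LLM_PROVIDER=openai plus a real key.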

Pricing & License

Ollama: MIT-licensed open source, free for commercial use. No telemetry by default; usage stats are opt-in.

Hardware cost: Ollama itself is free. Model quality scales with RAM/VRAM: 7B models run on 8GB machines (4-bit quant), 70B need 32GB+ RAM or 48GB VRAM. See individual model pages for requirements.

Time cost: first-run downloads are 2-50GB depending on model size. After that, local use is free (unless you count electricity).

Frequently Asked Questions

Does Ollama work offline?

Yes — after the initial model download, everything runs locally. No internet needed for inference. Useful for flights, secure environments, and data-sensitive work.

Ollama vs LM Studio?

Both wrap llama.cpp with excellent DX. Ollama is CLI-first with a strong Docker/server story. LM Studio is GUI-first with a built-in model browser. Many users install both. For scripted / automated / team scenarios, Ollama wins. For "my colleague who doesn’t use a terminal", LM Studio wins.

Can Ollama do tool calls / function calling?

Yes — since v0.4. Tool support varies by model. Llama 3.1/3.2, Qwen 2.5, and Mistral v0.3+ ship fine-tuned tool-call weights. Use the standard OpenAI tools= parameter via the chat completions endpoint.
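A sketch of the tool schema such models accept — the same JSON shape the OpenAI SDK's tools= parameter expects (get_weather is a hypothetical local function):

```python
def weather_tool() -> dict:
    """OpenAI-style tool definition, passed as tools=[weather_tool()]."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical function you implement locally
            "description": "Current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }

# Flow: the model replies with tool_calls, you execute the named function
# locally, then send the result back as a {"role": "tool"} message.
```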

How do I run Ollama in production?

The Docker image is official and well-maintained. Expose port 11434 behind a reverse proxy with auth. Use the OLLAMA_HOST and OLLAMA_MODELS environment variables to set the bind address and model cache directory. For multi-user concurrency, tune OLLAMA_NUM_PARALLEL, and consider switching to vLLM once you exceed 5-10 concurrent requests.
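A sketch of a launch script along those lines, using the official ollama/ollama Docker image — written to a file here rather than executed, and the parallelism value is an illustrative starting point:

```shell
# Write a small launch script for a shared server.
cat > run-ollama.sh <<'EOF'
#!/bin/sh
# Bind the host port to loopback only; put a reverse proxy with auth in front.
# Add --gpus=all for NVIDIA GPUs (requires the NVIDIA Container Toolkit).
docker run -d --name ollama \
  -p 127.0.0.1:11434:11434 \
  -v ollama:/root/.ollama \
  -e OLLAMA_NUM_PARALLEL=4 \
  ollama/ollama
EOF
chmod +x run-ollama.sh
cat run-ollama.sh
```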

Which models are best for coding?

In 2026, Qwen 2.5 Coder 32B and DeepSeek Coder V2 are the top open options; both run comfortably on 24GB VRAM or 32GB Apple Silicon unified memory with 4-bit quantization. For smaller hardware, try Qwen 2.5 Coder 7B or deepseek-r1-distill-qwen-14b.

Can Ollama serve embedding models?

Yes — ollama pull nomic-embed-text or mxbai-embed-large then POST to /api/embed. Same HTTP server, same Modelfile concept, different endpoint.
