Scripts · Mar 29, 2026 · 2 min read

Ollama — Run LLMs Locally

Run large language models locally on your machine. Supports Llama 3, Mistral, Gemma, Phi, and dozens more. One-command install, OpenAI-compatible API.

TL;DR
Ollama downloads and runs open-source LLMs locally with one command, keeping all inference on your hardware.
§01

What it is

Ollama is a tool for running open-source large language models on your local machine. It handles model downloading, quantization, memory management, and serving behind a simple CLI and HTTP API. Run ollama run llama3.1 and you get an interactive chat session; send a request to http://localhost:11434/api/generate and you get a streaming inference API. No cloud accounts, no API keys, no usage fees.

Ollama supports a wide range of models including Llama 3.1, Mistral, Gemma, Code Llama, Phi, and community-contributed models. It runs on macOS, Linux, and Windows, with automatic GPU acceleration on Apple Silicon (Metal), NVIDIA (CUDA), and AMD (ROCm) hardware.

§02

How it saves time or tokens

Every API call to a cloud LLM is metered per token. For development workflows where you make hundreds of test calls per day (prompt iteration, function calling tests, embedding generation), those costs add up. Ollama eliminates per-token costs entirely by running inference on hardware you already own.

The time savings come from zero network latency for local inference, no rate limiting, and no API key management. For privacy-sensitive tasks like processing internal documents or generating code from proprietary codebases, local inference means the data never leaves your machine.

§03

How to use

  1. Install Ollama with one command:

```bash
curl -fsSL https://ollama.com/install.sh | sh
```

On macOS, you can also download the desktop app from ollama.com.
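
To check that the install worked, you can ask the CLI for its version and ping the local server:

```bash
ollama --version              # prints the installed CLI version
curl http://localhost:11434   # returns "Ollama is running" when the server is up
```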

  2. Run a model:

```bash
ollama run llama3.1
```

This downloads the model on first run (typically 4-8 GB) and starts an interactive chat.
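
You can also manage downloads separately from chat sessions, which is handy when scripting:

```bash
ollama pull mistral    # download a model without starting a chat
ollama list            # show installed models and their sizes
ollama rm codellama    # remove a model to free disk space
```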

  3. Use the HTTP API for programmatic access:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quicksort in 3 sentences."
}'
```
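
By default, /api/generate streams the response as newline-delimited JSON objects. To receive one complete JSON object instead, set "stream": false in the request:

```bash
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Explain quicksort in 3 sentences.",
  "stream": false
}'
```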

§04

Example

Using Ollama from Python via the official library (install it with pip install ollama):

```python
import ollama

# Simple completion
response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'What is a closure in JavaScript?'}]
)
print(response['message']['content'])

# Streaming: print tokens as they arrive
for chunk in ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'Write a haiku about code.'}],
    stream=True
):
    print(chunk['message']['content'], end='', flush=True)
```

Popular models and their approximate sizes:

| Model | Parameters | Download Size | Best For |
| --- | --- | --- | --- |
| llama3.1 | 8B | 4.7 GB | General chat and reasoning |
| llama3.1:70b | 70B | 40 GB | Complex reasoning (needs 48GB+ RAM) |
| mistral | 7B | 4.1 GB | Fast general-purpose inference |
| codellama | 7B | 3.8 GB | Code generation and completion |
| gemma2 | 9B | 5.4 GB | Google's open model |
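
Model tags also select specific sizes and quantization levels. The exact tags vary by model (check the listing on ollama.com); the examples below are illustrative of the pattern:

```bash
ollama run llama3.1:8b                # explicit parameter size
ollama run llama3.1:8b-instruct-q8_0  # a higher-precision quantization variant
```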
§05

Related on TokRepo

  • Local LLM Tools — Compare Ollama with other local inference options like LM Studio, llama.cpp, and vLLM.
  • Local LLM: Ollama — Deep dive into Ollama workflows and configurations on TokRepo.
§06

Common pitfalls

  • Running a 70B model on a machine with 16GB RAM. Large models require memory roughly equal to their download size. If you exceed available RAM, Ollama will use swap, which makes inference extremely slow. Start with 7B-8B models and scale up based on your hardware.
  • Forgetting that Ollama serves on port 11434 by default. If another service uses that port, Ollama will fail to start. Set the OLLAMA_HOST environment variable to change the bind address and port (see the example after this list).
  • Expecting cloud-level speed from consumer hardware. A 7B model on an M1 MacBook generates roughly 30-50 tokens per second. A 70B model on the same hardware generates 3-5 tokens per second. Set expectations based on your hardware.
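
For example, to move the server to a free port (11500 here is arbitrary):

```bash
# Start the server on a non-default port
OLLAMA_HOST=127.0.0.1:11500 ollama serve

# Point the CLI (and API clients) at the same address
OLLAMA_HOST=127.0.0.1:11500 ollama run llama3.1
```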

Frequently Asked Questions

What hardware do I need to run Ollama?

For 7B-8B parameter models, you need at least 8GB of RAM and a reasonably modern CPU. GPU acceleration significantly improves speed: Apple Silicon Macs use Metal, NVIDIA GPUs use CUDA, and AMD GPUs use ROCm. For 70B models, you need 48GB or more of RAM. Ollama handles quantization automatically, so you do not need to manually convert models to lower precision.

How does Ollama compare to using a cloud LLM API?

Ollama runs entirely on your local hardware with zero per-token cost and full data privacy. Cloud APIs offer larger models, faster inference on high-end hardware, and no local resource usage. Choose Ollama for development iteration, privacy-sensitive tasks, and offline use. Choose cloud APIs for production workloads, the largest models, and when you need guaranteed uptime.

Can I use Ollama with my existing AI coding tools?

Yes. Many AI coding tools support Ollama as a backend, including Continue (VS Code extension), Open Interpreter, and LangChain. Any tool that can call an OpenAI-compatible API can point at Ollama's endpoint at http://localhost:11434/v1, which Ollama exposes by default. This lets you substitute a local model for cloud API calls.
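
A minimal sketch using the official openai Python package (the api_key value is a required placeholder; Ollama ignores it):

```python
from openai import OpenAI

# Ollama exposes an OpenAI-compatible API under /v1 on its default port
client = OpenAI(base_url='http://localhost:11434/v1', api_key='ollama')

response = client.chat.completions.create(
    model='llama3.1',
    messages=[{'role': 'user', 'content': 'What is a closure in JavaScript?'}],
)
print(response.choices[0].message.content)
```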

How do I add custom or fine-tuned models to Ollama?+

Create a Modelfile that specifies the base model and any customizations (system prompt, parameters, template). Run ollama create my-model -f Modelfile to register it. For GGUF-format models from sources like Hugging Face, reference the GGUF file path in the Modelfile FROM line. Ollama handles quantization and memory mapping automatically.
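
A minimal sketch (the model name, system prompt, and parameter value are illustrative):

```bash
# Write a Modelfile that customizes llama3.1 with a system prompt
cat > Modelfile <<'EOF'
FROM llama3.1
SYSTEM """You are a concise code reviewer. Answer in bullet points."""
PARAMETER temperature 0.2
EOF

ollama create code-reviewer -f Modelfile   # register the custom model
ollama run code-reviewer                   # chat with it like any other model
```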

Does Ollama support function calling and tool use?+

Yes. Ollama supports tool calling with models that have been trained for it, such as Llama 3.1. You pass a tools array in the API request describing available functions, and the model can return tool call requests in its response. The format follows the OpenAI-compatible schema, making it straightforward to integrate with existing tool-use frameworks.
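
A sketch of the request shape using the Python library (the get_weather tool is hypothetical):

```python
import ollama

# Describe the tool in the OpenAI-style schema (get_weather is hypothetical)
tools = [{
    'type': 'function',
    'function': {
        'name': 'get_weather',
        'description': 'Get the current weather for a city',
        'parameters': {
            'type': 'object',
            'properties': {'city': {'type': 'string'}},
            'required': ['city'],
        },
    },
}]

response = ollama.chat(
    model='llama3.1',
    messages=[{'role': 'user', 'content': "What's the weather in Paris?"}],
    tools=tools,
)

# The model either answers directly or returns one or more tool call requests
for call in response['message'].get('tool_calls') or []:
    print(call['function']['name'], call['function']['arguments'])
```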


Source & Thanks

Created by Ollama. Licensed under MIT. ollama/ollama — 120K+ GitHub stars
