Scripts · April 12, 2026 · 1 min read

Ollama — Run Large Language Models Locally

Ollama makes it easy to run open-source large language models like Llama 3, Mistral, Gemma, and Qwen on your own machine. A single command downloads and runs any model with optimized inference, GPU acceleration, and a REST API.

Script Depot · Community
Quick Start

Use it first, then decide whether to dig deeper.

This section should tell both users and agents what to copy first, what to install, and where it ends up.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (auto-downloads on first use)
ollama run llama3.1
ollama run mistral
ollama run gemma2

# List available models
ollama list

# Pull a model without running
ollama pull deepseek-coder-v2

Introduction

Ollama is the easiest way to run large language models on your local machine. It packages model weights, configuration, and runtime into a single tool that works like Docker for LLMs — one command to pull and run any model. No Python environments, no dependency conflicts, no cloud API keys needed.

With over 169,000 GitHub stars, Ollama has become the de facto standard for local LLM inference. It supports hundreds of models including Llama 3, Mistral, Gemma, Qwen, DeepSeek, Phi, and CodeLlama, with automatic GPU acceleration on NVIDIA, AMD, and Apple Silicon.

What Ollama Does

Ollama manages the entire lifecycle of running LLMs locally. It downloads quantized model files, optimizes them for your hardware (CPU or GPU), serves them via a local REST API, and provides a chat interface in your terminal. It handles memory management, context windows, and model switching automatically.
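
The lifecycle described above maps directly onto a few CLI subcommands. A sketch, assuming Ollama is already installed and the server is running (these subcommands exist in current releases; output formats vary by version):

```shell
ollama pull llama3.1      # download quantized weights to the local cache
ollama run llama3.1       # load the model into memory and open a chat session
ollama ps                 # show which models are currently loaded
ollama stop llama3.1      # unload a model to free memory
```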

Architecture Overview

[User] --> ollama run llama3.1
                 |
          [Ollama Server]
   runs as a background service
           on port 11434
                 |
     +-----------+-----------+
     |           |           |
[Model       [Inference   [REST API]
 Registry]    Engine]      /api/generate
 pull from    llama.cpp    /api/chat
 ollama.com   optimized    /api/embeddings
     |        for GPU        |
     +-----------+-----------+
                 |
          [Hardware Layer]
     NVIDIA CUDA / AMD ROCm
     Apple Metal / CPU AVX

Self-Hosting & Configuration

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

# Custom Modelfile
cat << EOF > Modelfile
FROM llama3.1
SYSTEM "You are a helpful coding assistant."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF
ollama create my-assistant -f Modelfile

# REST API usage
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello"}]
}'
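
The same endpoint can be called from Python using only the standard library. A minimal sketch, assuming an Ollama server is running on the default port; the `build_chat_payload` helper is purely illustrative:

```python
import json
import urllib.request

def build_chat_payload(model, messages):
    # "stream": False asks the server for a single JSON reply
    # instead of a stream of partial messages.
    return {"model": model, "messages": messages, "stream": False}

def chat(model, messages, host="http://localhost:11434"):
    body = json.dumps(build_chat_payload(model, messages)).encode("utf-8")
    req = urllib.request.Request(
        f"{host}/api/chat",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The non-streaming reply carries the answer under message.content.
        return json.loads(resp.read())["message"]["content"]

# Requires a running server:
# print(chat("llama3.1", [{"role": "user", "content": "Hello"}]))
```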

Key Features

  • One Command Setup — install and run any model with a single command
  • GPU Acceleration — automatic detection for NVIDIA CUDA, AMD ROCm, Apple Metal
  • Model Library — hundreds of models from 1B to 405B parameters
  • REST API — OpenAI-compatible API on localhost:11434
  • Modelfile — customize models with system prompts, parameters, and adapters
  • Concurrent Requests — serve multiple users and models simultaneously
  • Quantization — run large models on consumer hardware via GGUF quantization
  • Vision Models — supports multimodal models like LLaVA and Llama 3.2 Vision
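
To see why quantization is what makes large models fit on consumer hardware: the weight footprint is roughly parameter count times bytes per weight. A back-of-envelope sketch (weights only; the KV cache and runtime overhead add more on top):

```python
def approx_weight_size_gb(params_billions, bits_per_weight):
    """Weights-only memory estimate; ignores KV cache and runtime overhead."""
    # params_billions * 1e9 weights * (bits/8) bytes, divided by 1e9 bytes/GB.
    return params_billions * bits_per_weight / 8

for params, bits, label in [
    (8, 16, "8B at fp16"),
    (8, 4, "8B at 4-bit (q4)"),
    (70, 4, "70B at 4-bit (q4)"),
]:
    print(f"{label}: ~{approx_weight_size_gb(params, bits):.0f} GB")
```

The q4 figures (~4 GB for an 8B model, ~35 GB for a 70B model) line up with the RAM guidance in the FAQ below.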

Comparison with Similar Tools

Feature          Ollama        LM Studio     llama.cpp    vLLM          LocalAI
Ease of Use      Very Easy     Very Easy     Technical    Technical     Moderate
GUI              No (CLI)      Yes           No           No            No
API Server       Built-in      Built-in      Optional     Built-in      Built-in
Model Library    Curated       HuggingFace   Manual       HuggingFace   Multiple
GPU Support      Auto-detect   Auto-detect   Manual       NVIDIA only   Multiple
Docker Support   Yes           No            Community    Yes           Yes
Custom Models    Modelfile     Import        Manual       Config        Config
GitHub Stars     169K          53K           78K          40K           28K

FAQ

Q: What hardware do I need to run LLMs with Ollama? A: For 7B models, 8GB RAM minimum. For 13B models, 16GB RAM. For 70B models, 64GB RAM or a GPU with 48GB VRAM. Apple Silicon Macs with 16GB+ unified memory work well for most models.

Q: How does Ollama compare to cloud APIs like OpenAI? A: Ollama runs 100% locally — no API costs, no data leaving your machine, no rate limits. Trade-offs: you need capable hardware, and local models may be less capable than the latest cloud models for some tasks.

Q: Can I use Ollama with my existing tools? A: Yes. Ollama provides an OpenAI-compatible API, so tools that work with OpenAI (LangChain, AutoGen, Continue, etc.) can point to localhost:11434 instead.

Q: How do I run multiple models simultaneously? A: Ollama automatically manages model loading and unloading based on available memory. You can make requests to different models and Ollama handles context switching.
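
The loading and concurrency behavior can be tuned with environment variables when starting the server. A configuration sketch; these variable names match recent Ollama releases, but verify them against the documentation for your version:

```shell
# Keep up to two models resident in memory at once.
export OLLAMA_MAX_LOADED_MODELS=2
# Serve up to four requests per model in parallel.
export OLLAMA_NUM_PARALLEL=4
ollama serve
```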
