# Ollama — Run Large Language Models Locally

> Ollama makes it easy to run open-source large language models like Llama 3, Mistral, Gemma, and Qwen on your own machine. A single command downloads and runs any model with optimized inference, GPU acceleration, and a REST API.

## Quick Use

```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Run a model (auto-downloads on first use)
ollama run llama3.1
ollama run mistral
ollama run gemma2

# List locally installed models
ollama list

# Pull a model without running it
ollama pull deepseek-coder-v2
```

## Introduction

Ollama is the easiest way to run large language models on your local machine. It packages model weights, configuration, and runtime into a single tool that works like Docker for LLMs: one command to pull and run any model. No Python environments, no dependency conflicts, no cloud API keys needed.

With over 169,000 GitHub stars, Ollama has become the de facto standard for local LLM inference. It supports hundreds of models, including Llama 3, Mistral, Gemma, Qwen, DeepSeek, Phi, and CodeLlama, with automatic GPU acceleration on NVIDIA, AMD, and Apple Silicon.

## What Ollama Does

Ollama manages the entire lifecycle of running LLMs locally. It downloads quantized model files, optimizes them for your hardware (CPU or GPU), serves them via a local REST API, and provides a chat interface in your terminal. It handles memory management, context windows, and model switching automatically.
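The API's streaming endpoints return one JSON object per line (NDJSON), with each text fragment in `message.content` and a final record carrying `"done": true`. A minimal Python sketch of how a client might reassemble such a stream; the sample lines below are illustrative of that shape, not captured server output:

```python
import json

def collect_stream(ndjson_lines):
    """Reassemble a streamed /api/chat response: each line is a JSON
    object carrying a content fragment until the final done record."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        if chunk.get("done"):          # last record signals end of stream
            break
        parts.append(chunk["message"]["content"])
    return "".join(parts)

# Illustrative lines in the shape the API streams back:
sample = [
    '{"model":"llama3.1","message":{"role":"assistant","content":"Hel"},"done":false}',
    '{"model":"llama3.1","message":{"role":"assistant","content":"lo!"},"done":false}',
    '{"model":"llama3.1","done":true}',
]
print(collect_stream(sample))  # Hello!
```

In a real client the lines would come from iterating over the HTTP response body; the parsing logic is the same.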
## Architecture Overview

```
[User] --> ollama run llama3.1
                |
         [Ollama Server]
   Runs as a background service
          on port 11434
                |
     +----------+----------+
     |          |          |
 [Model     [Inference  [REST API]
 Registry]   Engine]    /api/generate
 Pull from   llama.cpp, /api/chat
 ollama.com  optimized  /api/embeddings
             for GPU
     |          |          |
     +----------+----------+
                |
        [Hardware Layer]
    NVIDIA CUDA / AMD ROCm
     Apple Metal / CPU AVX
```

## Self-Hosting & Configuration

```bash
# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 ollama/ollama

# Custom Modelfile
cat << EOF > Modelfile
FROM llama3.1
SYSTEM "You are a helpful coding assistant."
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
EOF
ollama create my-assistant -f Modelfile

# REST API usage
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [{"role": "user", "content": "Hello"}]
}'
```

## Key Features

- **One Command Setup** — install and run any model with a single command
- **GPU Acceleration** — automatic detection for NVIDIA CUDA, AMD ROCm, Apple Metal
- **Model Library** — hundreds of models from 1B to 405B parameters
- **REST API** — OpenAI-compatible API on localhost:11434
- **Modelfile** — customize models with system prompts, parameters, and adapters
- **Concurrent Requests** — serve multiple users and models simultaneously
- **Quantization** — run large models on consumer hardware via GGUF quantization
- **Vision Models** — supports multimodal models like LLaVA and Llama 3.2 Vision

## Comparison with Similar Tools

| Feature | Ollama | LM Studio | llama.cpp | vLLM | LocalAI |
|---|---|---|---|---|---|
| Ease of Use | Very Easy | Very Easy | Technical | Technical | Moderate |
| GUI | No (CLI) | Yes | No | No | No |
| API Server | Built-in | Built-in | Optional | Built-in | Built-in |
| Model Library | Curated | HuggingFace | Manual | HuggingFace | Multiple |
| GPU Support | Auto-detect | Auto-detect | Manual | NVIDIA only | Multiple |
| Docker Support | Yes | No | Community | Yes | Yes |
| Custom Models | Modelfile | Import | Manual | Config | Config |
| GitHub Stars | 169K | 53K | 78K | 40K | 28K |

## FAQ

**Q: What hardware do I need to run LLMs with Ollama?**
A: For 7B models, 8GB RAM minimum. For 13B models, 16GB RAM. For 70B models, 64GB RAM or a GPU with 48GB VRAM. Apple Silicon Macs with 16GB+ unified memory work well for most models.

**Q: How does Ollama compare to cloud APIs like OpenAI?**
A: Ollama runs 100% locally: no API costs, no data leaving your machine, no rate limits. Trade-offs: you need capable hardware, and local models may be less capable than the latest cloud models for some tasks.

**Q: Can I use Ollama with my existing tools?**
A: Yes. Ollama provides an OpenAI-compatible API, so tools that work with OpenAI (LangChain, AutoGen, Continue, etc.) can point to localhost:11434 instead.

**Q: How do I run multiple models simultaneously?**
A: Ollama automatically manages model loading and unloading based on available memory. You can make requests to different models and Ollama handles the switching.

## Sources

- GitHub: https://github.com/ollama/ollama
- Documentation: https://ollama.com
- Model Library: https://ollama.com/library
- License: MIT

---

Source: https://tokrepo.com/en/workflows/0f835fd8-366d-11f1-9bc6-00163e2b0d79
Author: Script Depot
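As the FAQ notes, existing OpenAI-style tooling can simply be pointed at the local server. A stdlib-only Python sketch of building such a request against the `/v1/chat/completions` compatibility endpoint; the model name and prompt are placeholders, and actually sending it requires a running Ollama server:

```python
import json
import urllib.request

# OpenAI-compatible endpoint exposed by the local Ollama server
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

def build_chat_request(model, user_message):
    """Build an OpenAI-style chat completion request aimed at local Ollama."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("llama3.1", "Hello")

# With the server running, send it like any OpenAI request:
#   with urllib.request.urlopen(req) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```

Libraries such as the official OpenAI clients work the same way: set the base URL to `http://localhost:11434/v1` and supply any placeholder API key.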
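The hardware numbers in the FAQ follow from a simple back-of-the-envelope rule: memory needed is roughly parameter count times bytes per quantized weight, plus headroom for the KV cache and buffers. A rough sketch of that estimate; the 4-bit default and the 20% overhead factor are assumptions for illustration, not Ollama's exact accounting:

```python
def approx_model_memory_gb(params_billion, bits_per_weight=4, overhead=1.2):
    """Rough RAM/VRAM needed to load a quantized model:
    parameters x bytes-per-weight, padded ~20% for KV cache and buffers."""
    total_bytes = params_billion * 1e9 * (bits_per_weight / 8) * overhead
    return total_bytes / 2**30  # bytes -> GiB

# A 4-bit 7B model needs roughly 4 GB, leaving headroom within 8 GB RAM;
# a 4-bit 70B model needs roughly 40 GB, hence the 64 GB RAM / 48 GB VRAM guidance.
print(round(approx_model_memory_gb(7), 1))
print(round(approx_model_memory_gb(70), 1))
```

Larger context windows (`num_ctx`) grow the KV cache, so the real footprint rises above this floor for long-context workloads.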