# Llamafile — Run AI Models as Single Executables

> Package and run LLMs as single portable executables. Llamafile bundles model weights with llama.cpp into one file that runs on any OS without installation.

## Quick Use

```bash
# Download a llamafile (model + runtime in one file)
curl -LO https://huggingface.co/Mozilla/llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile
# Opens a browser at http://localhost:8080 — ready to chat
```

## What is Llamafile?

Llamafile packages LLMs into single executable files that run on any operating system. Built on llama.cpp and Cosmopolitan Libc, a llamafile is one file containing both the model weights and the inference engine. Download it, make it executable, run it — no Python, no Docker, no dependencies. It works on Windows, macOS, Linux, FreeBSD, and even OpenBSD.

**Answer-Ready**: Llamafile packages LLMs into single portable executables. One file runs on any OS — no Python, no Docker, no dependencies. Built by Mozilla on llama.cpp + Cosmopolitan Libc. Includes a web UI and an OpenAI-compatible API. 22k+ GitHub stars.

**Best for**: Developers who want zero-setup local AI inference.

**Works with**: Any OpenAI-compatible tool, Claude Code (as a local backend).

**Setup time**: Under 1 minute.

## Core Features

### 1. Zero Dependencies

```bash
# That's it. No pip, no conda, no brew.
./mistral-7b.llamafile --server --port 8080
```

### 2. OpenAI-Compatible API

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
response = client.chat.completions.create(
    model="local",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```

### 3. Build Your Own Llamafile

```bash
# Package any GGUF model into a llamafile: start from the bare llamafile
# runtime binary, then embed the weights with the bundled zipalign tool
cp llamafile my-model.llamafile
zipalign -j0 my-model.llamafile my-model.gguf
```

### 4. GPU Acceleration
| Platform | Acceleration |
|----------|--------------|
| NVIDIA | CUDA (auto-detected) |
| Apple Silicon | Metal (auto-detected) |
| AMD | ROCm support |
| CPU | AVX/AVX2/AVX-512 |

## Llamafile vs Alternatives

| Feature | Llamafile | Ollama | Jan | LM Studio |
|---------|-----------|--------|-----|-----------|
| Single file | Yes | No (service) | No (app) | No (app) |
| No dependencies | Yes | Docker/binary | Electron | Electron |
| Cross-OS portable | Yes (same file) | Per-OS binary | Per-OS app | Per-OS app |
| Web UI included | Yes | No | Yes | Yes |
| API | OpenAI-compat | OpenAI-compat | OpenAI-compat | OpenAI-compat |

## FAQ

**Q: How big are llamafiles?**
A: About the same as the model weights — a 7B Q4 model is ~4GB. The runtime adds <10MB of overhead.

**Q: Can I use GPU acceleration?**
A: Yes, CUDA and Metal are auto-detected. Pass `--n-gpu-layers 999` to offload all layers.

**Q: Who maintains it?**
A: Mozilla's Innovation team; the project was built by Justine Tunney (creator of Cosmopolitan Libc).

## Source & Thanks

> Created by [Mozilla](https://github.com/Mozilla-Ocho). Licensed under Apache 2.0.
>
> [Mozilla-Ocho/llamafile](https://github.com/Mozilla-Ocho/llamafile) — 22k+ stars

---

Source: https://tokrepo.com/en/workflows/83ea12ae-8576-474f-b1ec-4ddbe0dd1804
Author: AI Open Source
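As a supplement to the OpenAI-compatible API example above, here is a minimal sketch of pulling the assistant reply out of a chat-completions response body without the `openai` client library. The payload below is fabricated for illustration; the field names follow the OpenAI chat-completions schema, which llamafile's server mirrors.

```python
import json

# Fabricated response body in the OpenAI chat-completions schema that the
# llamafile server mirrors; real servers also return id, model, and
# timing fields alongside these.
sample = json.dumps({
    "choices": [
        {"index": 0, "message": {"role": "assistant", "content": "Hello!"}}
    ],
    "usage": {"prompt_tokens": 5, "completion_tokens": 2, "total_tokens": 7},
})

def extract_reply(raw: str) -> str:
    """Pull the assistant text out of a chat-completions response body."""
    data = json.loads(raw)
    return data["choices"][0]["message"]["content"]

print(extract_reply(sample))  # Hello!
```

Because the schema is the same one every OpenAI-compatible tool speaks, this parsing works unchanged whether the backend is a llamafile, Ollama, or a hosted endpoint.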