Why Choose It
MLX is Apple’s answer to PyTorch and JAX, built from scratch for Apple Silicon. Where other frameworks treat Apple GPUs as peripheral targets, MLX is designed around unified memory — the CPU and GPU share the same physical memory, so there’s no copy overhead between them. For inference that’s a meaningful performance win.
The practical consequence: on M3 Max or M4 Pro, MLX-LM runs Llama 3.3 70B at 25-35 tokens/sec where llama.cpp Metal tops out around 20-25 tokens/sec on the same hardware. For users with serious Apple Silicon (especially M3/M4 Max or Ultra with 64GB+ unified memory), MLX is the difference between "usable" and "pleasant" for large-model local inference.
The cost: the ecosystem is smaller than llama.cpp’s. Fewer pre-quantized models, fewer integrations, and it’s Apple-only. The mlx-community HuggingFace org publishes most popular models in MLX format; if a model you want isn’t there, you can convert it with one command. For Apple-only teams who care about speed, MLX is worth the ecosystem trade.
Quick Start — MLX-LM Chat and Server
mlx_lm.server exposes OpenAI-compatible chat/completions/embeddings endpoints. mlx_lm.convert pulls any HF model and writes an MLX version (optionally quantized). The mlx-community org already publishes popular models pre-converted — start there.
# 1. Install MLX-LM (Python 3.9+, Apple Silicon only)
pip install mlx-lm
# 2. Run a chat completion from the CLI
# Models pulled from mlx-community HuggingFace org on first use.
mlx_lm.generate \
--model mlx-community/Llama-3.2-3B-Instruct-4bit \
--prompt "Name the three M-series chip families." \
--max-tokens 200
# 3. Start an OpenAI-compatible server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit \
--host 0.0.0.0 --port 8080
# 4. Call the server with any OpenAI SDK
python - <<'PY'
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8080/v1", api_key="mlx")
r = c.chat.completions.create(
model="default",
messages=[{"role":"user","content":"What is unified memory?"}],
)
print(r.choices[0].message.content)
PY
# 5. Convert any HuggingFace model to MLX format with one command
mlx_lm.convert --hf-path Qwen/Qwen2.5-7B-Instruct -q  # -q = 4-bit quantize
Core Capabilities
Unified memory native
Tensors live in memory shared between CPU and GPU. No copies between host and device. Cuts memory use and latency for inference.
Lazy evaluation + JIT
Operations are lazy; MLX builds a graph and compiles at evaluation time. Lets the framework fuse ops for better performance without manual optimization.
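The lazy-graph idea can be illustrated with a toy sketch in plain Python. This is a conceptual illustration only, not MLX’s implementation: in real MLX you build expressions with mlx.core arrays and nothing computes until mx.eval() is called or a value is inspected.

```python
# Toy lazy-evaluation graph: operators build nodes; no arithmetic
# runs until eval_node() walks the graph. A real framework would
# fuse and compile ops at this point instead of interpreting them.
class Lazy:
    def __init__(self, fn, *parents):
        self.fn, self.parents = fn, parents
        self._value = None

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)

def const(x):
    return Lazy(lambda: x)

def eval_node(node):
    # Evaluate parents first, memoize the result.
    if node._value is None:
        node._value = node.fn(*(eval_node(p) for p in node.parents))
    return node._value

a, b = const(2.0), const(3.0)
c = (a + b) * a        # builds a 3-node graph; nothing has run yet
print(eval_node(c))    # 10.0, computed only now
```

The payoff of deferring work like this is that the framework sees the whole expression before committing to a kernel schedule, which is where MLX’s op fusion happens.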
PyTorch-like API
mlx.core and mlx.nn mirror PyTorch’s tensor and module APIs closely. Porting a PyTorch model is usually a few find/replace operations.
MLX-LM package
Dedicated library for LLM inference, training, fine-tuning (LoRA/QLoRA), and an OpenAI-compatible server. Covers the common workflows end-to-end on Apple Silicon.
Quantization support
4-bit and 8-bit quantization with minimal quality loss. mlx_lm.convert -q handles conversion from safetensors to quantized MLX format automatically.
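Where the savings come from: each weight is stored as a small integer plus per-group parameters. A minimal sketch of the idea (affine per-group mapping to 16 levels; MLX’s actual scheme also packs codes into words and chooses group sizes, so treat this as illustrative, not MLX’s exact format):

```python
# Sketch of 4-bit per-group quantization: map floats to ints 0..15
# using a per-group scale and minimum, then reconstruct.
def quantize_group(weights, bits=4):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2**bits - 1) or 1.0   # avoid div-by-zero
    q = [round((w - lo) / scale) for w in weights]  # ints in [0, 15]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    return [lo + scale * v for v in q]

group = [0.12, -0.40, 0.33, 0.07, -0.21, 0.50, -0.05, 0.18]
q, scale, lo = quantize_group(group)
restored = dequantize_group(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(q)          # the 4-bit codes
print(max_err)    # rounding error, bounded by scale / 2
```

Eight floats become eight 4-bit codes plus two small constants: roughly a 4x shrink versus fp16, with reconstruction error capped at half a quantization step.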
Multimodal models
MLX-VLM library extends MLX-LM to vision-language models (LLaVA, Qwen-VL, Gemma vision). Same ergonomics, same Apple Silicon performance.
Comparison
| Tool | Hardware | Peak Speed | Ecosystem | Best For |
|---|---|---|---|---|
| MLX | Apple Silicon only | Fastest on Apple Silicon | Small but active | Mac users, Apple-only teams |
| llama.cpp Metal | Apple Silicon + Intel Mac + CPU | Fast | Huge | Mac + cross-platform |
| Ollama | Cross-platform | Good (llama.cpp backend) | Very large | Developer ergonomics |
| PyTorch MPS | Apple Silicon | Medium | Huge (via PyTorch) | Research / training |
Real-World Use Cases
01. Max-performance Mac inference
M3 Max / M4 Max / M Ultra users running 30-70B models locally. MLX gets 20-40% more tokens/sec than llama.cpp on the same hardware — noticeable in interactive use.
02. LoRA fine-tuning on a MacBook
mlx_lm.lora lets you fine-tune small LoRAs on 32GB+ Macs. Not as fast as a dedicated GPU rig, but accessible — no cloud bill, no Linux machine.
03. Apple-first AI research
Researchers who live on Mac and want a native, low-overhead framework. MLX’s PyTorch-like API eases the transition.
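A sketch of the LoRA workflow from use case 02, assuming a recent mlx-lm release (verify flag names with mlx_lm.lora --help before relying on them); ./my_data is a placeholder directory expected to hold train.jsonl and valid.jsonl:

```shell
# LoRA fine-tune on-device (requires Apple Silicon with mlx-lm installed).
mlx_lm.lora \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --train \
  --data ./my_data \
  --iters 600 \
  --batch-size 4
```

The resulting adapter weights can then be loaded at inference time or fused back into the base model for serving.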
Pricing & Licensing
MLX / MLX-LM: MIT open source, maintained by Apple’s ML team. Free.
Hardware: only useful on Apple Silicon (M1, M2, M3, M4). The more unified memory, the larger the models you can run — a 64GB MacBook comfortably handles 70B 4-bit quants.
Ecosystem: smaller than llama.cpp. Models published via mlx-community on HuggingFace; convert your own from PyTorch weights with mlx_lm.convert.
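A back-of-envelope sizing check you can run before downloading a model. The 1.2 overhead factor (KV cache, activations, runtime) is an assumption for illustration, not a measured number:

```python
def est_memory_gb(params_b, bits, overhead=1.2):
    """Rough resident-memory estimate for a quantized model.

    params_b: parameter count in billions
    bits:     bits per weight (4 for the -q default, 8, or 16)
    overhead: fudge factor for KV cache, activations, runtime
    """
    return params_b * bits / 8 * overhead

for params, bits in [(3, 4), (7, 4), (70, 4), (70, 8)]:
    print(f"{params}B @ {bits}-bit: ~{est_memory_gb(params, bits):.0f} GB")
# 70B at 4-bit lands around 42 GB, which fits a 64GB Mac but not a 32GB one.
```

This is why the 64GB figure above is the practical floor for 70B 4-bit quants: the weights alone need ~35 GB before any runtime overhead.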
FAQ
MLX vs llama.cpp on Mac?
MLX is usually 20-40% faster on Apple Silicon for the same model, especially large ones. llama.cpp has a larger ecosystem and cross-platform portability. If you live on Mac and care about speed, MLX wins; if you need the same tooling across Mac/Linux/Windows, llama.cpp wins.
Does MLX work on Intel Macs?
No — Apple Silicon only (M1 and newer). For Intel Mac, use llama.cpp with Metal or CPU backends.
Can I use MLX models with Ollama or LM Studio?
LM Studio has experimental MLX support (v0.3+). Ollama does not ship MLX natively as of 2026 — it stays llama.cpp-based. To expose MLX via Ollama-compatible API, run mlx_lm.server and point clients at it directly.
How does MLX compare to Apple’s Core ML?
Core ML is Apple’s production ML inference runtime (shipped with iOS/macOS apps). MLX is more experimental and researcher-facing — PyTorch-like, flexible, with full training support. For deploying an LLM in a production Mac app, Core ML + Apple Intelligence is more typical; for interactive inference and fine-tuning, MLX is the tool.
Is MLX production-ready?
For inference: yes, widely used. For training/fine-tuning: usable but less mature than PyTorch. API is stable; expect faster iteration than PyTorch since Apple controls the roadmap and ships frequently.
Where do I find MLX models?
huggingface.co/mlx-community — community-maintained MLX conversions of popular open LLMs. Most major models (Llama 3.x, Qwen 2.5, Mistral, Gemma, DeepSeek, Phi) have MLX versions within days of release.