Why Choose It
MLX is Apple’s answer to PyTorch and JAX, built from scratch for Apple Silicon. Where other frameworks treat Apple GPUs as peripheral targets, MLX is designed around unified memory — the CPU and GPU share the same physical memory, so there’s no copy overhead between them. For inference that’s a meaningful performance win.
The practical consequence: on M3 Max or M4 Pro, MLX-LM runs Llama 3.3 70B at 25-35 tokens/sec where llama.cpp Metal tops out around 20-25 tokens/sec on the same hardware. For users with serious Apple Silicon (especially M3/M4 Max or Ultra with 64GB+ unified memory), MLX is the difference between "usable" and "pleasant" for large-model local inference.
The cost: the ecosystem is smaller than llama.cpp’s. Fewer pre-quantized models, fewer integrations, and it’s Apple-only. The mlx-community HuggingFace org publishes most popular models in MLX format; if a model you want isn’t there, you can convert it with one command. For Apple-only teams who care about speed, MLX is worth the ecosystem trade.
Quick Start — MLX-LM Chat and Server
mlx_lm.server exposes OpenAI-compatible chat/completions/embeddings endpoints. mlx_lm.convert pulls any HF model and writes an MLX version (optionally quantized). The mlx-community org already publishes popular models pre-converted — start there.
# 1. Install MLX-LM (Python 3.9+, Apple Silicon only)
pip install mlx-lm
# 2. Run a chat completion from the CLI
# Models pulled from mlx-community HuggingFace org on first use.
mlx_lm.generate \
--model mlx-community/Llama-3.2-3B-Instruct-4bit \
--prompt "Name the three M-series chip families." \
--max-tokens 200
# 3. Start an OpenAI-compatible server
mlx_lm.server --model mlx-community/Llama-3.2-3B-Instruct-4bit \
--host 0.0.0.0 --port 8080
# 4. Call the server with any OpenAI SDK
python - <<'PY'
from openai import OpenAI
c = OpenAI(base_url="http://localhost:8080/v1", api_key="mlx")
r = c.chat.completions.create(
model="default",
messages=[{"role":"user","content":"What is unified memory?"}],
)
print(r.choices[0].message.content)
PY
# 5. Convert any HuggingFace model to MLX format with one command
mlx_lm.convert --hf-path Qwen/Qwen2.5-7B-Instruct -q  # -q = 4-bit quantize
Core Capabilities
Unified memory native
Tensors live in memory shared between CPU and GPU. No copies between host and device. Cuts memory use and latency for inference.
Lazy evaluation + JIT
Operations are lazy; MLX builds a graph and compiles at evaluation time. Lets the framework fuse ops for better performance without manual optimization.
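The lazy-graph idea can be illustrated with a toy sketch in plain Python. This is a conceptual illustration only, not MLX’s implementation: in real MLX you build expressions with mlx.core arrays and nothing computes until mx.eval() is called or a value is inspected.

```python
# Toy lazy-evaluation graph: operators build nodes; no arithmetic
# runs until eval_node() walks the graph. A real framework would
# fuse and compile ops at this point instead of interpreting them.
class Lazy:
    def __init__(self, fn, *parents):
        self.fn, self.parents = fn, parents
        self._value = None

    def __add__(self, other):
        return Lazy(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Lazy(lambda a, b: a * b, self, other)

def const(x):
    return Lazy(lambda: x)

def eval_node(node):
    # Evaluate parents first, memoize the result.
    if node._value is None:
        node._value = node.fn(*(eval_node(p) for p in node.parents))
    return node._value

a, b = const(2.0), const(3.0)
c = (a + b) * a        # builds a 3-node graph; nothing has run yet
print(eval_node(c))    # 10.0, computed only now
```

The payoff of deferring work like this is that the framework sees the whole expression before committing to a kernel schedule, which is where MLX’s op fusion happens.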
PyTorch-like API
mlx.core and mlx.nn mirror PyTorch’s tensor and module APIs closely. Porting a PyTorch model is usually a few find/replace operations.
MLX-LM package
Dedicated library for LLM inference, training, fine-tuning (LoRA/QLoRA), and an OpenAI-compatible server. Covers the common workflows end-to-end on Apple Silicon.
Quantization support
4-bit and 8-bit quantization with minimal quality loss. mlx_lm.convert -q handles conversion from safetensors to quantized MLX format automatically.
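Where the savings come from: each weight is stored as a small integer plus per-group parameters. A minimal sketch of the idea (affine per-group mapping to 16 levels; MLX’s actual scheme also packs codes into words and chooses group sizes, so treat this as illustrative, not MLX’s exact format):

```python
# Sketch of 4-bit per-group quantization: map floats to ints 0..15
# using a per-group scale and minimum, then reconstruct.
def quantize_group(weights, bits=4):
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / (2**bits - 1) or 1.0   # avoid div-by-zero
    q = [round((w - lo) / scale) for w in weights]  # ints in [0, 15]
    return q, scale, lo

def dequantize_group(q, scale, lo):
    return [lo + scale * v for v in q]

group = [0.12, -0.40, 0.33, 0.07, -0.21, 0.50, -0.05, 0.18]
q, scale, lo = quantize_group(group)
restored = dequantize_group(q, scale, lo)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(q)          # the 4-bit codes
print(max_err)    # rounding error, bounded by scale / 2
```

Eight floats become eight 4-bit codes plus two small constants: roughly a 4x shrink versus fp16, with reconstruction error capped at half a quantization step.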
Multimodal models
MLX-VLM library extends MLX-LM to vision-language models (LLaVA, Qwen-VL, Gemma vision). Same ergonomics, same Apple Silicon performance.
Comparison
| Tool | Hardware | Peak Speed | Ecosystem | Best For |
|---|---|---|---|---|
| MLX | Apple Silicon only | Fastest on Apple Silicon | Small but active | Mac users, Apple-only teams |
| llama.cpp Metal | Apple Silicon + Intel Mac + CPU | Fast | Huge | Mac + cross-platform |
| Ollama | Cross-platform | Good (llama.cpp backend) | Very large | Developer ergonomics |
| PyTorch MPS | Apple Silicon | Medium | Huge (via PyTorch) | Research / training |
Real-World Use Cases
01. Max-performance Mac inference
M3 Max / M4 Max / M Ultra users running 30-70B models locally. MLX gets 20-40% more tokens/sec than llama.cpp on the same hardware — noticeable in interactive use.
02. LoRA fine-tuning on a MacBook
mlx_lm.lora lets you fine-tune small LoRAs on 32GB+ Macs. Not as fast as a dedicated GPU rig, but accessible — no cloud bill, no Linux machine.
03. Apple-first AI research
Researchers who live on Mac and want a native, low-overhead framework. MLX’s PyTorch-like API eases the transition.
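A sketch of the LoRA workflow from use case 02, assuming a recent mlx-lm release (verify flag names with mlx_lm.lora --help before relying on them); ./my_data is a placeholder directory expected to hold train.jsonl and valid.jsonl:

```shell
# LoRA fine-tune on-device (requires Apple Silicon with mlx-lm installed).
mlx_lm.lora \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --train \
  --data ./my_data \
  --iters 600 \
  --batch-size 4
```

The resulting adapter weights can then be loaded at inference time or fused back into the base model for serving.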
Pricing & Licensing
MLX / MLX-LM: MIT open source, maintained by Apple’s ML team. Free.
Hardware: only useful on Apple Silicon (M1, M2, M3, M4). The more unified memory, the larger the models you can run — a 64GB MacBook comfortably handles 70B 4-bit quants.
Ecosystem: smaller than llama.cpp. Models published via mlx-community on HuggingFace; convert your own from PyTorch weights with mlx_lm.convert.
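A back-of-envelope sizing check you can run before downloading a model. The 1.2 overhead factor (KV cache, activations, runtime) is an assumption for illustration, not a measured number:

```python
def est_memory_gb(params_b, bits, overhead=1.2):
    """Rough resident-memory estimate for a quantized model.

    params_b: parameter count in billions
    bits:     bits per weight (4 for the -q default, 8, or 16)
    overhead: fudge factor for KV cache, activations, runtime
    """
    return params_b * bits / 8 * overhead

for params, bits in [(3, 4), (7, 4), (70, 4), (70, 8)]:
    print(f"{params}B @ {bits}-bit: ~{est_memory_gb(params, bits):.0f} GB")
# 70B at 4-bit lands around 42 GB, which fits a 64GB Mac but not a 32GB one.
```

This is why the 64GB figure above is the practical floor for 70B 4-bit quants: the weights alone need ~35 GB before any runtime overhead.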
FAQ
MLX vs llama.cpp on Mac?
MLX is usually 20-40% faster on Apple Silicon for the same model, especially large ones. llama.cpp has a larger ecosystem and cross-platform portability. If you live on Mac and care about speed, MLX wins; if you need the same tooling across Mac/Linux/Windows, llama.cpp wins.
Does MLX work on Intel Macs?
No — Apple Silicon only (M1 and newer). For Intel Mac, use llama.cpp with Metal or CPU backends.
Can I use MLX models with Ollama or LM Studio?
LM Studio has experimental MLX support (v0.3+). Ollama does not ship MLX natively as of 2026 — it stays llama.cpp-based. To expose MLX via Ollama-compatible API, run mlx_lm.server and point clients at it directly.
How does MLX compare to Apple’s Core ML?
Core ML is Apple’s production ML inference runtime (shipped with iOS/macOS apps). MLX is more experimental and researcher-facing — PyTorch-like, flexible, with full training support. For deploying an LLM in a production Mac app, Core ML + Apple Intelligence is more typical; for interactive inference and fine-tuning, MLX is the tool.
Is MLX production-ready?
For inference: yes, widely used. For training/fine-tuning: usable but less mature than PyTorch. API is stable; expect faster iteration than PyTorch since Apple controls the roadmap and ships frequently.
Where do I find MLX models?
huggingface.co/mlx-community — community-maintained MLX conversions of popular open LLMs. Most major models (Llama 3.x, Qwen 2.5, Mistral, Gemma, DeepSeek, Phi) have MLX versions within days of release.