## Core APIs
| API | Description |
|---|---|
| Inference | Chat completion, text generation, embeddings |
| Safety | Content moderation with Llama Guard / Prompt Guard |
| Agents | Multi-step agentic workflows with tool use and memory |
| RAG | Document ingestion, vector search, contextual retrieval |
| Eval | Benchmark and evaluate model quality |
| Memory | Persistent memory banks for agent context |
| Tool Use | Web search, code execution, Wolfram Alpha, custom tools |
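As a concrete illustration of the Inference API above, the sketch below builds an OpenAI-style chat-completion request and posts it to a locally running Llama Stack server using only the standard library. The port (`8321`), endpoint path, and model identifier are assumptions based on a typical local distribution; in practice you would normally use the official `llama-stack-client` SDK instead of raw HTTP, so treat this as a shape sketch, not the canonical client code.

```python
"""Minimal sketch of calling a Llama Stack chat-completion endpoint.

Assumptions (verify against your distribution's API docs): the server
listens on localhost:8321 and accepts a JSON body with a model
identifier and a list of role/content messages.
"""
import json
import urllib.request

BASE_URL = "http://localhost:8321"  # assumed default for a local distribution


def build_chat_request(model_id: str, user_message: str) -> dict:
    # Request body: a model identifier plus an OpenAI-style message list.
    return {
        "model_id": model_id,
        "messages": [{"role": "user", "content": user_message}],
    }


def chat_completion(payload: dict) -> dict:
    # Endpoint path is an assumption; check your server's route listing.
    req = urllib.request.Request(
        f"{BASE_URL}/v1/inference/chat-completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# To send the request against a live server:
#   response = chat_completion(
#       build_chat_request("meta-llama/Llama-3.2-3B-Instruct", "Hello!"))
```

The same request shape works for any distribution, since every backend exposes the same REST surface.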
## Distribution Providers
Run anywhere with pluggable backends:
- Local: Ollama, vLLM, TGI
- Cloud: Together, Fireworks, AWS Bedrock, NVIDIA NIM
- On-device: Qualcomm, MediaTek, PyTorch ExecuTorch
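Because every distribution serves the same API, swapping backends amounts to pointing your client at a different base URL. The sketch below makes that one-line switch explicit; all hostnames and ports here are hypothetical placeholders, not real endpoints.

```python
# Sketch: selecting a backend is just choosing a base URL; the client
# code that follows (requests, message payloads) stays identical.
# All endpoints below are hypothetical placeholders.
ENDPOINTS = {
    "local-ollama": "http://localhost:8321",
    "local-vllm": "http://localhost:8000",
    "cloud": "https://my-hosted-distro.example.com",
}


def pick_base_url(provider: str) -> str:
    # Raises KeyError for an unknown provider name.
    return ENDPOINTS[provider]
```

Usage: `pick_base_url("local-ollama")` returns the local endpoint; switching to a cloud distribution changes only this lookup, not the request-building code.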
## FAQ
**Q: What is Llama Stack?**
A: Llama Stack is Meta's official framework for building LLM applications with Llama models. It provides standardized APIs for inference, safety, RAG, agents, and evals. The project has 8.3K+ GitHub stars and is MIT licensed.
**Q: Can I use Llama Stack with non-Llama models?**
A: Llama Stack is designed for Llama models, but inference providers such as Ollama and vLLM can serve other models through the same API.