Introduction
Petals enables running 100B+ parameter language models by splitting them across multiple consumer-grade machines connected over the internet. In a design inspired by BitTorrent, each participant hosts a subset of the model's layers while the system routes inference through available peers, making large-scale models accessible without enterprise hardware.
What Petals Does
- Distributes large language model layers across multiple peers on the internet
- Enables inference on 100B+ parameter models using commodity GPUs
- Supports fine-tuning via parameter-efficient methods like adapters and prompt tuning
- Provides a Hugging Face-compatible API for drop-in integration
- Runs both public swarms and private clusters for controlled deployments
Architecture Overview
Petals partitions a model's Transformer layers across a network of servers. When a client sends a request, the system routes hidden states sequentially through peers hosting consecutive layer ranges. A DHT-based routing protocol discovers available servers and balances load. Each peer only needs enough GPU memory for its assigned layers, so a 176B parameter model can run across a handful of consumer GPUs.
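From the client's perspective, this sharding is invisible: the model behaves like any Hugging Face model. A minimal sketch of client usage (the model name and prompt are illustrative, and the swarm must already be hosting that model):

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom"  # illustrative; use any model the swarm hosts
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Only the embeddings and output head load locally; the Transformer
# blocks execute on remote peers discovered via the DHT.
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
```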
Self-Hosting & Configuration
- Install via pip: `pip install petals` (requires Python 3.8+)
- Run a server with `python -m petals.cli.run_server bigscience/bloom --num_blocks 12`
- Each server hosts a configurable number of Transformer blocks based on available VRAM
- Join the public swarm automatically, or configure a private swarm with `--initial_peers` (see the client sketch after this list)
- Monitor server health and swarm status via the Petals health dashboard
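On the client side, pointing at a private swarm is a matter of passing bootstrap addresses. A minimal sketch, assuming the `initial_peers` argument of Petals' `from_pretrained`; the multiaddr below is a hypothetical placeholder for your own bootstrap node:

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Hypothetical bootstrap peer; replace with the multiaddr printed
# by your own run_server / bootstrap node at startup.
INITIAL_PEERS = ["/ip4/10.0.0.2/tcp/31337/p2p/QmExamplePeerID"]

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoDistributedModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    initial_peers=INITIAL_PEERS,  # omit to join the public swarm instead
)
```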
Key Features
- Run 100B+ models on hardware that could never fit them locally
- Up to 10x faster than offloading-based approaches, which swap layers through a single GPU
- Fine-tune with LoRA or prompt tuning across the distributed network (see the sketch after this list)
- Fault-tolerant routing automatically reroutes around offline peers
- Compatible with Hugging Face generate API and chat templates
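To illustrate the fine-tuning path, here is a hedged prompt-tuning sketch. The `tuning_mode="ptune"` and `pre_seq_len` arguments follow Petals' published prompt-tuning examples and may differ across versions; the training text is a placeholder:

```python
import torch
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "bigscience/bloom"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumed arguments from Petals' prompt-tuning examples: the trainable
# soft-prompt embeddings live on the client, while the frozen Transformer
# blocks stay distributed across the swarm.
model = AutoDistributedModelForCausalLM.from_pretrained(
    model_name, tuning_mode="ptune", pre_seq_len=16
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

batch = tokenizer("Example training text", return_tensors="pt")
loss = model(**batch, labels=batch["input_ids"]).loss  # forward runs through remote peers
loss.backward()    # gradients reach only the local soft prompts
optimizer.step()
optimizer.zero_grad()
```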
Comparison with Similar Tools
- llama.cpp — optimized single-machine inference; Petals distributes across many machines for models that exceed local capacity
- vLLM — high-throughput serving on a single node or cluster; Petals targets volunteer-style distributed setups
- Ollama — simplified local LLM experience; Petals handles models too large for any single machine
- ExLlamaV2 — quantized inference to fit a model on a single GPU; Petals shards the model (optionally quantized) across many GPUs
- Together AI — managed distributed inference; Petals is self-hosted and free
FAQ
Q: How fast is inference compared to running the full model locally? A: Latency depends on network speed between peers. On a well-connected swarm, generation is interactive (a few tokens per second for large models), though slower than dedicated hardware.
Q: What models are supported? A: Petals supports most Hugging Face Transformer models. The public swarm typically hosts BLOOM and Llama variants. Private swarms can host any model.
Q: Is my data private when using the public swarm? A: Intermediate activations pass through other participants' machines. For sensitive data, run a private swarm with trusted peers.
Q: Can I contribute GPU time without running inference myself? A: Yes. Run the server command to donate your GPU to the public swarm; your machine serves model layers for other users' requests.