# Petals — Run LLMs at Home BitTorrent-Style

> A decentralized system for running large language models collaboratively across consumer hardware. Distributes model layers across peers for inference and fine-tuning.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use
```bash
pip install petals
python -c "
from petals import AutoDistributedModelForCausalLM
from transformers import AutoTokenizer

# Connect to the swarm that hosts the model layers
model = AutoDistributedModelForCausalLM.from_pretrained('bigscience/bloom')
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom')

# Pass only the token IDs, then decode the generated sequence
inputs = tokenizer('Hello, world', return_tensors='pt')['input_ids']
outputs = model.generate(inputs, max_new_tokens=5)
print(tokenizer.decode(outputs[0]))
"
```

## Introduction
Petals enables running 100B+ parameter language models by splitting them across multiple consumer-grade machines connected over the internet. Inspired by BitTorrent, each participant hosts a subset of model layers while the system routes inference through available peers, making large-scale models accessible without enterprise hardware.

## What Petals Does
- Distributes large language model layers across multiple peers on the internet
- Enables inference on 100B+ parameter models using commodity GPUs
- Supports fine-tuning via parameter-efficient methods like adapters and prompt tuning
- Provides a Hugging Face-compatible API for drop-in integration
- Runs both public swarms and private clusters for controlled deployments

## Architecture Overview
Petals partitions a model's Transformer layers across a network of servers. When a client sends a request, the system routes hidden states sequentially through peers hosting consecutive layer ranges. A DHT-based routing protocol discovers available servers and balances load. Each peer only needs enough GPU memory for its assigned layers, so a 176B-parameter model can run across a pool of consumer GPUs, none of which could hold it alone.
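
The routing idea can be illustrated with a toy sketch. This is not Petals' routing code: the `Server` class, the `find_chain` function, and the greedy "furthest reach" strategy are assumptions made for illustration. The real system discovers servers through the DHT and also weighs load.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Server:
    """Hypothetical peer that hosts blocks [start, end) of the model."""
    name: str
    start: int
    end: int
    online: bool = True


def find_chain(servers: List[Server], num_blocks: int) -> Optional[List[Server]]:
    """Greedily pick servers whose block ranges cover 0..num_blocks in order.

    Returns None if the online servers leave a gap.
    """
    chain: List[Server] = []
    position = 0  # next block that still needs a host
    while position < num_blocks:
        # Online servers that host the block we need next.
        candidates = [s for s in servers if s.online and s.start <= position < s.end]
        if not candidates:
            return None
        # Prefer the server that carries hidden states furthest forward.
        best = max(candidates, key=lambda s: s.end)
        chain.append(best)
        position = best.end
    return chain


swarm = [
    Server("A", 0, 24),
    Server("B", 24, 48),
    Server("C", 20, 50, online=False),  # offline, so it is skipped
    Server("D", 48, 70),
]
chain = find_chain(swarm, num_blocks=70)
print([s.name for s in chain])  # ['A', 'B', 'D']
```

Because offline servers are filtered out before each choice, calling `find_chain` again after a peer drops yields a new chain whenever the remaining servers still cover every block.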

## Self-Hosting & Configuration
- Install via pip: `pip install petals` on Python 3.8+
- Run a server with `python -m petals.cli.run_server bigscience/bloom --num_blocks 12`
- Each server hosts a configurable number of Transformer blocks based on available VRAM
- Join the public swarm automatically or configure a private swarm with `--initial_peers`
- Monitor server health and swarm status via the Petals health dashboard
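
To get a feel for how VRAM translates into a block count, here is a back-of-the-envelope sketch. The `blocks_that_fit` helper is hypothetical, and the figures are assumptions: 176B parameters spread evenly over 70 blocks, embeddings ignored, and a flat 2 GB reserved for activations and caches. Petals may pick a different number on your hardware.

```python
import math


def blocks_that_fit(vram_gb: float, params_per_block: float,
                    bytes_per_param: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many Transformer blocks fit in one GPU's memory."""
    usable_bytes = (vram_gb - reserve_gb) * 1e9
    block_bytes = params_per_block * bytes_per_param
    return max(0, int(usable_bytes // block_bytes))


# Assumption: 176B parameters spread evenly over 70 blocks.
TOTAL_BLOCKS = 70
PARAMS_PER_BLOCK = 176e9 / TOTAL_BLOCKS

for label, bytes_per_param in [("16-bit", 2), ("8-bit", 1)]:
    n = blocks_that_fit(24, PARAMS_PER_BLOCK, bytes_per_param)
    servers = math.ceil(TOTAL_BLOCKS / n)
    print(f"{label}: {n} blocks per 24 GB GPU, about {servers} GPUs for the full model")
```

Under these assumptions a 24 GB card holds 4 blocks at 16-bit precision or 8 blocks at 8-bit, which is the kind of value you would then pass as `--num_blocks`.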

## Key Features
- Run 100B+ models on hardware that could never fit them locally
- Up to 10x faster than offloading-based approaches when running models too large for local memory
- Fine-tune with LoRA or prompt tuning across the distributed network
- Fault-tolerant routing automatically reroutes around offline peers
- Compatible with Hugging Face generate API and chat templates

## Comparison with Similar Tools
- **llama.cpp** — optimized single-machine inference; Petals distributes across many machines for models that exceed local capacity
- **vLLM** — high-throughput serving on a single node or cluster; Petals targets volunteer-style distributed setups
- **Ollama** — simplified local LLM experience; Petals handles models too large for any single machine
- **ExLlamaV2** — quantized inference for fitting models on one GPU; Petals spreads a model's blocks across many GPUs instead of squeezing it onto one
- **Together AI** — managed distributed inference; Petals is self-hosted and free

## FAQ
**Q: How fast is inference compared to running the full model locally?**
A: Latency depends on network speed between peers. On a well-connected swarm, generation is interactive (a few tokens per second for large models), though slower than dedicated hardware.
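
As a rough illustration of why network speed dominates, the sketch below models one generated token as a visit to every server in the chain, paying one round trip plus some compute per server. The `tokens_per_second` function and all numbers are hypothetical, and the model ignores batching, queuing, and bandwidth limits.

```python
def tokens_per_second(num_hops: int, rtt_s: float, compute_s: float) -> float:
    """Rough single-stream decode rate when each token visits every hop in turn."""
    per_token_s = num_hops * (rtt_s + compute_s)
    return 1.0 / per_token_s


# Hypothetical swarm: 3 servers in the chain, 0.1 s of compute per server.
for rtt in (0.02, 0.05, 0.20):
    rate = tokens_per_second(num_hops=3, rtt_s=rtt, compute_s=0.1)
    print(f"RTT {rtt * 1000:.0f} ms -> {rate:.2f} tokens/s")
```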

**Q: What models are supported?**
A: Petals supports a set of Hugging Face Transformer architectures rather than every model on the Hub. The public swarm typically hosts BLOOM and Llama variants. Private swarms can host any model built on a supported architecture.

**Q: Is my data private when using the public swarm?**
A: Intermediate activations pass through other participants' machines. For sensitive data, run a private swarm with trusted peers.

**Q: Can I contribute GPU time without running inference myself?**
A: Yes. Run the server command to donate your GPU to the public swarm. You help others run models and incur no direct cost for participating.

## Sources
- https://github.com/bigscience-workshop/petals
- https://petals.dev/

---
Source: https://tokrepo.com/en/workflows/petals-run-llms-home-bittorrent-style-98cbd290
Author: AI Open Source