# Petals: Run LLMs at Home BitTorrent-Style

> A decentralized system for running large language models collaboratively across consumer hardware. Distributes model layers across peers for inference and fine-tuning.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
pip install petals
python -c "
from petals import AutoDistributedModelForCausalLM
from transformers import AutoTokenizer

# Connect to the swarm; layers are served by remote peers, not loaded locally
model = AutoDistributedModelForCausalLM.from_pretrained('bigscience/bloom')
tokenizer = AutoTokenizer.from_pretrained('bigscience/bloom')

# Tokenize a prompt and generate a few new tokens through the swarm
inputs = tokenizer('Hello, world', return_tensors='pt')
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=5)[0]))
"
```

## Introduction

Petals enables running 100B+ parameter language models by splitting them across multiple consumer-grade machines connected over the internet. Inspired by BitTorrent, each participant hosts a subset of model layers while the system routes inference through available peers, making large-scale models accessible without enterprise hardware.

## What Petals Does

- Distributes large language model layers across multiple peers on the internet
- Enables inference on 100B+ parameter models using commodity GPUs
- Supports fine-tuning via parameter-efficient methods like adapters and prompt tuning
- Provides a Hugging Face-compatible API for drop-in integration
- Runs both public swarms and private clusters for controlled deployments

## Architecture Overview

Petals partitions a model's Transformer layers across a network of servers. When a client sends a request, the system routes hidden states sequentially through peers hosting consecutive layer ranges. A DHT-based routing protocol discovers available servers and balances load. Each peer only needs enough GPU memory for its assigned layers, so a 176B parameter model can run across a handful of consumer GPUs.
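
The routing idea in the Architecture Overview can be illustrated with a toy sketch. This is not Petals' routing code or API: `Server`, `plan_route`, and the block ranges are hypothetical names and numbers invented for illustration, and the greedy interval cover used here is an assumed stand-in for the real DHT-based protocol, which also balances load.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Server:
    """A hypothetical peer advertising a contiguous range of Transformer blocks."""
    name: str
    start: int  # first hosted block (inclusive)
    end: int    # one past the last hosted block (exclusive)


def plan_route(servers: List[Server], num_blocks: int) -> Optional[List[Server]]:
    """Pick a chain of servers that together cover blocks 0..num_blocks-1.

    Greedy interval cover: at each step, among servers hosting the next
    needed block, choose the one whose range reaches furthest.
    Returns None if some block is not hosted by any server.
    """
    route: List[Server] = []
    current = 0
    while current < num_blocks:
        candidates = [s for s in servers if s.start <= current < s.end]
        if not candidates:
            return None  # gap in coverage: no peer hosts this block
        best = max(candidates, key=lambda s: s.end)
        route.append(best)
        current = best.end
    return route


# Illustrative swarm for a 24-block model (made-up numbers)
swarm = [
    Server("A", 0, 12),
    Server("B", 8, 20),
    Server("C", 12, 24),
    Server("D", 20, 24),
]

print([s.name for s in plan_route(swarm, 24)])  # ['A', 'C']

# If peer C goes offline, the same planner finds a longer chain around it
online = [s for s in swarm if s.name != "C"]
print([s.name for s in plan_route(online, 24)])  # ['A', 'B', 'D']
```

In this sketch, hidden states would flow through the returned servers in order, each applying only the blocks it hosts. Re-running the planner after a peer drops out shows, in miniature, how a route can be rebuilt around offline servers.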

## Self-Hosting & Configuration

- Install via pip: `pip install petals` on Python 3.8+
- Run a server with `python -m petals.cli.run_server bigscience/bloom --num_blocks 12`
- Each server hosts a configurable number of Transformer blocks based on available VRAM
- Join the public swarm automatically or configure a private swarm with `--initial_peers`
- Monitor server health and swarm status via the Petals health dashboard

## Key Features

- Run 100B+ models on hardware that could never fit them locally
- Up to 10x faster than offloading-based approaches for distributed inference
- Fine-tune with LoRA or prompt tuning across the distributed network
- Fault-tolerant routing automatically reroutes around offline peers
- Compatible with Hugging Face generate API and chat templates

## Comparison with Similar Tools

- **llama.cpp**: optimized single-machine inference; Petals distributes across many machines for models that exceed local capacity
- **vLLM**: high-throughput serving on a single node or cluster; Petals targets volunteer-style distributed setups
- **Ollama**: simplified local LLM experience; Petals handles models too large for any single machine
- **ExLlamaV2**: quantized inference for fitting models on one GPU; Petals instead spreads the model across many GPUs
- **Together AI**: managed distributed inference; Petals is self-hosted and free

## FAQ

**Q: How fast is inference compared to running the full model locally?**

A: Latency depends on network speed between peers. On a well-connected swarm, generation is interactive (a few tokens per second for large models), though slower than dedicated hardware.

**Q: What models are supported?**

A: Petals supports a set of Hugging Face Transformer architectures. The public swarm typically hosts BLOOM and Llama variants. Private swarms can host any supported model.

**Q: Is my data private when using the public swarm?**

A: Intermediate activations pass through other participants' machines.
For sensitive data, run a private swarm with trusted peers.

**Q: Can I contribute GPU time without running inference myself?**

A: Yes. Run the server command to donate your GPU to the public swarm. You help others run models at no direct cost.

## Sources

- https://github.com/bigscience-workshop/petals
- https://petals.dev/

---

Source: https://tokrepo.com/en/workflows/petals-run-llms-home-bittorrent-style-98cbd290

Author: AI Open Source