May 1, 2026 · 3 min read

Petals — Run LLMs at Home, BitTorrent-Style

A decentralized system for running large language models collaboratively across consumer hardware. Distributes model layers across peers for inference and fine-tuning.

Introduction

Petals enables running 100B+ parameter language models by splitting them across multiple consumer-grade machines connected over the internet. In a design inspired by BitTorrent, each participant hosts a subset of model layers while the system routes inference through available peers, making large-scale models accessible without enterprise hardware.

What Petals Does

  • Distributes large language model layers across multiple peers on the internet
  • Enables inference on 100B+ parameter models using commodity GPUs
  • Supports fine-tuning via parameter-efficient methods like adapters and prompt tuning
  • Provides a Hugging Face-compatible API for drop-in integration (a client sketch follows this list)
  • Runs both public swarms and private clusters for controlled deployments
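Because the client API mirrors Hugging Face Transformers, a swarm-backed model drops into ordinary generation code. A minimal client sketch, following the usage pattern in the Petals README; the model name and prompt are placeholders:

    # Minimal Petals client: the model class comes from petals, the rest is
    # standard Hugging Face usage. Requires `pip install petals`.
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    model_name = "bigscience/bloom"  # placeholder; use any model the swarm serves
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=16)
    print(tokenizer.decode(outputs[0]))

In Petals' design, the heavy Transformer blocks execute on remote peers while the client handles the lightweight input and output layers locally.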

Architecture Overview

Petals partitions a model's Transformer layers across a network of servers. When a client sends a request, the system routes hidden states sequentially through peers hosting consecutive layer ranges. A DHT-based routing protocol discovers available servers and balances load. Each peer only needs enough GPU memory for its assigned layers, so a 176B parameter model can run across a handful of consumer GPUs.
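The routing idea can be shown with a toy sketch. This is a conceptual illustration only, not Petals' actual internals: peers advertise which contiguous layer range they serve, and the client hops from peer to peer until every layer has been applied.

    # Toy model of Petals-style routing (illustration, not the real code).
    def make_layer(i):
        return lambda h: h + [i]  # stand-in "layer" that records its index

    NUM_LAYERS = 8
    layers = [make_layer(i) for i in range(NUM_LAYERS)]
    peers = {  # peer id -> (first layer, last layer exclusive)
        "peer-a": (0, 3),
        "peer-b": (3, 6),
        "peer-c": (6, 8),
    }

    def run_on_peer(peer_id, hidden):
        start, end = peers[peer_id]
        for layer in layers[start:end]:
            hidden = layer(hidden)
        return hidden

    def route(hidden):
        pos = 0
        while pos < NUM_LAYERS:
            # Stand-in for the DHT lookup: find a peer serving layer `pos`.
            peer_id = next(p for p, (s, e) in peers.items() if s <= pos < e)
            hidden = run_on_peer(peer_id, hidden)
            pos = peers[peer_id][1]
        return hidden

    print(route([]))  # [0, 1, 2, 3, 4, 5, 6, 7]: every layer applied once, in order

In the real system the lookup consults the DHT, and fault tolerance comes from repeating the lookup with a different peer when one goes offline.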

Self-Hosting & Configuration

  • Install via pip: pip install petals on Python 3.8+
  • Run a server with python -m petals.cli.run_server bigscience/bloom --num_blocks 12
  • Each server hosts a configurable number of Transformer blocks based on available VRAM
  • Join the public swarm automatically or configure a private swarm with --initial_peers (client-side sketch after this list)
  • Monitor server health and swarm status via the Petals health dashboard
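To keep traffic inside a private swarm, clients pass the same bootstrap multiaddrs the servers were started with. A hedged sketch assuming the initial_peers keyword documented by Petals; the address below is a placeholder to replace with the one your first server prints at startup:

    from petals import AutoDistributedModelForCausalLM

    # Placeholder multiaddr; copy the real one from your bootstrap server's logs.
    INITIAL_PEERS = ["/ip4/10.0.0.1/tcp/31337/p2p/QmPeerIdPlaceholder"]

    model = AutoDistributedModelForCausalLM.from_pretrained(
        "bigscience/bloom",            # must match the model your servers host
        initial_peers=INITIAL_PEERS,   # route only through the private swarm
    )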

Key Features

  • Run 100B+ models on hardware that could never fit them locally
  • Up to 10x faster than offloading-based approaches for distributed inference
  • Fine-tune with LoRA or prompt tuning across the distributed network (see the prompt-tuning sketch after this list)
  • Fault-tolerant routing automatically reroutes around offline peers
  • Compatible with Hugging Face generate API and chat templates
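Fine-tuning works by keeping the remote layers frozen and training a small set of parameters on the client. A sketch of prompt tuning, assuming the tuning_mode="ptune" interface from the Petals fine-tuning examples; the model name, training text, and hyperparameters are placeholders:

    import torch
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    model_name = "bigscience/bloom"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoDistributedModelForCausalLM.from_pretrained(
        model_name,
        tuning_mode="ptune",  # learn soft prompt embeddings on the client
        pre_seq_len=16,       # number of trainable prompt tokens
    )

    trainable = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.AdamW(trainable, lr=1e-3)

    batch = tokenizer("One placeholder training example", return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss  # forward spans remote peers
    loss.backward()  # gradients reach only the local prompt embeddings
    opt.step()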

Comparison with Similar Tools

  • llama.cpp — optimized single-machine inference; Petals distributes across many machines for models that exceed local capacity
  • vLLM — high-throughput serving on a single node or cluster; Petals targets volunteer-style distributed setups
  • Ollama — simplified local LLM experience; Petals handles models too large for any single machine
  • ExLlamaV2 — quantized inference for fitting models on one GPU; Petals splits the model across many GPUs instead of compressing it onto one
  • Together AI — managed distributed inference; Petals is self-hosted and free

FAQ

Q: How fast is inference compared to running the full model locally? A: Latency depends on network speed between peers. On a well-connected swarm, generation is interactive (a few tokens per second for large models), though slower than dedicated hardware.

Q: What models are supported? A: Petals supports popular open Transformer architectures; the public swarm typically hosts BLOOM and Llama variants. Private swarms can host any model the Petals server supports.

Q: Is my data private when using the public swarm? A: Intermediate activations pass through other participants' machines. For sensitive data, run a private swarm with trusted peers.

Q: Can I contribute GPU time without running inference myself? A: Yes. Run the server command to donate your GPU to the public swarm; your machine serves model layers for other users' requests, at no cost beyond electricity and bandwidth.
