Configs · May 1, 2026 · 3 min read

Petals — Run LLMs at Home BitTorrent-Style

A decentralized system for running large language models collaboratively across consumer hardware. Distributes model layers across peers for inference and fine-tuning.

Introduction

Petals enables running 100B+ parameter language models by splitting them across multiple consumer-grade machines connected over the internet. Inspired by BitTorrent, each participant hosts a subset of model layers while the system routes inference through available peers, making large-scale models accessible without enterprise hardware.

What Petals Does

  • Distributes large language model layers across multiple peers on the internet
  • Enables inference on 100B+ parameter models using commodity GPUs
  • Supports fine-tuning via parameter-efficient methods like adapters and prompt tuning
  • Provides a Hugging Face-compatible API for drop-in integration
  • Runs both public swarms and private clusters for controlled deployments

Architecture Overview

Petals partitions a model's Transformer layers across a network of servers. When a client sends a request, the system routes hidden states sequentially through peers hosting consecutive layer ranges. A DHT-based routing protocol discovers available servers and balances load. Each peer only needs enough GPU memory for its assigned layers, so a 176B parameter model can run across a handful of consumer GPUs.
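To make this concrete, here is a minimal client-side sketch using the Hugging Face-compatible API mentioned above; the class name follows Petals' published examples, and the prompt and model choice are illustrative:

    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    model_name = "bigscience/bloom"  # any model hosted by the swarm
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Loads only the embeddings locally; the Transformer blocks are
    # reached through servers discovered via the DHT.
    model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
    outputs = model.generate(inputs, max_new_tokens=5)
    print(tokenizer.decode(outputs[0]))

Each generate() call streams hidden states through whichever peers currently host the required layer ranges; the client never materializes the full model.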

Self-Hosting & Configuration

  • Install via pip: pip install petals on Python 3.8+
  • Run a server with python -m petals.cli.run_server bigscience/bloom --num_blocks 12
  • Each server hosts a configurable number of Transformer blocks based on available VRAM
  • Join the public swarm automatically or configure a private swarm with --initial_peers (a client-side sketch follows this list)
  • Monitor server health and swarm status via the Petals health dashboard
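For a private swarm, the client points at your own bootstrap peers instead of the public ones. A rough sketch, assuming the initial_peers keyword mirrors the server-side flag (the multiaddress below is a placeholder, not a real peer):

    from petals import AutoDistributedModelForCausalLM

    # Placeholder multiaddress; substitute the address your own
    # run_server instance prints at startup.
    INITIAL_PEERS = ["/ip4/10.0.0.1/tcp/31337/p2p/QmExamplePeerID"]

    model = AutoDistributedModelForCausalLM.from_pretrained(
        "bigscience/bloom",
        initial_peers=INITIAL_PEERS,  # connect only to the private swarm
    )

Servers join the same private swarm by passing the same --initial_peers value to run_server.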

Key Features

  • Run 100B+ models on hardware that could never fit them locally
  • Up to 10x faster than offloading-based approaches for distributed inference
  • Fine-tune with LoRA or prompt tuning across the distributed network (see the sketch after this list)
  • Fault-tolerant routing automatically reroutes around offline peers
  • Compatible with Hugging Face generate API and chat templates
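As a sketch of the parameter-efficient fine-tuning path, this is roughly what prompt tuning looks like; the tuning_mode and pre_seq_len keywords follow Petals' prompt-tuning examples but may differ by version, and the training text is a stand-in:

    import torch
    from transformers import AutoTokenizer
    from petals import AutoDistributedModelForCausalLM

    MODEL_NAME = "bigscience/bloom"  # illustrative

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # Base weights stay frozen on the remote servers; only a small set
    # of prompt embeddings is created and trained locally.
    model = AutoDistributedModelForCausalLM.from_pretrained(
        MODEL_NAME, tuning_mode="ptune", pre_seq_len=16
    )
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-3
    )

    batch = tokenizer("Stand-in training text", return_tensors="pt")
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()  # gradients flow back through the swarm to the local prompts
    optimizer.step()

Because only the local prompt parameters are updated, no server needs write access to the base model's weights.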

Comparison with Similar Tools

  • llama.cpp — optimized single-machine inference; Petals distributes across many machines for models that exceed local capacity
  • vLLM — high-throughput serving on a single node or cluster; Petals targets volunteer-style distributed setups
  • Ollama — simplified local LLM experience; Petals handles models too large for any single machine
  • ExLlamaV2 — quantized inference for fitting models on one GPU; Petals runs full-precision across many GPUs
  • Together AI — managed distributed inference; Petals is self-hosted and free

FAQ

Q: How fast is inference compared to running the full model locally? A: Latency depends on network speed between peers. On a well-connected swarm, generation is interactive (a few tokens per second for large models), though slower than dedicated hardware.

Q: What models are supported? A: Petals supports most Hugging Face Transformer models. The public swarm typically hosts BLOOM and Llama variants. Private swarms can host any model.

Q: Is my data private when using the public swarm? A: Intermediate activations pass through other participants' machines. For sensitive data, run a private swarm with trusted peers.

Q: Can I contribute GPU time without running inference myself? A: Yes. Run the server command to donate your GPU to the public swarm; this helps others run models and costs you nothing beyond electricity and bandwidth.
