Configs · May 24, 2026 · 3 min read

LoRAX — Multi-LoRA Inference Server for Fine-Tuned LLMs

An open-source inference server by Predibase that serves thousands of fine-tuned LoRA adapters on a single base model with shared GPU memory.

Introduction

LoRAX (LoRA eXchange) is an open-source multi-LoRA inference server that can serve thousands of fine-tuned LoRA adapters from a single shared base model. Because it loads and unloads adapters on demand, there is no need to deploy a separate model instance for each fine-tuned variant, so GPU cost scales with the base model rather than with the number of adapters.

What LoRAX Does

  • Serves multiple LoRA adapters from one base model deployment
  • Dynamically loads adapters from Hugging Face Hub or local storage on request
  • Implements continuous batching across different adapters in the same batch
  • Manages adapter lifecycle with LRU caching and preloading
  • Exposes OpenAI-compatible and native REST APIs
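As a rough illustration of how a request selects an adapter, the sketch below builds the JSON body for a native generation call. The `/generate` path, the `parameters` layout, and the `adapter_source` field are assumptions based on the project's published examples, and the adapter name is hypothetical; check the LoRAX documentation before relying on any of them.

```python
import json

def build_generate_request(prompt, adapter_id=None, adapter_source="hub",
                           max_new_tokens=64):
    """Build a JSON body for a LoRAX-style generation call (assumed shape)."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        # Omitting adapter_id is assumed to target the plain base model.
        parameters["adapter_id"] = adapter_id
        parameters["adapter_source"] = adapter_source
    return {"inputs": prompt, "parameters": parameters}

# Hypothetical adapter name, used for illustration only.
base_request = build_generate_request("Summarize: LoRA adds low-rank updates.")
tuned_request = build_generate_request(
    "Summarize: LoRA adds low-rank updates.",
    adapter_id="my-org/support-summarizer-lora",
)

# The body would be POSTed to the server, e.g. http://127.0.0.1:8080/generate
print(json.dumps(tuned_request, indent=2))
```

Two requests that differ only in `adapter_id` can go to the same deployment, which is what lets one server stand in for many fine-tuned variants.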

Architecture Overview

LoRAX extends the text-generation-inference (TGI) architecture with a multi-adapter scheduler. The base model weights stay in GPU memory permanently, while LoRA adapter weights are loaded into a dedicated adapter cache. During batching, requests targeting different adapters are grouped together, with adapter-specific computations applied per-request during the forward pass using custom CUDA kernels.
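The per-request adapter computation can be pictured with a toy example. This is a conceptual sketch in plain Python, not LoRAX's kernels: each request computes the shared base projection `W x` and, if it names an adapter, adds that adapter's low-rank update `scale * B (A x)`. All names and matrices here are made up for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(base_weight, adapters, batch):
    """Apply one shared base layer plus a per-request LoRA delta.

    batch is a list of (adapter_name or None, input_vector) pairs.
    adapters maps a name to (A, B, scale), with A of shape r x d_in
    and B of shape d_out x r.
    """
    outputs = []
    for adapter_name, x in batch:
        y = matvec(base_weight, x)  # same base weights for every request
        if adapter_name is not None:
            a_matrix, b_matrix, scale = adapters[adapter_name]
            delta = matvec(b_matrix, matvec(a_matrix, x))  # rank-r update
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs

base_weight = [[1, 0], [0, 1]]  # toy 2x2 base layer
adapters = {
    "adapter-a": ([[1, 0]], [[0], [1]], 1.0),  # rank 1
    "adapter-b": ([[0, 1]], [[1], [0]], 0.5),  # rank 1
}
batch = [(None, [2, 3]), ("adapter-a", [2, 3]), ("adapter-b", [2, 3])]
outputs = lora_forward(base_weight, adapters, batch)
print(outputs)
```

A real server would run the base projection once over the whole batch and use fused GPU kernels for the adapter deltas; the loop here only makes the per-request selection explicit.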

Self-Hosting & Configuration

  • Deploy via Docker with NVIDIA GPU support
  • Specify the base model ID from Hugging Face on startup
  • Adapters load dynamically via the adapter_id request parameter
  • Configure adapter cache size based on available GPU memory
  • Supports quantized base models (AWQ, GPTQ, bitsandbytes) for reduced memory
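To make the cache-sizing point concrete, here is a minimal sketch of a byte-budgeted LRU cache for adapters. It is an illustration of the general technique, not LoRAX's implementation; the class name, method names, and sizes are hypothetical.

```python
from collections import OrderedDict

class AdapterCache:
    """Toy LRU cache that tracks adapters by size against a byte budget."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # adapter_id -> size in bytes

    def touch(self, adapter_id, size_bytes):
        """Mark an adapter as used, loading it if needed.

        Returns the list of adapter ids evicted to make room.
        """
        if adapter_id in self._entries:
            self._entries.move_to_end(adapter_id)  # now most recently used
            return []
        if size_bytes > self.budget_bytes:
            raise ValueError("adapter is larger than the whole cache budget")
        evicted = []
        while self.used_bytes + size_bytes > self.budget_bytes:
            old_id, old_size = self._entries.popitem(last=False)  # oldest first
            self.used_bytes -= old_size
            evicted.append(old_id)
        self._entries[adapter_id] = size_bytes
        self.used_bytes += size_bytes
        return evicted

    def resident(self):
        """Adapter ids currently cached, least recently used first."""
        return list(self._entries)

MB = 10**6
cache = AdapterCache(budget_bytes=60 * MB)  # room for two 30 MB adapters
cache.touch("adapter-a", 30 * MB)
cache.touch("adapter-b", 30 * MB)
cache.touch("adapter-a", 30 * MB)            # cache hit, refreshes adapter-a
evicted = cache.touch("adapter-c", 30 * MB)  # forces out the oldest entry
print(evicted, cache.resident())
```

A larger budget means fewer reloads for frequently alternating adapters, at the cost of memory that could otherwise hold KV cache for longer sequences or larger batches.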

Key Features

  • Serve 1000+ adapters from a single GPU deployment, with only the recently used ones resident in GPU memory
  • Hot-swap adapters without server restart or downtime
  • Heterogeneous batching mixes requests for different adapters
  • Adapter preloading for latency-sensitive use cases
  • Compatible with any Hugging Face LoRA/PEFT adapter

Comparison with Similar Tools

  • vLLM — general LLM serving; LoRAX specializes in multi-adapter serving with shared base models
  • TGI (Text Generation Inference) — single-model focus; LoRAX extends it with multi-LoRA support
  • SGLang — prefix caching; LoRAX focuses on adapter multiplexing
  • Ollama — local single-model use; LoRAX targets multi-tenant production serving

FAQ

Q: How many adapters can LoRAX serve simultaneously? A: It depends on adapter size and GPU memory. Typical rank-16 adapters use roughly 30 MB each, so the memory left on a 24 GB GPU after loading the base model can cache hundreds; adapters beyond that are loaded on demand.
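The 30 MB figure can be sanity-checked with a back-of-the-envelope calculation. The model shape below (32 layers, four 4096 x 4096 attention projections targeted, fp16 weights) and the 8 GiB adapter budget are assumptions chosen for illustration; real adapters vary with the target modules and the base architecture.

```python
def lora_adapter_bytes(num_layers, rank, target_shapes, bytes_per_param=2):
    """Estimate LoRA adapter size.

    Each targeted projection of shape (d_in, d_out) adds two matrices,
    A (rank x d_in) and B (d_out x rank), i.e. rank * (d_in + d_out) params.
    """
    params_per_layer = sum(rank * (d_in + d_out) for d_in, d_out in target_shapes)
    return num_layers * params_per_layer * bytes_per_param

MIB = 2**20

# Assumed shape: 32 layers, q/k/v/o projections of 4096 x 4096, rank 16, fp16.
adapter_bytes = lora_adapter_bytes(
    num_layers=32, rank=16, target_shapes=[(4096, 4096)] * 4
)

# Assumed budget: 8 GiB of a 24 GB GPU left for adapters after the base model.
adapter_budget_bytes = 8 * 1024 * MIB
adapters_that_fit = adapter_budget_bytes // adapter_bytes

print(adapter_bytes // MIB, "MiB per adapter;", adapters_that_fit, "fit in budget")
```

Under these assumptions an adapter is 32 MiB and 256 fit in the budget, which is consistent with the "hundreds" estimate above.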

Q: Does serving multiple adapters slow down inference? A: Only slightly. Each adapter adds a small low-rank computation per token, while batching across adapters amortizes the cost of the base model.

Q: Can I train adapters with LoRAX? A: No. LoRAX is inference-only. Train adapters with PEFT, Axolotl, or Unsloth, then serve them with LoRAX.

Q: Which base models are supported? A: Llama, Mistral, Qwen, Gemma, Phi, and most decoder-only Hugging Face architectures.
