Configs · May 24, 2026 · 3 min read

LoRAX — Multi-LoRA Inference Server for Fine-Tuned LLMs

An open-source inference server by Predibase that serves thousands of fine-tuned LoRA adapters on a single base model with shared GPU memory.

Agent ready

Safe staging for this asset

This asset is staged first. The copied prompt tells the agent to inspect the staged files and ask before activating scripts, MCP config, or global config.

Stage only · 29/100 · Policy: stage

  • Agent surface: Any MCP/CLI agent
  • Kind: Skill
  • Install: Stage only
  • Trust: Established
  • Entrypoint: LoRAX

Safe staging command
npx -y tokrepo@latest install 1ff25e23-57ae-11f1-9bc6-00163e2b0d79 --target codex

Stages files first; activation requires review of the staged README and plan.

Introduction

LoRAX (LoRA eXchange) is an open-source multi-LoRA inference server that can serve thousands of fine-tuned LoRA adapters from a single shared base model. It loads and unloads adapters on demand, so each fine-tuned variant no longer needs its own model deployment, and GPU cost is driven by the one base model plus a small per-adapter overhead rather than by the number of variants.

What LoRAX Does

  • Serves multiple LoRA adapters from one base model deployment
  • Dynamically loads adapters from Hugging Face Hub or local storage on request
  • Implements continuous batching across different adapters in the same batch
  • Manages adapter lifecycle with LRU caching and preloading
  • Exposes OpenAI-compatible and native REST APIs
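
The adapter lifecycle described above (load on request, keep recently used adapters, evict the least recently used one when the cache is full) can be sketched in a few lines. This is a toy illustration only: `AdapterCache`, `fake_loader`, and the adapter IDs are hypothetical names, and the real server tracks GPU memory rather than a simple entry count.

```python
from collections import OrderedDict
from typing import Callable, Iterable, List


class AdapterCache:
    """Toy LRU cache keyed by adapter ID (hypothetical, not LoRAX's internal class)."""

    def __init__(self, capacity: int, loader: Callable[[str], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader  # assumed to fetch weights from the Hub or local disk
        self._cache: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: List[str] = []  # kept only so the example is easy to inspect

    def preload(self, adapter_ids: Iterable[str]) -> None:
        """Warm the cache before traffic arrives (latency-sensitive adapters)."""
        for adapter_id in adapter_ids:
            self.get(adapter_id)

    def get(self, adapter_id: str) -> object:
        """Return adapter weights, loading them on a miss and evicting the LRU entry."""
        if adapter_id in self._cache:
            self._cache.move_to_end(adapter_id)  # mark as most recently used
            return self._cache[adapter_id]
        weights = self.loader(adapter_id)
        self._cache[adapter_id] = weights
        if len(self._cache) > self.capacity:
            old_id, _ = self._cache.popitem(last=False)  # drop least recently used
            self.evicted.append(old_id)
        return weights

    def resident(self) -> List[str]:
        """Adapter IDs currently cached, least recently used first."""
        return list(self._cache)


def fake_loader(adapter_id: str) -> dict:
    # Stand-in for downloading and deserializing real adapter weights.
    return {"id": adapter_id}


cache = AdapterCache(capacity=2, loader=fake_loader)
cache.preload(["org/adapter-a", "org/adapter-b"])
cache.get("org/adapter-a")  # adapter-a becomes the most recently used entry
cache.get("org/adapter-c")  # cache is full, so adapter-b is evicted
print(cache.resident(), cache.evicted)
```

Counting entries keeps the sketch short; a production cache would more plausibly budget by bytes, since adapters of different ranks have different sizes.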

Architecture Overview

LoRAX extends the text-generation-inference (TGI) architecture with a multi-adapter scheduler. The base model weights stay in GPU memory permanently, while LoRA adapter weights are loaded into a dedicated adapter cache. During batching, requests targeting different adapters are grouped together, with adapter-specific computations applied per-request during the forward pass using custom CUDA kernels.
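
To make the per-request adapter computation concrete, here is a minimal pure-Python sketch of the underlying math for one linear layer: every request shares the base weight W, and each request adds its own low-rank update B(Ax). The function names and the tiny matrices are illustrative assumptions; the actual server performs this with custom CUDA kernels over batched tensors, not Python lists.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]  # row-major: matrix[i][j]
Adapter = Tuple[Matrix, Matrix]  # (A with shape r x d_in, B with shape d_out x r)


def matvec(m: Matrix, x: Vector) -> Vector:
    """Return m @ x for a row-major matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in m]


def lora_linear(x: Vector, base_w: Matrix, adapter: Optional[Adapter],
                scale: float = 1.0) -> Vector:
    """Compute y = W x + scale * B (A x); with no adapter, just y = W x."""
    y = matvec(base_w, x)
    if adapter is None:
        return y
    a, b = adapter
    delta = matvec(b, matvec(a, x))  # project down to rank r, then back up
    return [yi + scale * di for yi, di in zip(y, delta)]


def batched_forward(batch: List[Tuple[Optional[str], Vector]], base_w: Matrix,
                    adapters: Dict[str, Adapter]) -> List[Vector]:
    """Run one batch in which each request may name a different adapter."""
    return [
        lora_linear(x, base_w, adapters[adapter_id] if adapter_id is not None else None)
        for adapter_id, x in batch
    ]


base_w = [[1.0, 0.0], [0.0, 1.0]]  # shared base weight (identity, for readability)
adapters = {
    "x": ([[1.0, 1.0]], [[1.0], [0.0]]),  # rank-1 adapter
    "y": ([[1.0, 0.0]], [[0.0], [2.0]]),  # a different rank-1 adapter
}
batch = [("x", [1.0, 2.0]), ("y", [1.0, 2.0]), (None, [1.0, 2.0])]
outputs = batched_forward(batch, base_w, adapters)
print(outputs)  # [[4.0, 2.0], [1.0, 4.0], [1.0, 2.0]]
```

The same input produces three different outputs in one batch because only the small A and B matrices differ per request, which is what lets the base model cost be shared.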

Self-Hosting & Configuration

  • Deploy via Docker with NVIDIA GPU support
  • Specify the base model ID from Hugging Face on startup
  • Adapters load dynamically via the adapter_id request parameter
  • Configure adapter cache size based on available GPU memory
  • Supports quantized base models (AWQ, GPTQ, bitsandbytes) for reduced memory
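
As a rough illustration of selecting an adapter per request, the sketch below builds (but does not send) a request for the native REST API using only the standard library. The `/generate` path and the `inputs`/`parameters` payload shape follow the TGI-style API as I understand it and should be checked against the LoRAX documentation; the host, port, prompt, and adapter ID are placeholders.

```python
import json
import urllib.request
from typing import Optional


def build_generate_request(base_url: str, prompt: str,
                           adapter_id: Optional[str] = None,
                           max_new_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request; omitting adapter_id targets the base model."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id  # adapter is chosen per request
    body = json.dumps({"inputs": prompt, "parameters": parameters}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",  # assumed endpoint path
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request(
    "http://127.0.0.1:8080",               # placeholder host and port
    "Summarize the following ticket.",     # placeholder prompt
    adapter_id="my-org/my-lora-adapter",   # placeholder adapter ID
)
print(req.full_url, json.loads(req.data))

# To send it against a running server:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp))
```

Two requests that differ only in `adapter_id` hit the same deployment and the same base weights, which is the point of the design.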

Key Features

  • Serve 1000+ adapters from a single GPU deployment, with adapters beyond the cache capacity loaded on demand
  • Hot-swap adapters without server restart or downtime
  • Heterogeneous batching mixes requests for different adapters
  • Adapter preloading for latency-sensitive use cases
  • Compatible with any Hugging Face LoRA/PEFT adapter

Comparison with Similar Tools

  • vLLM: a general-purpose LLM serving engine; LoRAX specializes in serving many adapters on a shared base model
  • TGI (Text Generation Inference): focused on serving a single model; LoRAX extends its architecture with multi-LoRA support
  • SGLang: emphasizes prefix caching; LoRAX emphasizes adapter multiplexing
  • Ollama: aimed at local, single-model use; LoRAX targets multi-tenant production serving

FAQ

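Q: How many adapters can LoRAX serve simultaneously? A: It depends on adapter size and free GPU memory. A typical rank-16 adapter takes roughly 30 MB, so the memory left on a 24 GB GPU after the base model is loaded can cache hundreds of them; adapters that do not fit are loaded on demand.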
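
The 30 MB figure can be sanity-checked with a short calculation. A LoRA update for a d_in x d_out weight matrix at rank r stores r * (d_in + d_out) parameters. The model dimensions, targeted matrices, precision, and memory budget below are assumptions chosen for illustration (a Llama-7B-like shape), not values stated by LoRAX.

```python
from typing import List, Tuple


def lora_param_count(rank: int, shapes: List[Tuple[int, int]], num_layers: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank), summed over all targets."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return per_layer * num_layers


# Assumptions: hidden size 4096, 32 layers, and LoRA applied to the four
# attention projections (q, k, v, o), each of shape 4096 x 4096.
hidden, layers = 4096, 32
targets = [(hidden, hidden)] * 4

params = lora_param_count(rank=16, shapes=targets, num_layers=layers)
adapter_bytes = params * 2              # assumption: 16-bit weights, 2 bytes each
adapter_mib = adapter_bytes / 2**20

budget_gib = 8                          # assumption: GPU memory reserved for adapters
adapters_that_fit = (budget_gib * 2**30) // adapter_bytes

print(params, adapter_mib, adapters_that_fit)  # 16777216 32.0 256
```

Under these assumptions one adapter is 32 MiB, in line with the rough figure above, and an 8 GiB adapter budget holds 256 of them. Targeting fewer matrices or using a lower rank shrinks the adapter proportionally.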

Q: Does serving multiple adapters slow down inference? A: Only slightly. The adapter computation adds a small per-token overhead, and batching requests across adapters amortizes the base model cost over the whole batch.
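
A back-of-the-envelope estimate shows why the extra arithmetic is small. For one targeted matrix, the base multiply costs about d_in * d_out multiply-adds per token, while the low-rank path costs about r * (d_in + d_out). The dimensions below are illustrative assumptions, and this counts arithmetic only: measured latency overhead can be higher because adapter weights must also be gathered and moved per request.

```python
def lora_flop_overhead(rank: int, d_in: int, d_out: int) -> float:
    """Ratio of LoRA multiply-adds to base multiply-adds for one linear layer."""
    base_macs = d_in * d_out
    lora_macs = rank * (d_in + d_out)  # A x (rank * d_in) plus B (A x) (d_out * rank)
    return lora_macs / base_macs


# Assumption: a 4096 x 4096 projection with a rank-16 adapter.
ratio = lora_flop_overhead(rank=16, d_in=4096, d_out=4096)
print(f"{ratio:.4%}")  # 0.7812%
```

Because the ratio is 2r / d for a square matrix, the relative cost falls further as the hidden size grows and rises with the adapter rank.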

Q: Can I train adapters with LoRAX? A: No. LoRAX is inference-only. Train adapters with PEFT, Axolotl, or Unsloth, then serve them with LoRAX.

Q: Which base models are supported? A: Llama, Mistral, Qwen, Gemma, Phi, and most decoder-only Hugging Face architectures.
