What is LoRAX — Multi-LoRA Inference Server for Fine-Tuned LLMs?

An open-source inference server by Predibase that serves thousands of fine-tuned LoRA adapters on a single base model with shared GPU memory.

LoRAX — Multi-LoRA Inference Server for Fine-Tuned LLMs

Introduction

LoRAX (LoRA eXchange) is an open-source multi-LoRA inference server that can serve thousands of fine-tuned LoRA adapters from a single shared base model. By dynamically loading and unloading adapters on demand, LoRAX removes the need to deploy a separate model instance for each fine-tuned variant, so the GPU memory for the base model is paid for once rather than once per variant.

What LoRAX Does

Serves multiple LoRA adapters from one base model deployment
Dynamically loads adapters from Hugging Face Hub or local storage on request
Uses continuous batching that can combine requests for different adapters in the same batch
Manages adapter lifecycle with LRU caching and preloading
Exposes OpenAI-compatible and native REST APIs
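
The adapter lifecycle described above (load on request, keep recently used adapters, evict the least recently used) can be sketched in a few lines of Python. This is a minimal illustration, not LoRAX's implementation: `AdapterCache`, `fake_loader`, and the `capacity` parameter are hypothetical names, and the loader stands in for a download from Hugging Face Hub or a read from local storage.

```python
from collections import OrderedDict
from typing import Callable, Dict, List


class AdapterCache:
    """Toy LRU cache for adapter weights (hypothetical, not a LoRAX class)."""

    def __init__(self, capacity: int, loader: Callable[[str], Dict[str, object]]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.loader = loader  # called only on a cache miss
        self._entries: "OrderedDict[str, Dict[str, object]]" = OrderedDict()
        self.evicted: List[str] = []  # eviction log, kept for inspection

    def get(self, adapter_id: str) -> Dict[str, object]:
        if adapter_id in self._entries:
            # Cache hit: mark the adapter as most recently used.
            self._entries.move_to_end(adapter_id)
            return self._entries[adapter_id]
        # Cache miss: load the adapter, then evict the oldest entry if full.
        weights = self.loader(adapter_id)
        self._entries[adapter_id] = weights
        if len(self._entries) > self.capacity:
            old_id, _ = self._entries.popitem(last=False)
            self.evicted.append(old_id)
        return weights

    def preload(self, adapter_ids: List[str]) -> None:
        # Warm the cache ahead of traffic for latency-sensitive adapters.
        for adapter_id in adapter_ids:
            self.get(adapter_id)

    def resident(self) -> List[str]:
        # Adapter IDs currently cached, least recently used first.
        return list(self._entries)


def fake_loader(adapter_id: str) -> Dict[str, object]:
    # Stand-in for fetching real adapter weights.
    return {"id": adapter_id}


cache = AdapterCache(capacity=2, loader=fake_loader)
cache.preload(["tenant-a", "tenant-b"])
cache.get("tenant-a")   # refreshes tenant-a
cache.get("tenant-c")   # evicts tenant-b, the least recently used
print(cache.resident(), cache.evicted)
```

In a real deployment the capacity would be expressed in bytes of GPU memory rather than a count of adapters, which is why cache size is tuned to the memory left over after the base model is loaded.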

Architecture Overview

LoRAX extends the text-generation-inference (TGI) architecture with a multi-adapter scheduler. The base model weights stay in GPU memory permanently, while LoRA adapter weights are loaded into a dedicated adapter cache. During batching, requests targeting different adapters are grouped together, with adapter-specific computations applied per-request during the forward pass using custom CUDA kernels.
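
To make the per-request adapter computation concrete, here is a minimal pure-Python sketch of one linear layer in a mixed batch. It assumes the standard LoRA formulation, where the output is the shared base projection plus a scaled low-rank update `B(Ax)`. The function names and the tiny matrices are illustrative only; a real server performs this with batched GPU kernels rather than a Python loop.

```python
from typing import Dict, List, Optional, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def forward_batch(
    base_w: Matrix,
    adapters: Dict[str, Tuple[Matrix, Matrix, float]],
    batch: List[Tuple[Optional[str], Vector]],
) -> List[Vector]:
    """Apply one shared base layer plus each request's own LoRA update.

    adapters maps adapter_id -> (A, B, scale), with A of shape (r, d_in)
    and B of shape (d_out, r). A request with adapter_id None uses the
    base model only.
    """
    outputs = []
    for adapter_id, x in batch:
        y = matvec(base_w, x)  # base weights are shared by every request
        if adapter_id is not None:
            a, b, scale = adapters[adapter_id]
            delta = matvec(b, matvec(a, x))  # low-rank path: B(Ax)
            y = [yi + scale * di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


base_w = [[1.0, 0.0], [0.0, 1.0]]  # identity, to keep the numbers readable
adapters = {
    "adapter-x": ([[1.0, 0.0]], [[0.0], [1.0]], 2.0),  # rank 1
    "adapter-y": ([[0.0, 1.0]], [[1.0], [0.0]], 1.0),  # rank 1
}
batch = [("adapter-x", [3.0, 4.0]), ("adapter-y", [3.0, 4.0]), (None, [3.0, 4.0])]
print(forward_batch(base_w, adapters, batch))
# -> [[3.0, 10.0], [7.0, 4.0], [3.0, 4.0]]
```

The same input produces three different outputs in a single batch, which is the property that lets requests for different adapters share one pass through the base model.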

Self-Hosting & Configuration

Deploy via Docker with NVIDIA GPU support
Specify the base model ID from Hugging Face on startup
Adapters load dynamically via the adapter_id request parameter
Configure adapter cache size based on available GPU memory
Supports quantized base models (AWQ, GPTQ, bitsandbytes) for reduced memory
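
As a rough sketch of selecting an adapter per request, the snippet below builds (but does not send) an HTTP request that carries `adapter_id`. The `/generate` path, the `inputs`/`parameters` envelope, the host and port, and the adapter name `my-org/my-adapter` are assumptions for illustration; check them against the API documentation for the version you deploy.

```python
import json
import urllib.request
from typing import Optional


def build_generate_request(
    base_url: str,
    prompt: str,
    adapter_id: Optional[str] = None,
    max_new_tokens: int = 64,
) -> urllib.request.Request:
    """Build a POST request; omit adapter_id to target the base model."""
    parameters = {"max_new_tokens": max_new_tokens}
    if adapter_id is not None:
        parameters["adapter_id"] = adapter_id  # selects the LoRA adapter
    body = json.dumps({"inputs": prompt, "parameters": parameters})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",  # assumed endpoint path
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_generate_request(
    "http://127.0.0.1:8080",  # assumed host and port
    "Classify the sentiment: great product",
    adapter_id="my-org/my-adapter",  # hypothetical adapter name
)
print(req.full_url)
print(req.data.decode("utf-8"))
# To send it against a running server: urllib.request.urlopen(req)
```

Because the adapter is chosen per request, two clients can hit the same server with different `adapter_id` values and neither needs a restart or a separate deployment.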

Key Features

Serve 1000+ adapters from a single GPU deployment, with only the recently used subset held in GPU memory at any time
Hot-swap adapters without server restart or downtime
Heterogeneous batching mixes requests for different adapters in one batch
Adapter preloading for latency-sensitive use cases
Compatible with any Hugging Face LoRA/PEFT adapter

Comparison with Similar Tools

vLLM — general LLM serving; LoRAX specializes in multi-adapter serving with shared base models
TGI (Text Generation Inference) — single-model focus; LoRAX extends it with multi-LoRA support
SGLang — prefix caching; LoRAX focuses on adapter multiplexing
Ollama — local single-model use; LoRAX targets multi-tenant production serving

FAQ

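Q: How many adapters can LoRAX serve simultaneously? A: It depends on adapter size and on the GPU memory left after the base model is loaded. Typical rank-16 adapters use roughly 30 MB each, so a 24 GB GPU can cache up to a few hundred. Adapters beyond the cache size are still servable: they are loaded on demand and the least recently used ones are evicted.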
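
The 30 MB figure can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes a 7B-class model shape (hidden size 4096, 32 layers), LoRA applied to four square attention projections per layer, and 16-bit weights; these values and the 8 GiB cache budget are illustrative assumptions, not measurements from LoRAX.

```python
def lora_adapter_bytes(
    hidden_size: int,
    num_layers: int,
    rank: int,
    modules_per_layer: int,
    bytes_per_param: int = 2,  # fp16 or bf16
) -> int:
    """Approximate adapter size, assuming square hidden x hidden projections."""
    # Each adapted module adds A (rank x hidden) and B (hidden x rank).
    params_per_module = rank * (hidden_size + hidden_size)
    total_params = params_per_module * modules_per_layer * num_layers
    return total_params * bytes_per_param


def adapters_that_fit(cache_budget_bytes: int, adapter_bytes: int) -> int:
    """How many adapters of one size fit in a given cache budget."""
    return cache_budget_bytes // adapter_bytes


MIB = 1024 ** 2
GIB = 1024 ** 3

size = lora_adapter_bytes(hidden_size=4096, num_layers=32, rank=16, modules_per_layer=4)
print(size / MIB)                          # 32.0 MiB per adapter
print(adapters_that_fit(8 * GIB, size))    # 256 adapters in an 8 GiB budget
```

Targeting fewer modules, using a lower rank, or quantizing the base model to free more memory all raise the number of adapters that can stay resident.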

Q: Does serving multiple adapters slow down inference? A: Only slightly. The adapter computation adds a small per-token overhead, and batching across adapters amortizes the cost of the base model forward pass over all requests in the batch.

Q: Can I train adapters with LoRAX? A: No. LoRAX is inference-only. Train adapters with PEFT, Axolotl, or Unsloth, then serve them with LoRAX.

Q: Which base models are supported? A: Llama, Mistral, Qwen, Gemma, Phi, and most decoder-only Hugging Face architectures.
