# LoRAX: Multi-LoRA Inference Server for Fine-Tuned LLMs

> An open-source inference server by Predibase that serves thousands of fine-tuned LoRA adapters on a single base model with shared GPU memory.

## Quick Use

```bash
# Start the server with a base model from Hugging Face:
docker run --gpus all -p 8080:80 \
  -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN \
  ghcr.io/predibase/lorax:latest \
  --model-id meta-llama/Meta-Llama-3-8B

# Request with a LoRA adapter:
curl localhost:8080/generate \
  -H 'Content-Type: application/json' \
  -d '{"inputs":"Hello","parameters":{"adapter_id":"my-org/my-lora"}}'
```

## Introduction

LoRAX (LoRA eXchange) is an open-source multi-LoRA inference server that can serve thousands of fine-tuned LoRA adapters from a single shared base model. It loads and unloads adapters on demand, so you do not need a separate model deployment for each fine-tuned variant. The base model weights are held in GPU memory once rather than once per variant, which substantially reduces GPU cost.

## What LoRAX Does

- Serves multiple LoRA adapters from one base model deployment
- Dynamically loads adapters from Hugging Face Hub or local storage on request
- Implements continuous batching across different adapters in the same batch
- Manages adapter lifecycle with LRU caching and preloading
- Exposes OpenAI-compatible and native REST APIs

## Architecture Overview

LoRAX extends the text-generation-inference (TGI) architecture with a multi-adapter scheduler. The base model weights stay in GPU memory permanently, while LoRA adapter weights are loaded into a dedicated adapter cache. During batching, requests that target different adapters are grouped into the same batch, and custom CUDA kernels apply each request's adapter-specific computation during the forward pass.
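The adapter cache and per-adapter grouping described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not LoRAX's actual implementation: `AdapterCache`, `load_fn`, and `group_by_adapter` are hypothetical names, the "weights" are placeholder strings, and eviction is plain least-recently-used.

```python
from collections import OrderedDict, defaultdict
from typing import Callable, Dict, List, Tuple


class AdapterCache:
    """Toy LRU cache standing in for a GPU-resident adapter cache (hypothetical)."""

    def __init__(self, capacity: int, load_fn: Callable[[str], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn  # hypothetical loader (e.g. Hub or local disk)
        self._entries: "OrderedDict[str, object]" = OrderedDict()
        self.evicted: List[str] = []  # record of evicted adapter IDs, for inspection

    def get(self, adapter_id: str) -> object:
        if adapter_id in self._entries:
            self._entries.move_to_end(adapter_id)  # mark as most recently used
            return self._entries[adapter_id]
        weights = self.load_fn(adapter_id)  # cache miss: load on demand
        self._entries[adapter_id] = weights
        if len(self._entries) > self.capacity:
            old_id, _ = self._entries.popitem(last=False)  # drop least recently used
            self.evicted.append(old_id)
        return weights

    def cached_ids(self) -> List[str]:
        return list(self._entries)


def group_by_adapter(requests: List[Tuple[str, str]]) -> Dict[str, List[int]]:
    """Map each adapter_id to the batch positions of the requests that use it."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for position, (adapter_id, _prompt) in enumerate(requests):
        groups[adapter_id].append(position)
    return dict(groups)


# One mixed batch: (adapter_id, prompt) pairs targeting three different adapters.
batch = [
    ("org/lora-a", "Hello"),
    ("org/lora-b", "Hi"),
    ("org/lora-a", "Hey"),
    ("org/lora-c", "Yo"),
]
cache = AdapterCache(capacity=2, load_fn=lambda adapter_id: f"weights for {adapter_id}")
groups = group_by_adapter(batch)
for adapter_id in groups:
    cache.get(adapter_id)  # ensure each adapter in the batch is resident
```

With a capacity of two, loading a third adapter evicts the one used least recently. A real server would also have to avoid evicting adapters that in-flight requests still need, which this sketch ignores.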
## Self-Hosting & Configuration

- Deploy via Docker with NVIDIA GPU support
- Specify the base model ID from Hugging Face on startup
- Adapters load dynamically via the `adapter_id` request parameter
- Configure adapter cache size based on available GPU memory
- Supports quantized base models (AWQ, GPTQ, bitsandbytes) for reduced memory

## Key Features

- Serve 1000+ adapters from a single GPU deployment
- Hot-swap adapters without server restart or downtime
- Heterogeneous batching mixes requests for different adapters
- Adapter preloading for latency-sensitive use cases
- Compatible with any Hugging Face LoRA/PEFT adapter

## Comparison with Similar Tools

- **vLLM**: general LLM serving; LoRAX specializes in multi-adapter serving with shared base models
- **TGI (Text Generation Inference)**: single-model focus; LoRAX extends it with multi-LoRA support
- **SGLang**: prefix caching; LoRAX focuses on adapter multiplexing
- **Ollama**: local single-model use; LoRAX targets multi-tenant production serving

## FAQ

**Q: How many adapters can LoRAX serve simultaneously?**

A: Depends on adapter size and GPU memory. Typical rank-16 adapters use roughly 30 MB each, so a 24 GB GPU can cache hundreds.

**Q: Does serving multiple adapters slow down inference?**

A: Minimally. The adapter computation adds small overhead per token. Batching across adapters amortizes the base model cost.

**Q: Can I train adapters with LoRAX?**

A: No. LoRAX is inference-only. Train adapters with PEFT, Axolotl, or Unsloth, then serve them with LoRAX.

**Q: Which base models are supported?**

A: Llama, Mistral, Qwen, Gemma, Phi, and most decoder-only Hugging Face architectures.

## Sources

- https://github.com/predibase/lorax
- https://docs.predibase.com/lorax/

---

Source: https://tokrepo.com/en/workflows/asset-1ff25e23
Author: AI Open Source