# bitsandbytes — Accessible Large Language Model Quantization

> Lightweight CUDA library for 8-bit and 4-bit quantization, enabling fine-tuning and inference of large models on consumer GPUs with minimal accuracy loss.

## Install

```bash
pip install bitsandbytes
```

## Quick Use

```python
import torch
import bitsandbytes as bnb

# Drop-in 8-bit replacement for torch.nn.Linear
linear = bnb.nn.Linear8bitLt(1024, 512, has_fp16_weights=False)

# Or load a 4-bit quantized model through transformers:
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", load_in_4bit=True
)
```

## Introduction

bitsandbytes provides custom CUDA functions for quantized training and inference. It is the engine behind QLoRA and 4-bit model loading in Hugging Face Transformers, letting users run and fine-tune billion-parameter models on a single consumer GPU.

## What bitsandbytes Does

- Provides 8-bit optimizers (Adam, AdamW, SGD) that reduce optimizer memory by roughly 75% (see the sketches at the end of this page)
- Implements 4-bit NormalFloat (NF4) quantization for LLM inference
- Enables QLoRA training by combining 4-bit base weights with LoRA adapters
- Offers 8-bit matrix multiplication kernels for linear layers
- Supports multi-GPU setups and works with Hugging Face Accelerate

## Architecture Overview

bitsandbytes wraps custom CUDA kernels that perform blockwise quantization. For 8-bit inference, weight matrices are decomposed into an Int8 component and a small Float16 outlier matrix. For 4-bit, weights use the NormalFloat (NF4) or FP4 data type, with double quantization of the scaling constants cutting memory further. The Python layer provides drop-in replacements for torch.nn.Linear and the standard optimizers.

## Self-Hosting & Configuration

- Requires CUDA 11.0+ and a compatible NVIDIA GPU
- Install via pip; pre-built wheels are available for the major CUDA versions
- Multi-backend support has been added for AMD ROCm and Intel CPUs
- Control the quantization type via bnb.nn.Linear4bit(compute_dtype, quant_type), sketched below
- Integrates with transformers via BitsAndBytesConfig for one-line setup

## Key Features

- QLoRA support for fine-tuning 65B+ models on a single 48 GB GPU
- Blockwise 8-bit optimizers with dynamic exponent handling
- NF4 and FP4 data types optimized for normally distributed weights
- Double quantization to reduce the overhead of quantization constants
- Native integration with Hugging Face Transformers and PEFT

## Comparison with Similar Tools

- **GPTQ** — Post-training quantization via calibration; bitsandbytes is simpler and supports training
- **AWQ** — Activation-aware quantization with similar accuracy; requires a calibration step
- **llama.cpp** — CPU-focused GGUF quantization for inference only
- **AutoGPTQ** — Wraps GPTQ for Hugging Face; bitsandbytes has tighter ecosystem integration

## FAQ

**Q: What is QLoRA?**
A: QLoRA loads a model in 4-bit precision with bitsandbytes and trains small LoRA adapter weights in 16-bit, dramatically reducing memory requirements.

**Q: Do I need an NVIDIA GPU?**
A: NVIDIA is best supported. Experimental AMD ROCm and Intel CPU backends are available but less mature.

**Q: How much memory does 4-bit save?**
A: A 7B model drops from roughly 14 GB (FP16) to about 4 GB in 4-bit, plus overhead for activations.

**Q: Is accuracy affected?**
A: NF4 quantization preserves most accuracy. Benchmark evaluations typically show less than 1% degradation versus FP16 on standard tasks.
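## Usage Sketches

The 8-bit optimizers listed under "What bitsandbytes Does" are drop-in replacements for their torch.optim counterparts. A minimal sketch, assuming a toy model; the layer sizes and learning rate are illustrative, not values from the docs:

```python
import torch
import bitsandbytes as bnb

# Toy model; any torch.nn.Module works the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.ReLU(),
    torch.nn.Linear(1024, 10),
).cuda()

# Adam8bit keeps optimizer state in 8-bit with blockwise
# quantization, cutting optimizer memory by roughly 75%
# versus 32-bit Adam.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
loss = model(x).sum()  # placeholder loss for the sketch
loss.backward()
optimizer.step()
optimizer.zero_grad()
```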
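bnb.nn.Linear4bit, mentioned in the configuration section, takes a compute dtype and a quant type. A minimal sketch; the dimensions are arbitrary, and NF4 is chosen per the Key Features section:

```python
import torch
import bitsandbytes as bnb

# 4-bit linear layer: weights are stored in NF4, while the
# matmul runs in the chosen compute dtype (here bfloat16).
layer = bnb.nn.Linear4bit(
    1024, 512,
    compute_dtype=torch.bfloat16,
    quant_type="nf4",  # or "fp4"
).cuda()  # weights are quantized when moved to the GPU

x = torch.randn(4, 1024, dtype=torch.bfloat16, device="cuda")
out = layer(x)  # shape: (4, 512)
```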
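BitsAndBytesConfig is the one-line transformers setup mentioned above. A sketch combining the NF4, double-quantization, and compute-dtype options from the Key Features list; the model ID is only an example:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 weights plus double quantization of the scaling
# constants, with matmuls computed in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example model id
    quantization_config=bnb_config,
    device_map="auto",
)
```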
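QLoRA, per the FAQ, freezes the 4-bit base weights and trains 16-bit LoRA adapters, typically wired up through the PEFT library. A sketch that assumes the 4-bit `model` from the previous block; the LoRA hyperparameters and target module names are illustrative, not prescribed by bitsandbytes:

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Freeze the 4-bit base weights and prepare them for k-bit training.
model = prepare_model_for_kbit_training(model)

# Small trainable LoRA adapters; r, alpha, and target_modules
# are example values.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapters are trainable
```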
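The FAQ's memory figures follow from simple byte arithmetic; a quick back-of-the-envelope check for a 7B-parameter model, counting weights only:

```python
params = 7e9  # 7B parameters

fp16_gb = params * 2 / 1e9    # 2 bytes per weight -> 14.0 GB
nf4_gb = params * 0.5 / 1e9   # 4 bits per weight  -> 3.5 GB

print(f"FP16: {fp16_gb:.1f} GB, NF4 weights: {nf4_gb:.1f} GB")
# Quantization constants and activations push the real 4-bit
# footprint to roughly 4 GB, matching the FAQ figure.
```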
## Sources

- https://github.com/bitsandbytes-foundation/bitsandbytes
- https://huggingface.co/docs/bitsandbytes