Introduction
bitsandbytes provides custom CUDA functions for quantized training and inference. It is the engine behind QLoRA and 4-bit model loading in Hugging Face Transformers, letting users run and fine-tune billion-parameter models on a single consumer GPU.
What bitsandbytes Does
- Provides 8-bit optimizers (Adam, AdamW, SGD) that reduce optimizer memory by 75%
- Implements 4-bit NormalFloat (NF4) and FP4 quantization for LLM inference and fine-tuning
- Enables QLoRA training by combining 4-bit base weights with LoRA adapters
- Offers 8-bit matrix multiplication kernels for linear layers
- Supports multi-GPU setups and works with Hugging Face Accelerate
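The 75% figure for 8-bit optimizers follows from simple arithmetic: Adam keeps two state tensors (momentum and variance) per parameter, and shrinking each state from 32 bits to 8 bits cuts that memory by three quarters. A back-of-the-envelope sketch for an illustrative 7B-parameter model:

```python
# Illustrative optimizer-state arithmetic; the 7B parameter count is an
# example, not tied to any specific model.
params = 7_000_000_000

adam_32bit_bytes = params * 2 * 4   # two fp32 states, 4 bytes each
adam_8bit_bytes  = params * 2 * 1   # two int8 states, 1 byte each

savings = 1 - adam_8bit_bytes / adam_32bit_bytes
print(f"32-bit Adam state: {adam_32bit_bytes / 1e9:.0f} GB")
print(f"8-bit Adam state:  {adam_8bit_bytes / 1e9:.0f} GB")
print(f"Savings: {savings:.0%}")
```

For a 7B model this works out to 56 GB of fp32 optimizer state versus 14 GB in 8-bit, independent of the precision of the weights themselves.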
Architecture Overview
bitsandbytes wraps custom CUDA kernels that perform blockwise quantization. For 8-bit inference, weight matrices are decomposed into an Int8 component and a small Float16 outlier matrix. For 4-bit, weights use NormalFloat or FP4 data types with double quantization of scaling constants, cutting memory further. The Python layer provides drop-in replacements for torch.nn.Linear and standard optimizers.
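To make the blockwise idea concrete, here is a minimal pure-Python sketch of absmax quantization to int8. This is an illustration of the principle, not the actual bitsandbytes kernel (which runs on the GPU in CUDA): each block is scaled by its own absolute maximum, so a single outlier only distorts the block it lives in rather than the whole tensor.

```python
# Toy blockwise absmax quantization (illustrative; real kernels are CUDA).

def quantize_blockwise(weights, block_size=4):
    """Quantize a flat list of floats to int8 codes plus one scale per block."""
    codes, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        absmax = max(abs(w) for w in block) or 1.0  # guard all-zero blocks
        scales.append(absmax)
        codes.extend(round(w / absmax * 127) for w in block)
    return codes, scales

def dequantize_blockwise(codes, scales, block_size=4):
    """Recover approximate floats from int8 codes and per-block scales."""
    return [c / 127 * scales[i // block_size] for i, c in enumerate(codes)]

# The outlier 3.0 only affects its own block's resolution.
w = [0.1, -0.5, 0.25, 0.05, 3.0, 0.2, -0.1, 0.15]
codes, scales = quantize_blockwise(w)
w_hat = dequantize_blockwise(codes, scales)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
```

With a global (non-blockwise) scale, the 3.0 outlier would stretch the quantization range for every weight; blockwise scaling keeps the small-magnitude blocks at full int8 resolution.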
Self-Hosting & Configuration
- Requires CUDA 11.0+ and a compatible NVIDIA GPU
- Install via pip; pre-built wheels available for major CUDA versions
- Experimental multi-backend support for AMD ROCm and Intel CPUs
- Control quantization type via bnb.nn.Linear4bit(compute_dtype, quant_type)
- Integrates with transformers via BitsAndBytesConfig for one-line setup
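The "one-line setup" looks roughly like the following sketch. It assumes the transformers and bitsandbytes packages plus a CUDA GPU; the model id is only an example.

```python
# Sketch of 4-bit loading via transformers (requires a CUDA GPU and the
# transformers + bitsandbytes packages; the model id is an example).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 (alternative: "fp4")
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmul compute
    bnb_4bit_use_double_quant=True,         # also quantize the scaling constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example model id
    quantization_config=bnb_config,
    device_map="auto",
)
```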
Key Features
- QLoRA support for fine-tuning 65B+ models on a single 48GB GPU
- Blockwise 8-bit optimizers with dynamic exponent handling
- NF4 and FP4 data types optimized for normal weight distributions
- Double quantization to reduce quantization constant overhead
- Native integration with Hugging Face Transformers and PEFT
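The double-quantization saving can be quantified with the defaults from the QLoRA paper: 4-bit weights in blocks of 64 with one fp32 scale per block cost 0.5 extra bits per parameter; quantizing those scales to 8 bits, with one fp32 second-level constant per 256 first-level scales, cuts the overhead to roughly 0.127 bits per parameter.

```python
# Scale-constant overhead per weight, following the QLoRA paper's defaults:
# first-level blocks of 64 weights, second-level blocks of 256 scales.
block_size = 64

# Without double quantization: one fp32 scale (32 bits) per 64-weight block.
plain_bits_per_param = 32 / block_size

# With double quantization: 8-bit scales plus one fp32 second-level
# constant per 256 first-level scales.
dq_bits_per_param = 8 / block_size + 32 / (block_size * 256)

saved = plain_bits_per_param - dq_bits_per_param
print(f"{plain_bits_per_param:.3f} -> {dq_bits_per_param:.3f} "
      f"bits/param of scale overhead ({saved:.3f} bits/param saved)")
```

About 0.37 bits per parameter may sound small, but over tens of billions of weights it frees roughly 3 GB on a 65B model.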
Comparison with Similar Tools
- GPTQ — Post-training quantization via calibration; bitsandbytes is simpler and supports training
- AWQ — Activation-aware quantization with similar accuracy; requires a calibration step
- llama.cpp — CPU-focused GGUF quantization for inference only
- AutoGPTQ — Wraps GPTQ for Hugging Face; bitsandbytes has tighter ecosystem integration
FAQ
Q: What is QLoRA? A: QLoRA loads a model in 4-bit precision with bitsandbytes and trains small LoRA adapter weights in 16-bit, dramatically reducing memory requirements.
Q: Do I need an NVIDIA GPU? A: NVIDIA is best supported. Experimental AMD ROCm and Intel CPU backends are available but less mature.
Q: How much memory does 4-bit save? A: A 7B model drops from roughly 14 GB (FP16) to about 4 GB in 4-bit, plus overhead for activations.
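The FAQ figures above follow directly from bytes-per-weight arithmetic; the weights alone come to 3.5 GB in 4-bit, and quantization constants plus runtime overhead bring the practical footprint to about 4 GB.

```python
# Weights-only memory for a 7B model; activations, KV cache, and
# quantization constants add overhead on top of these figures.
params = 7_000_000_000

fp16_gb = params * 2 / 1e9    # 2 bytes per weight
nf4_gb  = params * 0.5 / 1e9  # 4 bits = 0.5 bytes per weight

print(f"FP16: {fp16_gb:.0f} GB, NF4: {nf4_gb:.1f} GB")
```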
Q: Is accuracy affected? A: NF4 quantization preserves most accuracy. Benchmark evaluations typically show less than 1% degradation versus FP16 on standard tasks.