# Hugging Face Transformers — The Universal Library for Pretrained Models

> transformers is the de-facto Python library for using and fine-tuning pretrained models — BERT, GPT, Llama, Whisper, ViT, and 250,000+ others. One unified API works across PyTorch, TensorFlow, and JAX.

## Install

```bash
pip install transformers torch
```

## Quick Use

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers is incredible.")[0])
# {'label': 'POSITIVE', 'score': 0.99}

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
print(generator("The future of AI is", max_new_tokens=40)[0]["generated_text"])
```

## Introduction

Hugging Face Transformers is the most influential library in modern AI. With over 159,000 GitHub stars and roughly 40 million monthly downloads, it gives you a unified Python API to nearly every important pretrained model — language, vision, speech, video, multimodal — across PyTorch, TensorFlow, and JAX backends. When a new model drops (Llama, Mistral, Qwen, Gemma, DeepSeek, Stable Diffusion encoders, Whisper, SAM, etc.), the Transformers integration is often the reference implementation. The Hub hosts 1M+ models that load with one line of code.

## What Transformers Does

The library provides three high-level tools: **Pipelines** (one-liner inference for common tasks), **AutoModel/AutoTokenizer** (load any model and tokenizer by name), and **Trainer** (a training/fine-tuning loop with mixed precision, gradient accumulation, and multi-GPU support). It integrates with `accelerate`, `peft`, `datasets`, and `evaluate` for end-to-end ML workflows.
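The Auto classes mentioned above can also be used directly, without a pipeline. A minimal sketch, assuming the public `distilbert-base-uncased-finetuned-sst-2-english` checkpoint (the default for the sentiment pipeline) is available:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Auto classes dispatch to the right concrete classes (here DistilBert*)
# based on the checkpoint's config.json.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Tokenize, run a forward pass, and map the winning logit back to a label.
inputs = tok("Transformers is incredible.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)

label = model.config.id2label[logits.argmax(dim=-1).item()]
print(label)  # POSITIVE
```

This is exactly what the sentiment pipeline does under the hood: tokenize, forward pass, then `id2label` lookup from the model config.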
## Architecture Overview

```
from_pretrained("meta-llama/Llama-3.2-1B")
            |
[Hub Integration]
    download weights + tokenizer from huggingface.co
            |
[AutoModel / AutoTokenizer]
    model-class dispatch by config.json
            |
[Backend Choice]
    PyTorch / TensorFlow / JAX (Flax)
            |
[Pipelines / Direct Use / Trainer]
    inference, batching, generation
    training, eval, hyperparam search
            |
[Ecosystem Integrations]
    accelerate (multi-GPU), peft (LoRA), datasets (data loading),
    evaluate (metrics), bitsandbytes (4-bit), text-generation-inference (serving)
```

## Self-Hosting & Configuration

```python
# Loading a 4-bit quantized model for low-VRAM inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

msg = [{"role": "user", "content": "Explain Transformers in one sentence."}]
inputs = tok.apply_chat_template(
    msg, return_tensors="pt", add_generation_prompt=True
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))

# Fine-tuning with PEFT (LoRA) — small adapters, big effect
from peft import LoraConfig, get_peft_model

lora = LoraConfig(r=8, target_modules=["q_proj", "v_proj"], lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora)
# train with Trainer ...
```

## Key Features

- **One API, every model** — Llama, Qwen, Mistral, Gemma, BERT, ViT, Whisper, ...
- **Pipelines** — one-liners for sentiment, NER, QA, translation, summarization, etc.
- **Auto classes** — automatic model + tokenizer dispatch
- **Trainer** — training loop with mixed precision, multi-GPU, callbacks
- **PEFT integration** — LoRA / QLoRA / DoRA with a few lines
- **Quantization** — 8-bit and 4-bit via bitsandbytes for low-VRAM inference
- **Multi-backend** — PyTorch, TensorFlow, JAX/Flax
- **Hub-native** — push/pull models, datasets, metrics with `huggingface_hub`

## Comparison with Similar Tools

| Feature | Transformers | vLLM | TGI | llama.cpp | OpenLLM |
|---|---|---|---|---|---|
| Scope | Train + infer (broad) | Fast inference | Production serving | Local CPU/GPU | Serving (BentoML) |
| Languages | Python | Python | Python (server) | C/C++ | Python |
| Throughput | Good | Best | Best | Excellent (CPU) | Good |
| Fine-tuning | Yes (Trainer + PEFT) | No | No | No | No |
| Model breadth | All HF models | Most LLMs | LLMs (HF) | GGUF models | HF + custom |
| Best For | Research + training | LLM serving | Production serving | Edge / local | Bento ecosystem |

## FAQ

**Q: Transformers vs vLLM/TGI for production?**
A: Use Transformers for training and prototyping; switch to vLLM or TGI for production inference. They load the same model weights but serve 5–20x faster thanks to PagedAttention and continuous batching.

**Q: Do I need a GPU?**
A: Many small models (Qwen2.5-1.5B, Phi-3-mini) run on CPU. For practical 7B+ inference, a GPU with 8GB+ VRAM is recommended. Use 4-bit quantization to fit larger models on smaller GPUs.

**Q: How does it relate to the Hugging Face Hub?**
A: Transformers loads and saves directly from the Hub: `from_pretrained("user/model")`. Push trained models with `push_to_hub()`. The Hub hosts the weights, datasets, demos (Spaces), and metrics.

**Q: Is Transformers good for non-text models?**
A: Yes. Vision (ViT, DETR, Segment Anything), speech (Whisper, MMS), multimodal (CLIP, BLIP, LLaVA), and time-series (PatchTST) models all live in the same library.
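As the last answer notes, non-text tasks use the same pipeline API. A minimal vision sketch, assuming the public `google/vit-base-patch16-224` checkpoint and network access to a sample image (Pillow required):

```python
from transformers import pipeline

# Image classification with a ViT checkpoint — identical API to text pipelines.
clf = pipeline("image-classification", model="google/vit-base-patch16-224")

# Pipelines accept local paths, URLs, or PIL images; a URL is used here.
preds = clf("http://images.cocodataset.org/val2017/000000039769.jpg")
for p in preds[:3]:
    print(f"{p['label']}: {p['score']:.3f}")
```

Swapping the task string and checkpoint is all it takes to move between modalities; the pipeline handles preprocessing (image resizing and normalization here) the same way it handles tokenization for text.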
## Sources

- GitHub: https://github.com/huggingface/transformers
- Docs: https://huggingface.co/docs/transformers
- Company: Hugging Face
- License: Apache-2.0

---

Source: https://tokrepo.com/en/workflows/b0920ac9-37db-11f1-9bc6-00163e2b0d79
Author: AI Open Source