# Hugging Face Transformers — The Universal Library for Pretrained Models

> transformers is the de-facto Python library for using and fine-tuning pretrained models — BERT, GPT, Llama, Whisper, ViT, and 250,000+ others. One unified API works across PyTorch, TensorFlow, and JAX.

## Install

```bash
pip install transformers torch
```

## Quick Use

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Transformers is incredible.")[0])
# {'label': 'POSITIVE', 'score': 0.99}

generator = pipeline("text-generation", model="meta-llama/Llama-3.2-1B")
print(generator("The future of AI is", max_new_tokens=40)[0]["generated_text"])
```

## Introduction

Hugging Face Transformers is the most influential library in modern AI. With over 159,000 GitHub stars and roughly 40 million monthly downloads, it gives you a unified Python API to nearly every important pretrained model — language, vision, speech, video, multimodal — across PyTorch, TensorFlow, and JAX backends. When a new model drops (Llama, Mistral, Qwen, Gemma, DeepSeek, Stable Diffusion encoders, Whisper, SAM, etc.), the Transformers integration is often the reference implementation. The Hub hosts 1M+ models that load with one line of code.

## What Transformers Does

The library provides three high-level tools: **Pipelines** (one-liner inference for common tasks), **AutoModel/AutoTokenizer** (load any model and tokenizer by name), and **Trainer** (a training/fine-tuning loop with mixed precision, gradient accumulation, and multi-GPU support). It integrates with `accelerate`, `peft`, `datasets`, and `evaluate` for end-to-end ML workflows.
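The Auto classes mentioned above can also be used directly, without a pipeline. A minimal sketch, assuming the public `distilbert-base-uncased-finetuned-sst-2-english` checkpoint (the default for the sentiment pipeline) is available:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Auto classes dispatch to the right concrete classes (here DistilBert*)
# based on the checkpoint's config.json.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)

# Tokenize, run a forward pass, and map the winning logit back to a label.
inputs = tok("Transformers is incredible.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, num_labels)

label = model.config.id2label[logits.argmax(dim=-1).item()]
print(label)  # POSITIVE
```

This is exactly what the sentiment pipeline does under the hood: tokenize, forward pass, then `id2label` lookup from the model config.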
## Architecture Overview

```
from_pretrained("meta-llama/Llama-3.2-1B")
            |
[Hub Integration]
    download weights + tokenizer from huggingface.co
            |
[AutoModel / AutoTokenizer]
    model-class dispatch by config.json
            |
[Backend Choice]
    PyTorch / TensorFlow / JAX (Flax)
            |
[Pipelines / Direct Use / Trainer]
    inference, batching, generation
    training, eval, hyperparam search
            |
[Ecosystem Integrations]
    accelerate (multi-GPU), peft (LoRA), datasets (data loading),
    evaluate (metrics), bitsandbytes (4-bit), text-generation-inference (serving)
```

## Self-Hosting & Configuration

```python
# Loading a 4-bit quantized model for low-VRAM inference
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)

msg = [{"role": "user", "content": "Explain Transformers in one sentence."}]
inputs = tok.apply_chat_template(
    msg, return_tensors="pt", add_generation_prompt=True
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0], skip_special_tokens=True))

# Fine-tuning with PEFT (LoRA) — small adapters, big effect
from peft import LoraConfig, get_peft_model

lora = LoraConfig(r=8, target_modules=["q_proj", "v_proj"], lora_alpha=16, lora_dropout=0.05)
model = get_peft_model(model, lora)
# train with Trainer ...
```

## Key Features

- **One API, every model** — Llama, Qwen, Mistral, Gemma, BERT, ViT, Whisper, ...
- **Pipelines** — one-liners for sentiment, NER, QA, translation, summarization, etc.
- **Auto classes** — automatic model + tokenizer dispatch
- **Trainer** — training loop with mixed precision, multi-GPU, callbacks
- **PEFT integration** — LoRA / QLoRA / DoRA with a few lines
- **Quantization** — 8-bit and 4-bit via bitsandbytes for low-VRAM inference
- **Multi-backend** — PyTorch, TensorFlow, JAX/Flax
- **Hub-native** — push/pull models, datasets, metrics with `huggingface_hub`

## Comparison with Similar Tools

| Feature | Transformers | vLLM | TGI | llama.cpp | OpenLLM |
|---|---|---|---|---|---|
| Scope | Train + infer (broad) | Fast inference | Production serving | Local CPU/GPU | Serving (BentoML) |
| Languages | Python | Python | Python (server) | C/C++ | Python |
| Throughput | Good | Best | Best | Excellent (CPU) | Good |
| Fine-tuning | Yes (Trainer + PEFT) | No | No | No | No |
| Model breadth | All HF models | Most LLMs | LLMs (HF) | GGUF models | HF + custom |
| Best For | Research + training | LLM serving | Production serving | Edge / local | Bento ecosystem |

## FAQ

**Q: Transformers vs vLLM/TGI for production?**
A: Use Transformers for training and prototyping; switch to vLLM or TGI for production inference. They load the same model weights but serve 5–20x faster thanks to PagedAttention and continuous batching.

**Q: Do I need a GPU?**
A: Many small models (Qwen2.5-1.5B, Phi-3-mini) run on CPU. For practical 7B+ inference, a GPU with 8GB+ VRAM is recommended. Use 4-bit quantization to fit larger models on smaller GPUs.

**Q: How does it relate to the Hugging Face Hub?**
A: Transformers loads and saves directly from the Hub: `from_pretrained("user/model")`. Push trained models with `push_to_hub()`. The Hub hosts the weights, datasets, demos (Spaces), and metrics.

**Q: Is Transformers good for non-text models?**
A: Yes. Vision (ViT, DETR, Segment Anything), speech (Whisper, MMS), multimodal (CLIP, BLIP, LLaVA), and time-series (PatchTST) models all live in the same library.
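As the last answer notes, non-text tasks use the same pipeline API. A minimal vision sketch, assuming the public `google/vit-base-patch16-224` checkpoint and network access to a sample image (Pillow required):

```python
from transformers import pipeline

# Image classification with a ViT checkpoint — identical API to text pipelines.
clf = pipeline("image-classification", model="google/vit-base-patch16-224")

# Pipelines accept local paths, URLs, or PIL images; a URL is used here.
preds = clf("http://images.cocodataset.org/val2017/000000039769.jpg")
for p in preds[:3]:
    print(f"{p['label']}: {p['score']:.3f}")
```

Swapping the task string and checkpoint is all it takes to move between modalities; the pipeline handles preprocessing (image resizing and normalization here) the same way it handles tokenization for text.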
## Sources

- GitHub: https://github.com/huggingface/transformers
- Docs: https://huggingface.co/docs/transformers
- Company: Hugging Face
- License: Apache-2.0

---

Source: https://tokrepo.com/en/workflows/b0920ac9-37db-11f1-9bc6-00163e2b0d79
Author: AI Open Source