# Hugging Face Datasets — Access and Process ML Datasets at Scale

> Hugging Face Datasets is a Python library for efficiently loading, processing, and sharing machine learning datasets, with Apache Arrow-backed memory mapping, streaming support, and access to thousands of community datasets on the Hub.

## Install

```bash
pip install datasets
```

## Quick Use

```bash
python -c "from datasets import load_dataset; ds = load_dataset('imdb', split='train'); print(ds[0])"
```

## Introduction

Hugging Face Datasets is a lightweight library for accessing and manipulating machine learning datasets. It provides a unified API to load datasets from the Hugging Face Hub, local files, or in-memory objects, with smart caching and memory mapping so you can work with datasets larger than RAM without special infrastructure.

## What Hugging Face Datasets Does

- Loads thousands of public datasets from the Hugging Face Hub with a single function call
- Handles CSV, JSON, Parquet, Arrow, text, image, and audio file formats natively
- Memory-maps data using Apache Arrow so large datasets load instantly without copying into RAM
- Supports streaming mode for datasets too large to download entirely
- Provides map, filter, sort, and shuffle operations that run out-of-core on disk-backed data

## Architecture Overview

The library stores all tabular data in Apache Arrow format on disk, using memory-mapped files for zero-copy reads. The Dataset class wraps an Arrow table and exposes pandas-like operations that execute lazily where possible. A DatasetDict groups splits (train/test/validation). The Hub integration uses HTTP streaming and partial downloads. Caching is automatic and content-addressed to avoid redundant processing.

## Self-Hosting & Configuration

- Install with `pip install datasets` (optional extras for audio, image, and streaming)
- Datasets are cached in `~/.cache/huggingface/datasets` by default; set `HF_DATASETS_CACHE` to change the location
- Load Hub datasets with `load_dataset("name")` or local files with `load_dataset("csv", data_files="path")`
- Enable streaming with `load_dataset("name", streaming=True)` for on-the-fly processing
- Push processed datasets back to the Hub with `ds.push_to_hub("your-org/name")` (see the sketch below for these patterns in combination)
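A minimal sketch of how these options fit together, using the public `imdb` dataset as an example; the cache path, `data/train.csv`, and `your-org/processed-imdb` are placeholders you would replace with your own values.

```python
import os

# Optional: relocate the cache before importing datasets
# (defaults to ~/.cache/huggingface/datasets). Placeholder path.
os.environ["HF_DATASETS_CACHE"] = "/mnt/bigdisk/hf_cache"

from datasets import load_dataset

# Load a public dataset from the Hub; downloaded once, then read
# from the memory-mapped Arrow cache on subsequent calls.
ds = load_dataset("imdb", split="train")

# Load local files by naming the format and pointing at the paths.
local = load_dataset("csv", data_files="data/train.csv", split="train")

# Stream instead of downloading: returns an IterableDataset that
# yields examples on the fly.
stream = load_dataset("imdb", split="train", streaming=True)
for example in stream.take(3):
    print(example["text"][:80])

# Push a processed dataset back to the Hub (requires write access
# and a configured HF token).
# ds.push_to_hub("your-org/processed-imdb")
```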
## Key Features

- Apache Arrow backend enables memory-efficient processing of multi-GB datasets
- Streaming mode processes data without downloading the full dataset
- Built-in interoperability with pandas, NumPy, PyTorch, TensorFlow, and JAX
- Versioned dataset scripts ensure reproducibility across environments
- Community ecosystem with thousands of ready-to-use datasets

## Comparison with Similar Tools

- **TensorFlow Datasets (TFDS)** — similar concept; tightly coupled to the TensorFlow ecosystem
- **torchdata** — PyTorch data loading; lower-level, no Hub integration
- **pandas** — great for tabular data; struggles with datasets larger than RAM
- **Polars** — fast DataFrame library; not designed for ML dataset workflows
- **DVC** — version-controls data files; does not provide processing or Hub access

## FAQ

**Q: Can I load a dataset larger than my RAM?**
A: Yes. Arrow memory mapping and streaming mode both handle datasets that exceed available memory.

**Q: How do I use a private dataset from the Hub?**
A: Pass `token=True` or set the `HF_TOKEN` environment variable. You need read access to the dataset repository.

**Q: Does it support image and audio data?**
A: Yes. Install the image and audio extra dependencies (`Pillow` for images, `soundfile` for audio). Image and audio columns are decoded lazily on access.

**Q: Can I create a dataset from a pandas DataFrame?**
A: Yes. Use `Dataset.from_pandas(df)` to convert directly, preserving column types; see the sketch below.
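A rough illustration of the pandas round trip, assuming a small in-memory DataFrame (the column names here are invented for the example):

```python
import pandas as pd
from datasets import Dataset

# Build a toy DataFrame; column names are arbitrary examples.
df = pd.DataFrame({
    "text": ["great movie", "terrible plot", "just okay"],
    "label": [1, 0, 1],
})

# Convert to an Arrow-backed Dataset, preserving column dtypes.
ds = Dataset.from_pandas(df)

# Standard processing ops work on the Arrow table.
ds = ds.map(lambda ex: {"n_words": len(ex["text"].split())})
ds = ds.filter(lambda ex: ex["label"] == 1)

# Round-trip back to pandas when needed.
print(ds.to_pandas())
```

## Sources

- https://github.com/huggingface/datasets
- https://huggingface.co/docs/datasets/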