Configs · May 13, 2026 · 3 min read

ggml — Lightweight Tensor Library for Machine Learning in C

ggml is a pure C tensor library optimized for running machine learning models on CPUs and edge devices, providing the foundational compute layer used by llama.cpp, whisper.cpp, and other popular local AI inference tools.

Introduction

ggml is a tensor computation library written in C that focuses on efficient CPU inference for machine learning models. It is the engine behind llama.cpp and whisper.cpp, enabling millions of users to run large language models and speech recognition locally without requiring a GPU.

What ggml Does

  • Provides tensor operations optimized for CPU inference (AVX, AVX2, AVX-512, ARM NEON)
  • Supports integer quantization formats (Q4, Q5, Q8) to reduce memory usage (a block-quantization sketch follows this list)
  • Implements automatic differentiation for training small models
  • Offers a computation graph API for defining and executing model architectures
  • Powers the GGUF model format used across the local AI ecosystem
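
As a sketch of how those block-quantization formats work, here is a Q8_0-style quantizer: 32 weights per block share one scale, and each weight is stored as a signed byte. This is illustrative only, not ggml's actual kernel, and it uses an fp32 scale where ggml stores fp16:

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>

    #define QK 32  // weights per block, as in ggml's Q8_0

    // One quantized block: a shared scale and 32 int8 values.
    typedef struct {
        float  d;        // scale: original value ~= d * qs[i]
        int8_t qs[QK];
    } block_q8;

    static void quantize_block(const float *x, block_q8 *b) {
        // Find the largest magnitude in the block.
        float amax = 0.0f;
        for (int i = 0; i < QK; i++) {
            float ax = fabsf(x[i]);
            if (ax > amax) amax = ax;
        }
        // Map [-amax, amax] onto [-127, 127].
        b->d = amax / 127.0f;
        float id = b->d != 0.0f ? 1.0f / b->d : 0.0f;
        for (int i = 0; i < QK; i++) {
            b->qs[i] = (int8_t)roundf(x[i] * id);
        }
    }

    int main(void) {
        float x[QK];
        for (int i = 0; i < QK; i++) x[i] = 0.1f * (i - 16);
        block_q8 b;
        quantize_block(x, &b);
        // Dequantize one value to show the round-trip error.
        printf("x[0] = %f, restored = %f\n", x[0], b.d * b.qs[0]);
        return 0;
    }

The Q4 variants apply the same per-block-scale idea but pack two 4-bit values per byte, which is where most of the memory savings come from.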

Architecture Overview

ggml represents a computation as a directed acyclic graph of tensor operations. Users build the graph by chaining operations, then execute it in a single pass. Memory is managed through a pre-allocated context arena and scratch buffers, which keeps per-tensor heap allocations to a minimum. Quantization kernels are hand-optimized C with SIMD intrinsics for each target architecture, achieving high throughput without GPU dependencies.
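
A minimal sketch of that flow, assuming the current ggml C API (entry points such as ggml_graph_compute_with_ctx have shifted across versions, so treat this as illustrative rather than canonical):

    #include <stdio.h>
    #include "ggml.h"

    int main(void) {
        // All tensors live in a pre-allocated context arena.
        struct ggml_init_params params = {
            .mem_size   = 16 * 1024 * 1024,  // 16 MB arena
            .mem_buffer = NULL,              // let ggml allocate it
            .no_alloc   = false,
        };
        struct ggml_context *ctx = ggml_init(params);

        // Define the graph: y = W * x. Nothing is computed yet;
        // each call only records a node in the graph.
        struct ggml_tensor *W = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, 4, 2);
        struct ggml_tensor *x = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
        struct ggml_tensor *y = ggml_mul_mat(ctx, W, x);

        // Fill the inputs.
        ggml_set_f32(W, 1.0f);
        ggml_set_f32(x, 2.0f);

        // Build and execute the computation graph in a single pass.
        struct ggml_cgraph *gf = ggml_new_graph(ctx);
        ggml_build_forward_expand(gf, y);
        ggml_graph_compute_with_ctx(ctx, gf, /*n_threads=*/4);

        printf("y[0] = %f\n", ggml_get_f32_1d(y, 0));  // expect 8.0
        ggml_free(ctx);
        return 0;
    }

Because ggml_mul_mat only records a node, the inputs can be filled after the graph is defined; only ggml_graph_compute_with_ctx actually runs the kernels.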

Self-Hosting & Configuration

  • Build with CMake on Linux, macOS, or Windows
  • No external dependencies beyond a C compiler
  • Enable BLAS backends (OpenBLAS, Apple Accelerate) for matrix multiply acceleration
  • Optional CUDA and Metal backends for GPU offloading
  • Configure quantization level based on available RAM vs. quality tradeoff
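
As a rough guide, a CPU-only build needs nothing beyond CMake and a C compiler, while the backends are opt-in flags. The flag names have changed across releases (older trees used CUBLAS-style spellings), so check the README of your checkout:

    cmake -B build                      # plain CPU build, no external dependencies
    cmake --build build --config Release

    cmake -B build -DGGML_BLAS=ON       # BLAS backend (OpenBLAS, Accelerate)
    cmake -B build -DGGML_CUDA=ON       # optional CUDA offloading
    cmake -B build -DGGML_METAL=ON      # optional Metal offloading (macOS)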

Key Features

  • Zero external dependencies for the core library
  • Aggressive quantization (4-bit, 5-bit) with minimal quality loss
  • Hand-tuned SIMD kernels for x86 and ARM platforms
  • Memory-mapped model loading for instant startup (see the mmap sketch after this list)
  • Foundation of the GGUF ecosystem (llama.cpp, whisper.cpp, and more)
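
Memory-mapped loading, mentioned above, is plain POSIX mmap: the kernel pages weights in lazily on first access instead of copying the file into heap memory, so startup cost is independent of model size. A generic sketch of the technique, not ggml's actual loader (model.gguf is a placeholder path):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "model.gguf";  // placeholder file name
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        fstat(fd, &st);

        // Map the file read-only; pages load on first access, so this
        // returns almost instantly regardless of file size.
        void *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (data == MAP_FAILED) { perror("mmap"); return 1; }

        // Tensor data can now be read directly from `data` at the offsets
        // recorded in the file's metadata; no copy into malloc'd memory.
        printf("mapped %lld bytes at %p\n", (long long)st.st_size, data);

        munmap(data, st.st_size);
        close(fd);
        return 0;
    }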

Comparison with Similar Tools

  • PyTorch — GPU-first training framework; ggml targets CPU inference and edge deployment
  • ONNX Runtime — cross-platform inference with graph optimization; ggml offers deeper quantization support
  • TensorFlow Lite — mobile inference runtime; ggml supports larger models via aggressive quantization
  • Candle — Rust ML framework by Hugging Face; ggml is C-based with broader quantization format support

FAQ

Q: Is ggml only for LLMs? A: No, it supports general tensor operations. It powers speech, vision, and language models.

Q: What is the GGUF format? A: GGUF is the model file format developed alongside ggml for storing quantized model weights with metadata.

Q: Does ggml support GPU acceleration? A: Yes, optional CUDA and Metal backends can offload computation to GPUs, though CPU remains the primary target.

Q: How much RAM do quantized models need? A: A 7B parameter model at Q4 quantization requires roughly 4 GB of RAM.
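
The arithmetic behind that figure, assuming a Q4_0-style layout of 18 bytes per 32-weight block (about 4.5 bits per weight): 7 × 10⁹ × 4.5 / 8 ≈ 3.9 × 10⁹ bytes, or roughly 4 GB before context (KV cache) and activation buffers are added.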
