Configs · May 24, 2026 · 2 min read

ds4 — DeepSeek Local Inference Engine by antirez

A minimal C-based local inference engine for DeepSeek 4 Flash, optimized for Apple Metal and NVIDIA CUDA, created by the author of Redis.

Universal CLI install command
npx tokrepo install da2cc493-5727-11f1-9bc6-00163e2b0d79

Introduction

ds4 is a focused, minimal inference engine for running DeepSeek 4 Flash models locally. Written in C by Salvatore Sanfilippo (antirez, creator of Redis), it prioritizes simplicity and performance on consumer hardware with Metal and CUDA acceleration.

What ds4 Does

  • Runs DeepSeek 4 Flash models locally on Mac and Linux
  • Leverages Apple Metal for GPU acceleration on macOS
  • Supports NVIDIA CUDA for GPU inference on Linux
  • Loads GGUF-format quantized models for memory efficiency
  • Provides a simple CLI for interactive and batch inference
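
To illustrate why quantized weights save memory, here is a minimal sketch of block-wise 8-bit quantization in C. It is not ds4's code and not an exact GGUF quantization type: the `qblock8` struct, the block size of 32, the float scale, and both function names are assumptions chosen for illustration. Each block stores one scale plus 32 signed bytes (36 bytes on typical platforms) instead of 32 floats (128 bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QBLOCK 32  /* values per block; hypothetical, chosen for illustration */

/* One quantized block: a shared scale plus QBLOCK signed 8-bit values. */
typedef struct {
    float scale;
    int8_t q[QBLOCK];
} qblock8;

/* Quantize n floats into n / QBLOCK blocks. n must be a multiple of QBLOCK. */
void quantize_q8(const float *x, qblock8 *out, size_t n) {
    assert(n % QBLOCK == 0);
    for (size_t b = 0; b < n / QBLOCK; b++) {
        const float *src = x + b * QBLOCK;

        /* Find the largest magnitude in the block. */
        float amax = 0.0f;
        for (int i = 0; i < QBLOCK; i++) {
            float a = src[i] < 0.0f ? -src[i] : src[i];
            if (a > amax) amax = a;
        }

        /* Map [-amax, amax] onto [-127, 127]. An all-zero block gets scale 0. */
        float scale = amax / 127.0f;
        float inv = scale > 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;
        for (int i = 0; i < QBLOCK; i++) {
            float v = src[i] * inv;
            /* Round to nearest by adding or subtracting 0.5 before truncation. */
            out[b].q[i] = (int8_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
        }
    }
}

/* Reconstruct approximate floats from quantized blocks. */
void dequantize_q8(const qblock8 *in, float *y, size_t n) {
    assert(n % QBLOCK == 0);
    for (size_t b = 0; b < n / QBLOCK; b++)
        for (int i = 0; i < QBLOCK; i++)
            y[b * QBLOCK + i] = in[b].scale * (float)in[b].q[i];
}
```

The round trip is lossy: each reconstructed value can differ from the original by up to about half of the block's scale, which is the trade-off quantized formats accept in exchange for the smaller footprint.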

Architecture Overview

ds4 is a single-file C program with minimal dependencies, following the same design philosophy antirez applied to Redis. It implements a streamlined transformer inference loop with KV-cache management, quantized weight loading, and platform-specific GPU kernels for Metal and CUDA. The codebase deliberately avoids frameworks in favor of direct hardware access.
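
As a rough illustration of what KV-cache management means in an inference loop, the sketch below keeps the keys and values of already-processed tokens so that each new token only computes its own key and value and then attends over the stored rows. This is a generic single-head layout, not ds4's actual cache layout; `kv_cache`, `kv_append`, `kv_scores`, `kv_mix`, and the tiny `HEAD_DIM` and `MAX_CTX` sizes are all hypothetical. Softmax is left out for brevity, so `kv_mix` takes weights that a real loop would derive from the scaled scores.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HEAD_DIM 4  /* hypothetical head size, for illustration only */
#define MAX_CTX  8  /* hypothetical context limit */

/* Keys and values for every position decoded so far (single head). */
typedef struct {
    float k[MAX_CTX][HEAD_DIM];
    float v[MAX_CTX][HEAD_DIM];
    size_t len;  /* number of cached positions */
} kv_cache;

/* Append one token's key and value. Returns 0 on success, -1 if the cache is full. */
int kv_append(kv_cache *c, const float *k, const float *v) {
    if (c->len >= MAX_CTX) return -1;
    memcpy(c->k[c->len], k, sizeof c->k[0]);
    memcpy(c->v[c->len], v, sizeof c->v[0]);
    c->len++;
    return 0;
}

/* Raw (pre-softmax) attention scores of query q against every cached key. */
void kv_scores(const kv_cache *c, const float *q, float *scores) {
    assert(c->len <= MAX_CTX);
    for (size_t t = 0; t < c->len; t++) {
        float dot = 0.0f;
        for (int i = 0; i < HEAD_DIM; i++) dot += q[i] * c->k[t][i];
        scores[t] = dot;
    }
}

/* Weighted sum of cached values; w holds one weight per cached position. */
void kv_mix(const kv_cache *c, const float *w, float *out) {
    for (int i = 0; i < HEAD_DIM; i++) out[i] = 0.0f;
    for (size_t t = 0; t < c->len; t++)
        for (int i = 0; i < HEAD_DIM; i++)
            out[i] += w[t] * c->v[t][i];
}
```

The point of the cache is that the cost of one decode step grows with the number of cached positions rather than requiring keys and values for the whole prompt to be recomputed on every token.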

Self-Hosting & Configuration

  • Build from source with make (no package manager needed)
  • Download GGUF model files from Hugging Face or other sources
  • Configure GPU layers and context size via CLI flags
  • Set batch size and thread count for CPU-only inference
  • No Docker required; runs as a single native binary
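
After downloading a model, it can be useful to confirm the file really is GGUF before pointing an engine at it. The sketch below parses the fixed-size GGUF header from a byte buffer, assuming the little-endian layout used by GGUF version 2 and later: a 4-byte "GGUF" magic, a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The `gguf_header` struct and `gguf_parse_header` function are hypothetical names and are not part of ds4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size fields at the start of a GGUF file (version 2 and later). */
typedef struct {
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} gguf_header;

/* Read an nbytes-wide little-endian unsigned integer, independent of host endianness. */
static uint64_t read_le(const unsigned char *p, int nbytes) {
    uint64_t v = 0;
    for (int i = nbytes - 1; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

/* Parse the first 24 bytes of a file.
 * Returns 0 on success, -1 if the buffer is too short or the magic is wrong. */
int gguf_parse_header(const unsigned char *buf, size_t len, gguf_header *h) {
    assert(buf != NULL && h != NULL);
    if (len < 24) return -1;
    if (memcmp(buf, "GGUF", 4) != 0) return -1;
    h->version = (uint32_t)read_le(buf + 4, 4);
    h->tensor_count = read_le(buf + 8, 8);
    h->metadata_kv_count = read_le(buf + 16, 8);
    return 0;
}
```

In practice the buffer would come from reading the first 24 bytes of the downloaded file with `fread`; a failed check usually means a truncated download or a file in a different format.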

Key Features

  • Minimal C codebase with few external dependencies
  • Native Metal acceleration for Apple Silicon Macs
  • CUDA support for NVIDIA GPUs
  • GGUF model format for efficient quantized inference
  • Created by antirez with the same simplicity principles as Redis

Comparison with Similar Tools

  • llama.cpp: supports many model families; ds4 is specialized for DeepSeek and simpler
  • Ollama: a user-friendly wrapper around an inference backend; ds4 is the raw inference engine itself
  • vLLM: a production serving system; ds4 targets single-user local inference
  • ExLlamaV2: Python-based; ds4 is written in C with minimal runtime dependencies

FAQ

Q: Which models does ds4 support? A: It is specifically designed for DeepSeek 4 Flash models in GGUF format.

Q: Do I need a GPU? A: No. It runs on CPU, but Metal (Mac) or CUDA (NVIDIA) acceleration significantly improves speed.

Q: Who is antirez? A: Salvatore Sanfilippo, the creator of Redis. He is known for writing high-quality, minimal C programs.

Q: How does it compare to llama.cpp? A: ds4 is narrower in scope, targeting only DeepSeek models. This focus allows a simpler codebase and potentially better optimization for that model family.
