Scripts · April 15, 2026 · 1 min read

RocksDB — Facebook's Embeddable Persistent Key-Value Store

A C++ LSM-tree storage engine from Meta powering MyRocks, CockroachDB, TiKV, Kafka Streams and more. Optimized for fast SSDs and workloads with high write throughput and low read latency.

Introduction

RocksDB is an embeddable persistent key-value store written in C++, forked from LevelDB at Facebook in 2012 and tuned for flash storage and server workloads. It powers MyRocks, CockroachDB, TiKV, YugabyteDB, Kafka Streams state stores and countless custom data systems, because it removes the pain of writing a durable, concurrent, crash-safe LSM engine from scratch.

What RocksDB Does

  • Stores ordered byte-string keys with atomic Put/Get/Delete/Merge operations and range iteration.
  • Uses a log-structured merge tree with memtables, immutable SST files and leveled/universal compaction.
  • Provides column families so one process can host many independent keyspaces sharing a WAL.
  • Supports snapshots, transactions (optimistic & pessimistic) and checkpoints for consistent backups.
  • Exposes thousands of tuning knobs (block cache, bloom filters, rate limiter, compression) for SSD, HDD or RAM-heavy setups.
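The core API semantics in the list above (ordered byte-string keys, point operations, range iteration) can be modeled with a toy in-memory sketch. This is an illustration of the semantics only, not RocksDB's implementation or its actual binding API:

```python
import bisect

class ToyKV:
    """Toy model of an ordered key-value store: byte-string keys,
    Put/Get/Delete, and range scans in key order."""
    def __init__(self):
        self._keys = []   # keys kept sorted, like an LSM's ordered view
        self._data = {}

    def put(self, key: bytes, value: bytes):
        if key not in self._data:
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)

    def delete(self, key: bytes):
        if key in self._data:
            del self._data[key]
            self._keys.remove(key)

    def scan(self, lo: bytes, hi: bytes):
        """Iterate (key, value) pairs with lo <= key < hi, in sorted order."""
        i = bisect.bisect_left(self._keys, lo)
        while i < len(self._keys) and self._keys[i] < hi:
            k = self._keys[i]
            yield k, self._data[k]
            i += 1

db = ToyKV()
db.put(b"user:1", b"alice")
db.put(b"user:2", b"bob")
db.put(b"order:9", b"widget")
# Prefix scan expressed as a key range, as with a RocksDB iterator:
print([k for k, _ in db.scan(b"user:", b"user;")])  # → [b'user:1', b'user:2']
```

Because keys are ordered bytes, prefix scans fall out of plain range scans, which is why composite keys like `user:<id>` work well in this model.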

Architecture Overview

Writes enter a write-ahead log and an in-memory skiplist memtable. When memtables fill, they are flushed to sorted string tables (SSTs) on disk and organized into levels, where background compaction merges and reorders data to keep reads fast. A block cache holds hot SST blocks, bloom filters skip SSTs that cannot contain a key and rate limiters throttle I/O so compaction does not starve foreground traffic.
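The write/flush/read path above can be sketched in a few lines. This is a deliberately simplified toy (no WAL, no real files, no levels, linear SST search), meant only to show how memtable flushes and compaction interact:

```python
class ToyLSM:
    """Toy model of the LSM write/flush/read path, not RocksDB's
    implementation."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}        # stands in for the skiplist memtable
        self.ssts = []            # immutable sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into a sorted, immutable run (an "SST").
        self.ssts.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Read path: memtable first, then SSTs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sst in reversed(self.ssts):
            for k, v in sst:
                if k == key:
                    return v
        return None

    def compact(self):
        # Background compaction: merge all runs; the newest value wins.
        merged = {}
        for sst in self.ssts:     # oldest first, so newer entries overwrite
            merged.update(dict(sst))
        self.ssts = [sorted(merged.items())]

db = ToyLSM()
db.put("a", "1"); db.put("b", "2")   # fills the memtable, triggers a flush
db.put("a", "3"); db.put("c", "4")   # second flush, "a" now shadowed
db.compact()
print(db.get("a"), len(db.ssts))     # → 3 1  (newest value, one merged run)
```

Real RocksDB avoids the linear SST scan with per-file key ranges, indexes and bloom filters, and spreads runs across levels, but the shadowing-then-merging behavior is the same idea.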

Self-Hosting & Configuration

  • Link against librocksdb or one of the official bindings (Java, Go, Rust, Python, Node.js).
  • Pick a compaction style: leveled for read-heavy, universal for write-heavy, FIFO for logs/metrics.
  • Size write_buffer_size, max_write_buffer_number and level0_slowdown_writes_trigger for your SSD bandwidth.
  • Enable BlobDB when values are much larger than keys to avoid rewriting them during compaction.
  • Tune block_cache, bloom filter bits/key and direct I/O separately for reads vs compaction.
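Several of the knobs above can be set declaratively in a RocksDB OPTIONS file. The fragment below is illustrative only; option names and accepted values vary by RocksDB version, so verify against your build's documentation:

```ini
# Illustrative OPTIONS-file fragment — verify names against your RocksDB version.
[DBOptions]
  max_background_jobs=4

[CFOptions "default"]
  compaction_style=kCompactionStyleLevel   # leveled: read-heavy default
  write_buffer_size=67108864               # 64 MiB memtable
  max_write_buffer_number=3
  level0_slowdown_writes_trigger=20
  compression=kLZ4Compression

[TableOptions/BlockBasedTable "default"]
  block_size=16384
  cache_index_and_filter_blocks=true
```

The same options can be set programmatically through `Options` in whichever binding you use.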

Key Features

  • Merge operators for counters and CRDT-like aggregates without read-modify-write round trips.
  • Pessimistic and optimistic transactions with serializable snapshots.
  • Online backups, incremental checkpoints and replication-friendly WAL shipping.
  • Rate limiters, write stalls and I/O priority classes so compaction respects SLOs.
  • Column families, prefix extractors and iterator pinning for efficient composite-key workloads.
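The merge-operator idea from the list above is worth unpacking: instead of reading a value, modifying it, and writing it back, a writer appends an operand, and reads fold the operands onto the base value. This is a toy Python sketch of that semantics, not RocksDB's actual `MergeOperator` API:

```python
class ToyMergeCounter:
    """Toy sketch of merge-operator semantics for a counter:
    Merge appends a delta; Get folds base value + pending deltas."""
    def __init__(self):
        self.base = {}       # fully-merged values
        self.operands = {}   # pending deltas, kept until a read (or, in
                             # RocksDB, compaction) merges them

    def merge(self, key, delta):
        # A blind write: no read needed, so writers never pay a round trip.
        self.operands.setdefault(key, []).append(delta)

    def get(self, key):
        # Full merge: fold pending operands onto the base value.
        value = self.base.get(key, 0) + sum(self.operands.get(key, []))
        # Persist the folded result and drop the operands, as compaction would.
        self.base[key] = value
        self.operands.pop(key, None)
        return value

c = ToyMergeCounter()
c.merge("hits", 1); c.merge("hits", 1); c.merge("hits", 5)
print(c.get("hits"))  # → 7
```

In RocksDB the fold is implemented by a user-supplied merge operator, and the addition here could be any associative operation (set union, max, list append).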

Comparison with Similar Tools

  • LevelDB — the ancestor; single-threaded compaction and far fewer tuning options.
  • LMDB — mmap B-tree; great reads, weaker on write-heavy workloads.
  • BadgerDB — pure-Go LSM, no CGO, but fewer features than RocksDB.
  • WiredTiger — MongoDB's engine; hybrid B-tree/LSM with different trade-offs.
  • Sled / Fjall — modern Rust embedded KVs; simpler API, smaller ecosystem.

FAQ

Q: Is RocksDB a database or a library?
A: A library. It gives you an engine; you implement the server, schema and networking on top.

Q: Does RocksDB support replication?
A: Not built-in. Ship the WAL, use checkpoints, or build on a consensus layer such as Raft (as TiKV does).

Q: When should I pick LMDB instead?
A: When your workload is read-dominant, fits mostly in RAM and you want zero-copy mmap reads.

Q: Is it safe to run RocksDB in production on a single node?
A: Yes — MyRocks and CockroachDB run it at massive scale. Tune compaction and monitor write stalls.
