What Cassandra Does
- Wide-column model — partition key + clustering keys + columns
- CQL query language — SQL-like declarative syntax
- Masterless peer-to-peer — all nodes equal (no single point of failure)
- Tunable consistency — per-query consistency level (ONE, QUORUM, ALL)
- Multi-DC replication — NetworkTopologyStrategy
- Lightweight transactions (LWT) — Paxos-based compare-and-set
- Materialized views — denormalized auto-maintained tables
- TTL — per-cell time-to-live
- Gossip protocol — peer state distribution
- Compaction strategies — STCS, LCS, TWCS for different workloads
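To make the tunable-consistency item above concrete, here is a minimal sketch of the usual overlap rule: a read is guaranteed to reach at least one replica that acknowledged the latest write when R + W > RF. The function names are hypothetical (not part of any Cassandra driver), and the sketch assumes a single datacenter and covers only the three levels named above.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that form a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def replicas_required(level: str, replication_factor: int) -> int:
    """Replicas that must acknowledge a request at the given consistency level."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


def reads_see_latest_write(write_level: str, read_level: str,
                           replication_factor: int) -> bool:
    """True when the read and write replica sets must overlap (R + W > RF)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor


if __name__ == "__main__":
    rf = 3  # assumed replication factor for the example
    for w, r in [("ONE", "ONE"), ("QUORUM", "QUORUM"), ("ONE", "ALL")]:
        print(f"write={w} read={r} overlap={reads_see_latest_write(w, r, rf)}")
```

With a replication factor of 3, QUORUM writes plus QUORUM reads overlap (2 + 2 > 3), while ONE plus ONE does not, which is the trade-off between latency and consistency that the per-query setting exposes.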
Architecture
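Peer-to-peer ring: each node owns a range of tokens, and consistent hashing of the partition key decides which range a row falls into. Each partition is replicated to N nodes, where N is the keyspace's replication factor. A write is appended to the commit log and applied to an in-memory memtable, which is later flushed to immutable SSTables on disk. A read merges results from the memtable and the relevant SSTables; bloom filters let it skip SSTables that cannot contain the partition, and the optional row cache can serve hot rows without touching disk.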
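The ring placement described above can be sketched in a few lines. This is an illustration under simplifying assumptions, not Cassandra's implementation: it uses MD5 where Cassandra's default partitioner uses Murmur3, gives each node a single token derived from its name (no virtual nodes), and places replicas on the next nodes clockwise without rack or datacenter awareness. `Ring`, `token_for`, and the node names are hypothetical.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    """Map a key to a 64-bit token (MD5 here is an assumption for illustration)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, replication_factor):
        if replication_factor > len(nodes):
            raise ValueError("replication factor cannot exceed node count")
        self.replication_factor = replication_factor
        # One token per node, kept sorted so lookups can use binary search.
        self.entries = sorted((token_for(node), node) for node in nodes)
        self.tokens = [token for token, _ in self.entries]

    def replicas(self, partition_key):
        """Return the nodes holding a partition.

        The first replica is the node with the smallest token >= the key's
        token (wrapping around the ring); the rest follow clockwise.
        """
        start = bisect.bisect_left(self.tokens, token_for(partition_key))
        count = len(self.entries)
        return [self.entries[(start + i) % count][1]
                for i in range(self.replication_factor)]


if __name__ == "__main__":
    ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=3)
    print(ring.replicas("user:42"))
```

Because placement depends only on the key's token and the sorted node tokens, any node can compute the replica set for a key locally, which is what allows every node to act as a coordinator.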
Self-Hosting
# 3-node cluster
version: "3"
services:
  cassandra-node1:
    image: cassandra:5
    environment:
      CASSANDRA_CLUSTER_NAME: tokrepo
      CASSANDRA_SEEDS: cassandra-node1
  cassandra-node2:
    image: cassandra:5
    environment:
      CASSANDRA_CLUSTER_NAME: tokrepo
      CASSANDRA_SEEDS: cassandra-node1
  cassandra-node3:
    image: cassandra:5
    environment:
      CASSANDRA_CLUSTER_NAME: tokrepo
      CASSANDRA_SEEDS: cassandra-node1
Key Features
- Linear horizontal scaling
- Masterless architecture
- Tunable consistency
- Multi-DC replication
- CQL query language
- Secondary indexes
- Materialized views
- Lightweight transactions
- TTL for auto-expiration
- Battle-tested at petabyte scale
Comparison
| Database | Model | Consistency | Scale |
|---|---|---|---|
| Cassandra | Wide column | Tunable (AP) | Linear (masterless) |
| ScyllaDB | Wide column (CQL compatible) | Tunable | Linear (shard-per-core) |
| HBase | Wide column | Strong (CP) | Region servers |
| DynamoDB | Key-value + doc | Tunable | Managed |
| Bigtable | Wide column | Strong | Managed |
| MongoDB | Document | Tunable | Sharding |
FAQ
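Q: Cassandra vs ScyllaDB? A: ScyllaDB is CQL-compatible and aims to be a drop-in replacement, though feature parity is not complete. It is implemented in C++ with a shard-per-core architecture, and its vendor reports several times higher throughput on the same hardware; actual results depend on the workload. Its commercial edition is proprietary. Cassandra is an Apache Software Foundation project with a more mature ecosystem.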
Q: What scenarios is it good for? A: Large-scale time series data, event logs, IoT, message history, and recommendation systems. It is a poor fit for workloads that need JOINs or strong multi-row transactions: CQL has no JOINs, and lightweight transactions cover only single-partition compare-and-set.
Q: Data modeling principles? A: Query-driven. Decide on your query patterns first, then design your table schema. Denormalization and data duplication are the norm — one table per query pattern.
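As a rough illustration of "one table per query pattern", the sketch below models two denormalized tables as in-memory dictionaries keyed by partition key. The schema (a chat-message store queried by room and by user) and all names are hypothetical; a real deployment would define these as CQL tables and write to both, for example in a batch.

```python
from collections import defaultdict


class MessageStore:
    """Two denormalized 'tables', one per query pattern (hypothetical schema)."""

    def __init__(self):
        # Query 1: latest messages in a room -> partition key is room_id.
        self.messages_by_room = defaultdict(list)
        # Query 2: latest messages sent by a user -> partition key is user_id.
        self.messages_by_user = defaultdict(list)

    def add_message(self, room_id, user_id, sent_at, body):
        """Write the same row to both tables; duplication is intentional."""
        row = (sent_at, room_id, user_id, body)
        for partition in (self.messages_by_room[room_id],
                          self.messages_by_user[user_id]):
            partition.append(row)
            # Keep rows ordered by the clustering column (sent_at), newest first.
            partition.sort(key=lambda r: r[0], reverse=True)

    def latest_in_room(self, room_id, limit=10):
        """Single-partition read: no join, no scan of other partitions."""
        return self.messages_by_room.get(room_id, [])[:limit]

    def latest_by_user(self, user_id, limit=10):
        return self.messages_by_user.get(user_id, [])[:limit]


if __name__ == "__main__":
    store = MessageStore()
    store.add_message("room-a", "alice", 1, "hello")
    store.add_message("room-a", "bob", 2, "hi alice")
    store.add_message("room-b", "alice", 3, "anyone here?")
    print(store.latest_in_room("room-a"))
    print(store.latest_by_user("alice"))
```

Each query reads exactly one partition in a table built for it, at the cost of writing every message twice. That is the trade the query-driven approach makes: extra writes and storage in exchange for predictable reads.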
Sources
- Docs: https://cassandra.apache.org/doc
- GitHub: https://github.com/apache/cassandra
- License: Apache 2.0