# Apache Cassandra — Distributed Wide-Column Database at Scale

> Apache Cassandra is an open-source, distributed, wide-column NoSQL database. It offers linear scalability and proven fault tolerance on commodity hardware, and is used at Netflix, Apple, Instagram, and eBay for petabyte-scale, high-availability workloads.

## Install

Save the content below to `.claude/skills/` or append to your `CLAUDE.md`:

## Quick Use

```bash
# Docker
docker run -d --name cassandra -p 9042:9042 cassandra:5

# CQL shell
docker exec -it cassandra cqlsh
```

CQL usage:
```sql
-- CQL string literals use single quotes; double quotes denote identifiers
CREATE KEYSPACE tokrepo WITH REPLICATION = {
  'class': 'SimpleStrategy',
  'replication_factor': 1
};

USE tokrepo;

CREATE TABLE events_by_user (
  user_id uuid,
  event_time timestamp,
  event_type text,
  payload text,
  PRIMARY KEY (user_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

INSERT INTO events_by_user (user_id, event_time, event_type, payload)
VALUES (uuid(), toTimestamp(now()), 'page_view', '/home');

-- Time-ordered query
SELECT * FROM events_by_user
WHERE user_id = 9b3a...
AND event_time > '2026-01-01';

-- TTL (auto-delete after 7 days)
INSERT INTO events_by_user (user_id, event_time, event_type)
VALUES (uuid(), toTimestamp(now()), 'login')
USING TTL 604800;
```

## Intro

Apache Cassandra is a free and open-source, distributed, wide-column NoSQL database. Originally developed at Facebook to power inbox search, it was open-sourced in 2008, entered the Apache Incubator in 2009, and graduated to a top-level Apache project in 2010. It is built to provide scalability and high availability without compromising performance, with linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure.

- **Repo**: https://github.com/apache/cassandra
- **Stars**: 9K+
- **Language**: Java
- **License**: Apache 2.0

## What Cassandra Does

- **Wide-column model** — partition key + clustering keys + columns
- **CQL query language** — SQL-like declarative syntax
- **Masterless peer-to-peer** — all nodes equal (no single point of failure)
- **Tunable consistency** — per-query consistency level (ONE, QUORUM, ALL)
- **Multi-DC replication** — NetworkTopologyStrategy
- **Lightweight transactions (LWT)** — Paxos-based compare-and-set
- **Materialized views** — denormalized auto-maintained tables
- **TTL** — per-cell time-to-live
- **Gossip protocol** — peer state distribution
- **Compaction strategies** — STCS, LCS, TWCS for different workloads
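
The tunable-consistency bullet above comes down to simple arithmetic: with replication factor RF, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas whenever R + W > RF. The Python sketch below only illustrates that rule; the function names are hypothetical and it does not talk to a cluster.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for a majority (QUORUM) at the given replication factor."""
    return replication_factor // 2 + 1


def read_overlaps_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True when any read set must share at least one replica with any write set."""
    return read_replicas + write_replicas > rf


if __name__ == "__main__":
    rf = 3
    q = quorum(rf)
    print(q)                               # 2
    print(read_overlaps_write(rf, q, q))   # True: QUORUM writes + QUORUM reads
    print(read_overlaps_write(rf, 1, 1))   # False: ONE + ONE may miss the latest write
```

With RF = 3, QUORUM reads and writes tolerate one unavailable replica while still overlapping; ONE/ONE trades that guarantee for lower latency.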

## Architecture

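Peer-to-peer ring: each node owns a range of tokens, and consistent hashing maps every partition key to a token. Each partition is replicated to N nodes, where N is the replication factor. A write is appended to the commit log and applied to an in-memory memtable; memtables are later flushed to immutable SSTables on disk. A read merges the memtable with the relevant SSTables, using bloom filters to skip SSTables that cannot contain the partition and, if enabled, the row cache.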
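
As a rough illustration of the ring, the Python sketch below places nodes on a hash ring and picks replicas by walking clockwise from the key's token, which is roughly what SimpleStrategy does. It is a toy under stated assumptions: one token per node (no virtual nodes), MD5 as a stand-in hash rather than Cassandra's default partitioner, and a hypothetical `Ring` class that is not part of any Cassandra API.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (MD5 is a stand-in hash for this sketch)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class Ring:
    """Toy token ring: one token per node, replicas chosen clockwise."""

    def __init__(self, nodes, replication_factor):
        if not 1 <= replication_factor <= len(nodes):
            raise ValueError("replication factor must be between 1 and the node count")
        self.rf = replication_factor
        # Sort nodes by token so the list order is the ring order.
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, partition_key: str):
        """Return the rf nodes responsible for the key, primary replica first."""
        # First node whose token is >= the key's token, wrapping past the end.
        start = bisect.bisect_left(self.tokens, token(partition_key)) % len(self.entries)
        return [self.entries[(start + i) % len(self.entries)][1] for i in range(self.rf)]


if __name__ == "__main__":
    ring = Ring(["node1", "node2", "node3", "node4"], replication_factor=3)
    print(ring.replicas("user-42"))
```

Because placement depends only on the hash, any node can compute the replica set for a key without consulting a coordinator, which is what makes the masterless design possible.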

## Self-Hosting

```yaml
# 3-node cluster
version: "3"
services:
  cassandra-node1:
    image: cassandra:5
    environment:
      CASSANDRA_CLUSTER_NAME: tokrepo
      CASSANDRA_SEEDS: cassandra-node1
  cassandra-node2:
    image: cassandra:5
    environment:
      CASSANDRA_CLUSTER_NAME: tokrepo
      CASSANDRA_SEEDS: cassandra-node1
  cassandra-node3:
    image: cassandra:5
    environment:
      CASSANDRA_CLUSTER_NAME: tokrepo
      CASSANDRA_SEEDS: cassandra-node1
```

## Key Features

- Linear horizontal scaling
- Masterless architecture
- Tunable consistency
- Multi-DC replication
- CQL query language
- Secondary indexes
- Materialized views
- Lightweight transactions
- TTL for auto-expiration
- Battle-tested at petabyte scale

## Comparison

| Database | Model | Consistency | Scale |
|---|---|---|---|
| Cassandra | Wide column | Tunable (AP) | Linear (masterless) |
| ScyllaDB | Wide column (CQL compatible) | Tunable | Linear (shard-per-core) |
| HBase | Wide column | Strong (CP) | Region servers |
| DynamoDB | Key-value + doc | Tunable | Managed |
| Bigtable | Wide column | Strong | Managed |
| MongoDB | Document | Tunable | Sharding |

## FAQ

**Q: Cassandra vs ScyllaDB?**
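A: ScyllaDB is CQL-compatible with Cassandra, though not every feature matches one-to-one. It is implemented in C++ with a shard-per-core architecture and is often reported to deliver several times the throughput per node, depending on workload; its commercial edition is proprietary. Cassandra is an Apache Software Foundation project with a more mature ecosystem.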

**Q: What scenarios is it good for?**
A: Large-scale time series data, event logs, IoT, message history, and recommendation systems. It is not suitable for workloads that need JOINs (CQL has none) or strong multi-partition transactions.

**Q: Data modeling principles?**
A: Query-driven. Decide on your query patterns first, then design your table schema. Denormalization and data duplication are the norm — one table per query pattern.
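
A minimal Python sketch of the "one table per query pattern" idea, using plain dicts as stand-ins for tables. `events_by_user` mirrors the table from Quick Use; `events_by_type` and the `EventStore` class are hypothetical additions for illustration, and nothing here uses a Cassandra driver.

```python
from collections import defaultdict


class EventStore:
    """Toy stand-in for two denormalized tables; not a Cassandra client."""

    def __init__(self):
        # Each dict models one table; the dict key plays the partition-key role.
        self.events_by_user = defaultdict(list)
        self.events_by_type = defaultdict(list)  # hypothetical second table

    def record_event(self, user_id, event_time, event_type):
        """Write the same event to every table that serves a query pattern."""
        row = {"user_id": user_id, "event_time": event_time, "event_type": event_type}
        self.events_by_user[user_id].append(row)
        self.events_by_type[event_type].append(row)

    def query_by_user(self, user_id):
        """Single-partition read, newest first (like CLUSTERING ORDER BY ... DESC)."""
        rows = self.events_by_user.get(user_id, [])
        # ISO-8601 strings in one format sort in chronological order.
        return sorted(rows, key=lambda r: r["event_time"], reverse=True)

    def query_by_type(self, event_type):
        """Same data, but partitioned for a different question."""
        rows = self.events_by_type.get(event_type, [])
        return sorted(rows, key=lambda r: r["event_time"], reverse=True)


if __name__ == "__main__":
    store = EventStore()
    store.record_event("u1", "2026-01-01T10:00:00", "login")
    store.record_event("u1", "2026-01-02T10:00:00", "page_view")
    print(store.query_by_user("u1"))
    print(store.query_by_type("login"))
```

Each query reads exactly one partition; the cost is paid at write time, where every event is stored once per table.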

## Sources

- Docs: https://cassandra.apache.org/doc
- GitHub: https://github.com/apache/cassandra
- License: Apache 2.0

---
Source: https://tokrepo.com/en/workflows/apache-cassandra-distributed-wide-column-database-scale-0114283a
Author: Apache Software Foundation