# Apache Cassandra — Distributed Wide-Column Database at Scale > Apache Cassandra is an open-source, distributed, wide-column NoSQL database. Linear scalability and proven fault-tolerance on commodity hardware. Used at Netflix, Apple, Instagram, and eBay for petabyte-scale workloads with high availability. ## Install Save the content below to `.claude/skills/` or append to your `CLAUDE.md`: ## Quick Use ```bash # Docker docker run -d --name cassandra -p 9042:9042 cassandra:5 # CQL shell docker exec -it cassandra cqlsh ``` CQL usage: ```sql CREATE KEYSPACE tokrepo WITH REPLICATION = { "class": "SimpleStrategy", "replication_factor": 1 }; USE tokrepo; CREATE TABLE events_by_user ( user_id uuid, event_time timestamp, event_type text, payload text, PRIMARY KEY (user_id, event_time) ) WITH CLUSTERING ORDER BY (event_time DESC); INSERT INTO events_by_user (user_id, event_time, event_type, payload) VALUES (uuid(), toTimestamp(now()), "page_view", "/home"); -- Time-ordered query SELECT * FROM events_by_user WHERE user_id = 9b3a... AND event_time > "2026-01-01"; -- TTL (auto-delete after 7 days) INSERT INTO events_by_user (user_id, event_time, event_type) VALUES (uuid(), toTimestamp(now()), "login") USING TTL 604800; ``` ## Intro Apache Cassandra is a free and open-source, distributed, wide-column NoSQL database. Originally developed at Facebook in 2008 to handle inbox search, open-sourced in 2009, and graduated to Apache top-level in 2010. Built to provide scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure. - **Repo**: https://github.com/apache/cassandra - **Stars**: 9K+ - **Language**: Java - **License**: Apache 2.0 ## What Cassandra Does - **Wide-column model** — partition key + clustering keys + columns - **CQL query language** — SQL-like declarative syntax - **Masterless peer-to-peer** — all nodes equal (no single point of failure) - **Tunable consistency** — per-query consistency level (ONE, QUORUM, ALL) - **Multi-DC replication** — NetworkTopologyStrategy - **Lightweight transactions (LWT)** — Paxos-based compare-and-set - **Materialized views** — denormalized auto-maintained tables - **TTL** — per-cell time-to-live - **Gossip protocol** — peer state distribution - **Compaction strategies** — STCS, LCS, TWCS for different workloads ## Architecture Peer-to-peer ring: each node owns a range of partition keys determined by consistent hashing. Data is replicated to N nodes. Writes go to a memtable + commit log, flushed to SSTables. Reads merge SSTables + memtable + possibly bloom filters and row cache. ## Self-Hosting ```yaml # 3-node cluster version: "3" services: cassandra-node1: image: cassandra:5 environment: CASSANDRA_CLUSTER_NAME: tokrepo CASSANDRA_SEEDS: cassandra-node1 cassandra-node2: image: cassandra:5 environment: CASSANDRA_CLUSTER_NAME: tokrepo CASSANDRA_SEEDS: cassandra-node1 cassandra-node3: image: cassandra:5 environment: CASSANDRA_CLUSTER_NAME: tokrepo CASSANDRA_SEEDS: cassandra-node1 ``` ## Key Features - Linear horizontal scaling - Masterless architecture - Tunable consistency - Multi-DC replication - CQL query language - Secondary indexes - Materialized views - Lightweight transactions - TTL for auto-expiration - Battle-tested at petabyte scale ## Comparison | Database | Model | Consistency | Scale | |---|---|---|---| | Cassandra | Wide column | Tunable (AP) | Linear (masterless) | | ScyllaDB | Wide column (CQL compatible) | Tunable | Linear (shard-per-core) | | HBase | Wide column | Strong (CP) | Region servers | | DynamoDB | Key-value + doc | Tunable | Managed | | Bigtable | Wide column | Strong | Managed | | MongoDB | Document | Tunable | Sharding | ## FAQ **Q: Cassandra vs ScyllaDB?** A: Fully API-compatible. ScyllaDB is implemented in C++ with a shard-per-core architecture and performs several times better, but its commercial edition is proprietary. Cassandra is an Apache Foundation project with a more mature ecosystem. **Q: What scenarios is it good for?** A: Large-scale time series data, event logs, IoT, message history, and recommendation systems. Not suitable for scenarios requiring complex JOINs or strong transactions (no multi-table JOINs). **Q: Data modeling principles?** A: Query-driven. Decide on your query patterns first, then design your table schema. Denormalization and data duplication are the norm — one table per query pattern. ## Sources - Docs: https://cassandra.apache.org/doc - GitHub: https://github.com/apache/cassandra - License: Apache 2.0 --- Source: https://tokrepo.com/en/workflows/apache-cassandra-distributed-wide-column-database-scale-0114283a Author: Apache Software Foundation