# CubeFS — Cloud-Native Distributed File System

> CubeFS is a CNCF-graduated distributed storage system supporting S3, POSIX, and HDFS interfaces for cloud-native and AI workloads.

## Quick Use
```bash
# Deploy a minimal cluster with Docker Compose
git clone https://github.com/cubefs/cubefs.git
cd cubefs/docker/docker-compose
docker compose up -d

# Mount a CubeFS volume via FUSE
cfs-client -c /etc/cubefs/fuse.json

# Or access data via the built-in S3 gateway
aws s3 ls --endpoint-url http://localhost:9000
```
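The `cfs-client -c` step above expects a JSON config file. A minimal sketch of such a file is shown below; the field names follow the CubeFS client documentation as best recalled, and every value (volume name, owner, mount point, master address, log directory) is a placeholder assumption, so check them against your own cluster before use.

```bash
# Write a hypothetical client config into the current directory.
# All values are placeholders; 17010 is assumed to be the master's listen port.
cat > fuse.json <<'EOF'
{
  "mountPoint": "/mnt/cubefs",
  "volName": "demo-vol",
  "owner": "demo-user",
  "masterAddr": "127.0.0.1:17010",
  "logDir": "/tmp/cubefs/client/log",
  "logLevel": "info"
}
EOF

# With a running cluster and an existing volume, mount it with:
# cfs-client -c ./fuse.json
```

The mount command is left commented out because it only succeeds once the cluster is up and the named volume exists.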

## Introduction
CubeFS is a cloud-native distributed storage system that provides unified access through POSIX, S3-compatible, and HDFS interfaces. Originally developed at JD.com and now a CNCF graduated project, it is designed for large-scale containerized environments, AI/ML training pipelines, and big data analytics.

## What CubeFS Does
- Stores and serves data through POSIX mount, S3 API, and HDFS interface simultaneously
- Scales storage capacity and throughput horizontally by adding data nodes
- Supports multi-tenancy with per-volume quotas and access control
- Provides erasure coding and multi-replica modes for tunable durability vs. cost
- Integrates with Kubernetes via CSI driver for persistent volume provisioning

## Architecture Overview
CubeFS separates metadata and data into independent subsystems. A metadata subsystem (MetaNode cluster with Raft consensus) manages the file namespace. A data subsystem (DataNode cluster) stores file chunks using either multi-replica or erasure coding. A master service coordinates cluster topology, volume management, and node health. Client libraries (FUSE, S3 gateway, HDFS adapter) translate protocol requests into internal RPCs.

## Self-Hosting & Configuration
- Deploy master, meta, and data nodes via Docker Compose, Helm chart, or Ansible playbooks
- Minimum viable cluster: 1 master, 3 meta nodes, and 3 data nodes (4 data nodes for erasure coding)
- Configure volumes with replication factor or erasure coding policy per use case
- Use the Kubernetes CSI driver for dynamic PV provisioning in container workloads
- Monitor with built-in Prometheus metrics endpoint and Grafana dashboards
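Dynamic provisioning through the CSI driver usually comes down to a PersistentVolumeClaim that names a CubeFS-backed StorageClass. The sketch below writes such a claim to a file; the class name `cfs-sc`, the claim name, and the requested size are assumptions for illustration and need to match whatever your CSI installation actually created.

```bash
# Write a hypothetical PVC manifest; names and sizes are placeholders.
cat > cubefs-pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: cubefs-demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  # Assumed StorageClass name; list yours with: kubectl get storageclass
  storageClassName: cfs-sc
EOF

# Apply it once the CSI driver and StorageClass are installed:
# kubectl apply -f cubefs-pvc.yaml
```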

## Key Features
- Triple-protocol access (POSIX, S3, HDFS) from a single storage pool
- Erasure coding can cut raw storage overhead to roughly 1.5x (for example, 4 data shards plus 2 parity shards), versus 3x for triple replication
- Multi-tenant volume isolation with per-tenant quotas and ACLs
- Kubernetes CSI integration for seamless persistent volume management
- CNCF graduated project with active community and enterprise production deployments
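The 1.5x versus 3x comparison follows from simple arithmetic: a layout with `k` data shards and `m` parity shards stores `(k + m) / k` bytes per logical byte, while `n`-way replication stores `n`. The helper below is an illustrative sketch; `ec_overhead` is a hypothetical function name, and the shard counts are examples rather than CubeFS defaults.

```bash
# Print the raw-storage multiplier for a k+m erasure coding layout.
ec_overhead() {
  awk -v k="$1" -v m="$2" 'BEGIN { printf "%.2f\n", (k + m) / k }'
}

ec_overhead 4 2    # 4 data + 2 parity shards
ec_overhead 10 4   # 10 data + 4 parity shards
```

A 4+2 layout gives 1.50 and a 10+4 layout gives 1.40, both well below the 3.00 of three full replicas. Wider layouts save more space but involve more nodes in every repair.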

## Comparison with Similar Tools
- **Ceph** — Mature and feature-rich but operationally complex; CubeFS aims for simpler deployment
- **MinIO** — S3-only object storage, no POSIX or HDFS interface
- **SeaweedFS** — Lightweight blob store with FUSE mount but no HDFS compatibility
- **JuiceFS** — POSIX filesystem backed by object storage; CubeFS manages its own data nodes
- **Longhorn** — Kubernetes block storage only, no file or object interface

## FAQ
**Q: Is CubeFS suitable for AI/ML training data?**
A: Yes. Its POSIX interface allows direct mount into training containers, and its S3 gateway supports frameworks that read from object storage.

**Q: How does CubeFS handle node failures?**
A: Multi-replica mode re-replicates chunks automatically. Erasure coding mode reconstructs missing shards from the data and parity shards that survive on other nodes.
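The idea behind parity-based reconstruction can be illustrated with the simplest possible code, a single XOR parity shard. This is a toy sketch, not CubeFS's implementation: production erasure coding typically uses Reed-Solomon codes that tolerate more than one lost shard, and the byte values below are arbitrary.

```bash
# Three hypothetical one-byte data shards
d1=$((0x5A)); d2=$((0x3C)); d3=$((0xF0))

# Parity shard: XOR of all data shards
parity=$(( d1 ^ d2 ^ d3 ))

# Simulate losing d2, then rebuild it from the surviving shards and parity
recovered=$(( d1 ^ d3 ^ parity ))

printf 'lost: 0x%02X  recovered: 0x%02X\n' "$d2" "$recovered"
```

Because XOR is its own inverse, combining the survivors with the parity cancels everything except the missing shard.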

**Q: Can CubeFS run on commodity hardware?**
A: Yes. It is designed for standard x86 servers with local SSDs or HDDs and does not require specialized storage hardware.

**Q: What is the minimum cluster size?**
A: A functional cluster needs at least 1 master, 3 meta nodes, and 3 data nodes (or 4 for erasure coding).

## Sources
- https://github.com/cubefs/cubefs
- https://cubefs.io/docs/

---
Source: https://tokrepo.com/en/workflows/asset-80593d34
Author: AI Open Source