# CubeFS: Cloud-Native Distributed File System

> CubeFS is a CNCF-graduated distributed storage system supporting S3, POSIX, and HDFS interfaces for cloud-native and AI workloads.

## Quick Use

```bash
# Deploy a minimal cluster with Docker Compose
git clone https://github.com/cubefs/cubefs.git
cd cubefs/docker/docker-compose
docker compose up -d

# Mount a CubeFS volume via FUSE
cfs-client -c /etc/cubefs/fuse.json

# Or access data via the built-in S3 gateway
aws s3 ls --endpoint-url http://localhost:9000
```

## Introduction

CubeFS is a cloud-native distributed storage system that provides unified access through POSIX, S3-compatible, and HDFS interfaces. Originally developed at JD.com and now a CNCF graduated project, it is designed for large-scale containerized environments, AI/ML training pipelines, and big data analytics.

## What CubeFS Does

- Stores and serves data through POSIX mount, S3 API, and HDFS interface simultaneously
- Scales storage capacity and throughput horizontally by adding data nodes
- Supports multi-tenancy with per-volume quotas and access control
- Provides erasure coding and multi-replica modes for tunable durability vs. cost
- Integrates with Kubernetes via CSI driver for persistent volume provisioning

## Architecture Overview

CubeFS separates metadata and data into independent subsystems:

- A metadata subsystem (MetaNode cluster with Raft consensus) manages the file namespace.
- A data subsystem (DataNode cluster) stores file chunks using either multi-replica or erasure coding.
- A master service coordinates cluster topology, volume management, and node health.
- Client libraries (FUSE, S3 gateway, HDFS adapter) translate protocol requests into internal RPCs.

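
The split between namespace and chunk storage described in the Architecture Overview can be illustrated with a toy in-memory model. This is only a sketch for intuition, not CubeFS code: the class names (`ToyMetaNode`, `ToyDataNode`, `ToyClient`), the 4-byte chunk size, and the round-robin replica placement are all assumptions made up for the example, and the master service and Raft consensus are left out entirely.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to trace


@dataclass
class ToyDataNode:
    """Hypothetical stand-in for a data node: stores raw chunks by id."""
    chunks: dict = field(default_factory=dict)


@dataclass
class ToyMetaNode:
    """Hypothetical stand-in for the metadata subsystem: path -> chunk locations."""
    namespace: dict = field(default_factory=dict)


class ToyClient:
    """Hypothetical client that splits files into chunks and replicates them."""

    def __init__(self, meta, data_nodes, replicas=3):
        self.meta = meta
        self.data_nodes = data_nodes
        self.replicas = replicas

    def write(self, path, payload):
        locations = []
        for offset in range(0, len(payload), CHUNK_SIZE):
            index = offset // CHUNK_SIZE
            chunk_id = f"{path}#{index}"
            # Assumed placement policy: round-robin across data nodes.
            targets = [(index + r) % len(self.data_nodes)
                       for r in range(self.replicas)]
            for t in targets:
                self.data_nodes[t].chunks[chunk_id] = payload[offset:offset + CHUNK_SIZE]
            locations.append((chunk_id, targets))
        # Only the chunk map goes to metadata; file bytes live on data nodes.
        self.meta.namespace[path] = locations

    def read(self, path):
        out = bytearray()
        for chunk_id, targets in self.meta.namespace[path]:
            for t in targets:
                chunk = self.data_nodes[t].chunks.get(chunk_id)
                if chunk is not None:  # first surviving replica wins
                    out += chunk
                    break
            else:
                raise IOError(f"all replicas lost for {chunk_id}")
        return bytes(out)


if __name__ == "__main__":
    nodes = [ToyDataNode() for _ in range(4)]
    client = ToyClient(ToyMetaNode(), nodes, replicas=3)
    client.write("/demo.txt", b"hello cubefs")
    nodes[0].chunks.clear()  # simulate losing one data node
    print(client.read("/demo.txt"))
```

Because the metadata side only records where chunks live, losing a data node in this model does not affect the namespace; reads fall through to another replica.
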
## Self-Hosting & Configuration

- Deploy master, meta, and data nodes via Docker Compose, Helm chart, or Ansible playbooks
- Minimum viable cluster: 1 master, 3 meta nodes, and 3 data nodes (4 data nodes for erasure coding)
- Configure each volume with a replication factor or an erasure coding policy to suit its use case
- Use the Kubernetes CSI driver for dynamic PV provisioning in container workloads
- Monitor with the built-in Prometheus metrics endpoint and Grafana dashboards

## Key Features

- Triple-protocol access (POSIX, S3, HDFS) from a single storage pool
- Erasure coding can cut raw storage overhead to about 1.5x, versus 3x for three-way replication (the exact ratio depends on the data-to-parity shard layout)
- Multi-tenant volume isolation with per-tenant quotas and ACLs
- Kubernetes CSI integration for persistent volume management
- CNCF graduated project with active community and enterprise production deployments

## Comparison with Similar Tools

- **Ceph**: Mature and feature-rich but operationally complex; CubeFS aims for simpler deployment
- **MinIO**: S3-only object storage, no POSIX or HDFS interface
- **SeaweedFS**: Lightweight blob store with FUSE mount but no HDFS compatibility
- **JuiceFS**: POSIX filesystem backed by object storage; CubeFS manages its own data nodes
- **Longhorn**: Kubernetes block storage only, no file or object interface

## FAQ

**Q: Is CubeFS suitable for AI/ML training data?**
A: Yes. Its POSIX interface allows direct mount into training containers, and its S3 gateway supports frameworks that read from object storage.

**Q: How does CubeFS handle node failures?**
A: Multi-replica mode re-replicates chunks automatically. Erasure coding mode reconstructs missing shards from parity data across surviving nodes.

**Q: Can CubeFS run on commodity hardware?**
A: Yes. It is designed for standard x86 servers with local SSDs or HDDs and does not require specialized storage hardware.

**Q: What is the minimum cluster size?**
A: A functional cluster needs at least 1 master, 3 meta nodes, and 3 data nodes (or 4 for erasure coding).

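
The 1.5x versus 3x comparison in Key Features follows from simple shard arithmetic: raw overhead is total shards divided by data shards. The sketch below works this out for a 4 data + 2 parity layout, which is an assumption chosen because it yields 1.5x; the source does not state which layout CubeFS uses, and the function names are hypothetical.

```python
from fractions import Fraction


def erasure_overhead(data_shards, parity_shards):
    """Raw bytes stored per logical byte for a data+parity shard layout."""
    if data_shards <= 0 or parity_shards < 0:
        raise ValueError("need at least one data shard and no negative parity")
    return Fraction(data_shards + parity_shards, data_shards)


def replication_overhead(replicas):
    """Raw bytes stored per logical byte when every chunk is copied `replicas` times."""
    if replicas <= 0:
        raise ValueError("need at least one replica")
    return Fraction(replicas, 1)


def raw_capacity_tb(logical_tb, overhead):
    """Raw capacity needed to hold `logical_tb` of user data."""
    return float(logical_tb * overhead)


if __name__ == "__main__":
    ec = erasure_overhead(4, 2)    # assumed 4+2 layout
    rep = replication_overhead(3)  # three-way replication
    print(f"EC 4+2 overhead: {float(ec)}x, raw for 100 TB: {raw_capacity_tb(100, ec)} TB")
    print(f"3 replicas overhead: {float(rep)}x, raw for 100 TB: {raw_capacity_tb(100, rep)} TB")
```

Under these assumptions, 100 TB of user data needs 150 TB of raw capacity with erasure coding and 300 TB with three replicas. The trade-off is that erasure coding has to reconstruct missing shards from parity after a failure, rather than simply reading a surviving copy.
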
## Sources

- https://github.com/cubefs/cubefs
- https://cubefs.io/docs/

---

Source: https://tokrepo.com/en/workflows/asset-80593d34
Author: AI Open Source