Scripts · Jul 2, 2026 · 3 min read

Apache Paimon — Streaming Data Lake Storage

Apache Paimon is a streaming data lake platform that supports both real-time streaming writes and high-performance batch reads using a lake format with changelog tracking.

Agent-ready

Agent install ready

This asset can be installed once you have chosen a runtime, reviewed the plan, and run the matching command.

Native · 98/100 · Policy: allow
Agent surface: Any MCP/CLI agent
Type: Skill
Installation: Single
Trust: Established
Entry point: Apache Paimon Overview
Direct install command:
npx -y tokrepo@latest install f5e9020a-75f0-11f1-9bc6-00163e2b0d79 --target codex

Run this after confirming the plan in a dry run.

Introduction

Apache Paimon is a lake format that brings real-time streaming capabilities to data lakes. Originally developed as Flink Table Store, it enables continuous data ingestion with changelog tracking while maintaining compatibility with batch query engines like Spark, Hive, and Trino.

What Apache Paimon Does

  • Provides a table format for data lakes with support for streaming write and batch read
  • Tracks row-level changelogs for incremental processing and CDC pipelines
  • Supports primary key tables with upsert semantics and append-only tables
  • Integrates natively with Apache Flink, Spark, Hive, Trino, and StarRocks
  • Stores data in columnar formats on S3, HDFS, or any Hadoop-compatible filesystem
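
To make the changelog idea concrete, here is a minimal sketch of how a downstream consumer might apply row-level changes to keep a keyed view up to date. It uses the row kinds from Flink changelog streams (+I, -U, +U, -D); the function name, the tuple layout, and the sample rows are illustrative assumptions, not part of Paimon's API.

```python
from typing import Dict, Iterable, Tuple

# Row kinds as used by Flink changelog streams:
# +I insert, -U update-before, +U update-after, -D delete.
# The tuple layout below is a hypothetical simplification for illustration.
ChangelogRow = Tuple[str, int, dict]  # (row kind, primary key, row values)


def apply_changelog(state: Dict[int, dict], changes: Iterable[ChangelogRow]) -> Dict[int, dict]:
    """Apply changelog rows to a keyed view, touching only the changed keys."""
    for kind, key, row in changes:
        if kind in ("+I", "+U"):
            state[key] = row          # insert, or new image of an updated row
        elif kind in ("-U", "-D"):
            state.pop(key, None)      # retract the old image, or delete the row
        else:
            raise ValueError(f"unknown row kind: {kind}")
    return state


changes = [
    ("+I", 1, {"name": "alice", "city": "Paris"}),
    ("+I", 2, {"name": "bob", "city": "Lyon"}),
    ("-U", 1, {"name": "alice", "city": "Paris"}),
    ("+U", 1, {"name": "alice", "city": "Nice"}),
    ("-D", 2, {"name": "bob", "city": "Lyon"}),
]
view = apply_changelog({}, changes)
print(view)  # {1: {'name': 'alice', 'city': 'Nice'}}
```

The consumer never rescans the table: it only processes the five change rows, which is what makes incremental and CDC pipelines cheap to run.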

Architecture Overview

Paimon organizes data into snapshots. Each snapshot references manifest lists, whose manifest files in turn record the data files and changelog files that make up the table at that point in time. Primary key tables use an LSM-tree structure: incoming records are buffered in memory, flushed to storage as sorted runs, and merged during compaction. Changelog files record INSERT, UPDATE, and DELETE operations, so downstream consumers can process only the incremental changes. For metadata management, the catalog layer uses either Hive Metastore or a filesystem catalog.
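
The LSM-tree behaviour described above can be sketched in a few lines. This is a simplified model, not Paimon's implementation: the record layout, the `flush` and `compact` names, and the use of `None` as a delete marker are all assumptions made for illustration.

```python
import heapq
from typing import Iterable, Iterator, List, Optional, Tuple

# One record: (primary key, sequence number, value); value None marks a delete.
# This layout is a hypothetical simplification of an LSM record.
Record = Tuple[int, int, Optional[str]]


def flush(buffer: Iterable[Record]) -> List[Record]:
    """Turn a write buffer into one sorted run, ordered by key, then sequence."""
    return sorted(buffer, key=lambda r: (r[0], r[1]))


def compact(runs: List[List[Record]]) -> List[Record]:
    """Merge sorted runs into one, keeping the newest record per key.

    Delete markers are dropped at the end, which is only safe when every
    run is included in the merge (a full compaction).
    """
    merged: Iterator[Record] = heapq.merge(*runs, key=lambda r: (r[0], r[1]))
    result: List[Record] = []
    for record in merged:
        if result and result[-1][0] == record[0]:
            result[-1] = record       # same key: higher sequence number wins
        else:
            result.append(record)
    return [r for r in result if r[2] is not None]


run_a = flush([(2, 1, "bob"), (1, 2, "alice")])
run_b = flush([(1, 3, "alice-v2"), (3, 4, "carol")])
run_c = flush([(2, 5, None)])  # delete key 2
print(compact([run_a, run_b, run_c]))
# [(1, 3, 'alice-v2'), (3, 4, 'carol')]
```

Because every run is already sorted, the merge is a single streaming pass, which is why writes stay cheap and the cost of deduplication is deferred to compaction.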

Self-Hosting & Configuration

  • Add the Paimon connector JAR to your Flink or Spark installation
  • Create a Paimon catalog pointing to an S3, HDFS, or local warehouse directory
  • Configure a Hive Metastore catalog for cross-engine metadata sharing
  • Tune compaction settings based on write throughput and query latency requirements
  • Set up snapshot expiration policies to manage storage growth
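
As a rough illustration of the steps above, the sketch below renders Flink SQL statements for a filesystem catalog, a Hive Metastore catalog, and a table with compaction and snapshot retention options. The helper functions, catalog names, host, bucket, and table schema are hypothetical. The option keys are the ones I recall from the Paimon documentation and should be checked against the version you deploy. The generated SQL is meant to be pasted into the Flink SQL client; nothing here talks to a cluster.

```python
from typing import Dict


def render_options(options: Dict[str, str]) -> str:
    """Render the body of a Flink SQL WITH clause from string options."""
    return ",\n".join(f"    '{k}' = '{v}'" for k, v in options.items())


def create_catalog_sql(name: str, options: Dict[str, str]) -> str:
    """Build a CREATE CATALOG statement (hypothetical helper)."""
    return f"CREATE CATALOG {name} WITH (\n{render_options(options)}\n);"


# Filesystem catalog: metadata lives alongside the data in the warehouse directory.
fs_catalog = create_catalog_sql(
    "paimon_fs",
    {"type": "paimon", "warehouse": "file:/tmp/paimon"},
)

# Hive Metastore catalog, so other engines can discover the same tables.
hive_catalog = create_catalog_sql(
    "paimon_hive",
    {
        "type": "paimon",
        "metastore": "hive",
        "uri": "thrift://metastore-host:9083",       # hypothetical address
        "warehouse": "s3://example-bucket/paimon",   # hypothetical bucket
    },
)

# Table-level tuning: compaction trigger and snapshot retention.
table_sql = (
    "CREATE TABLE orders (\n"
    "    order_id BIGINT,\n"
    "    status STRING,\n"
    "    PRIMARY KEY (order_id) NOT ENFORCED\n"
    ") WITH (\n"
    + render_options({
        "num-sorted-run.compaction-trigger": "5",
        "snapshot.num-retained.min": "10",
        "snapshot.time-retained": "1 h",
    })
    + "\n);"
)

print(fs_catalog)
print(hive_catalog)
print(table_sql)
```

Keeping the options in dictionaries makes it easy to diff settings between environments before running the statements.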

Key Features

  • Unified batch and streaming storage with changelog tracking for incremental reads
  • LSM-tree-based primary key tables with efficient upsert and partial update support
  • Time travel queries via snapshot management for reproducible analytics
  • Schema evolution support including column additions and type widening
  • Cross-engine compatibility with Flink, Spark, Hive, Trino, and StarRocks
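
Partial update is the least obvious item in this list, so here is a small sketch of the merge rule as I understand it: a later write overwrites only the columns it actually sets, and null values leave the stored value untouched. The function names and sample rows are assumptions for illustration; Paimon applies this logic inside its merge engine rather than through any such Python API.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[object]]


def partial_update(existing: Optional[Row], incoming: Row) -> Row:
    """Overwrite only the fields the incoming row sets; None means 'not provided'."""
    merged: Row = dict(existing or {})
    for column, value in incoming.items():
        if value is not None:
            merged[column] = value
        else:
            merged.setdefault(column, None)  # keep the stored value if there is one
    return merged


def upsert_all(rows: List[Row], key: str) -> Dict[object, Row]:
    """Fold a sequence of writes into one row per primary key (hypothetical helper)."""
    table: Dict[object, Row] = {}
    for row in rows:
        table[row[key]] = partial_update(table.get(row[key]), row)
    return table


writes = [
    {"user_id": 1, "email": "a@example.com", "city": None},  # from a signup stream
    {"user_id": 1, "email": None, "city": "Paris"},          # from a profile stream
]
print(upsert_all(writes, "user_id"))
# {1: {'user_id': 1, 'email': 'a@example.com', 'city': 'Paris'}}
```

This lets two independent streams each fill in their own columns of a wide row without a join.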

Comparison with Similar Tools

  • Delta Lake — Spark-centric with batch focus; Paimon is designed for Flink streaming-first workloads
  • Apache Iceberg — general-purpose table format; Paimon adds native changelog tracking for CDC
  • Apache Hudi — supports incremental processing; Paimon uses LSM-trees for higher streaming write throughput
  • Apache Kafka — streaming transport; Paimon provides persistent lake storage with SQL query support
  • ClickHouse — OLAP engine; Paimon is a storage format that feeds into multiple query engines

FAQ

Q: How does Paimon differ from Apache Iceberg? A: Paimon is optimized for streaming writes with built-in changelog tracking via LSM-trees, while Iceberg focuses on batch-oriented table management. Paimon natively produces changelogs for incremental downstream processing.

Q: Can I query Paimon tables with Spark? A: Yes. Paimon provides Spark connectors for both reading and writing. Tables created by Flink are fully readable by Spark and vice versa.

Q: What storage backends does Paimon support? A: Paimon supports S3, HDFS, Azure Blob Storage, Google Cloud Storage, Alibaba Cloud OSS, and local filesystems via the Hadoop FileSystem interface.

Q: Is Paimon production-ready? A: Apache Paimon has graduated from the Apache Incubator to a top-level Apache project and is used in production by organizations processing streaming data at scale.
