Introduction
Apache Parquet is a columnar file format that organizes data by column rather than by row. This layout enables analytical query engines to read only the columns they need, skip irrelevant row groups via predicate pushdown, and achieve high compression ratios through column-level encoding.
What Apache Parquet Does
- Stores tabular data in a columnar layout optimized for analytical read patterns
- Supports predicate pushdown and column pruning to minimize I/O
- Provides multiple encoding and compression options per column
- Handles complex nested data structures with Dremel-style repetition and definition levels
- Serves as the standard storage format for Spark, Hive, Flink, DuckDB, Polars, and many more
Architecture Overview
A Parquet file is divided into row groups, each containing column chunks. Each column chunk stores a sequence of data pages with optional dictionary, repetition-level, and definition-level pages for nested types. The file footer contains schema metadata, row group offsets, and column statistics (min, max, null count) used for predicate pushdown. Encoding schemes like dictionary encoding, run-length encoding, and delta encoding are applied per column based on data characteristics. Compression codecs (Snappy, Zstd, LZ4, Gzip) are applied per page.
Self-Hosting & Configuration
- Most analytics engines (Spark, DuckDB, Polars, pandas) read and write Parquet natively
- Use pyarrow or fastparquet for Python-based Parquet operations
- Configure row group size to balance between query parallelism and file overhead
- Select compression codecs per column: Zstd for balanced ratio/speed, Snappy for low latency
- Enable bloom filters in the file footer for selective row group skipping
Key Features
- Columnar layout reads only needed columns, reducing I/O for wide-table analytics
- Predicate pushdown uses column statistics to skip entire row groups
- Rich encoding options (dictionary, RLE, delta) achieve high compression without external codecs
- Nested schema support via Dremel encoding handles maps, lists, and structs
- Universal format supported by virtually every analytics engine and cloud data platform
Comparison with Similar Tools
- CSV — row-oriented text format; Parquet is binary, compressed, and orders of magnitude faster for analytics
- Avro — row-oriented binary format for streaming; Parquet is columnar for analytical queries
- ORC — Hive-native columnar format; Parquet has broader ecosystem adoption across engines
- Arrow IPC — in-memory columnar format; Parquet is designed for on-disk storage with compression
- JSON — human-readable but large and slow to parse; Parquet is compact and typed
FAQ
Q: When should I use Parquet instead of CSV? A: Use Parquet whenever you run analytical queries on large datasets. Parquet files are smaller, faster to read, and support schema enforcement and column pruning.
Q: Can Parquet handle nested data like JSON? A: Yes. Parquet supports complex nested types including structs, maps, and lists using Dremel-style encoding with repetition and definition levels.
Q: How does Parquet achieve compression? A: Parquet applies column-level encoding (dictionary, RLE, delta) first, then page-level compression (Zstd, Snappy, LZ4). Columns with repeated values compress very well.
Q: Is Parquet suitable for streaming workloads? A: Parquet is optimized for batch reads. For streaming, Avro or row-oriented formats are more suitable; Parquet is ideal as the landing format for periodic compaction from streaming sources.