What Polars Does
- Eager and lazy evaluation — choose per query
- Query optimization — predicate pushdown, projection pushdown, common subexpression elimination
- Multi-threaded — parallel execution on all cores
- Arrow-native — Apache Arrow columnar format, zero-copy
- Streaming — process larger-than-RAM datasets
- Expressions — composable, type-safe column expressions
- IO — CSV, Parquet, JSON, Arrow IPC, Avro, databases, cloud storage (S3, GCS, Azure)
- SQL interface — `pl.SQLContext` for SQL queries on DataFrames
- Group by — fast aggregation with rich expression API
- Window functions — rolling, expanding, partition-based
Architecture
Polars has a Rust core with Python bindings built on PyO3. In lazy mode, a query becomes a logical plan, which the optimizer rewrites before it is lowered to a physical plan and executed in parallel. Data is stored in Apache Arrow chunked arrays for cache-friendly, SIMD-accelerated operations.
Comparison
| Library | Language | Relative speed | Lazy evaluation | Memory model |
|---|---|---|---|---|
| Polars | Rust + Python | Fastest | Yes | Arrow |
| pandas | Python (C ext) | Slow | No | NumPy |
| Spark DataFrame | Scala/Python | Fast (distributed) | Yes | JVM |
| DuckDB | C++ | Very fast | Yes | Columnar |
| Vaex | C++ + Python | Fast | Yes | Memory-mapped |
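FAQ
Q: Polars vs pandas? A: Polars is 5-100x faster on almost every benchmark (multi-threaded Rust vs. single-threaded Python). The APIs are not compatible, but the Polars expression API is more consistent and has fewer pitfalls. Polars is recommended for new projects.
Q: How much data can it handle? A: Lazy evaluation combined with streaming mode can process datasets far larger than memory. Terabyte-scale Parquet files on a single machine are not a problem.
Q: How does it compare with DuckDB? A: Polars is a DataFrame library (primarily a Python API), while DuckDB is a SQL database engine. Both are fast, and they can be used together.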
Sources and Acknowledgments
- Docs: https://docs.pola.rs
- GitHub: https://github.com/pola-rs/polars
- License: MIT