Introduction
Presto (PrestoDB) is a distributed SQL query engine originally developed at Facebook for running fast, interactive analytical queries against petabyte-scale datasets. Unlike traditional warehouses, Presto separates compute from storage and federates many sources — Hive, Iceberg, MySQL, Postgres, Cassandra, Kafka — through pluggable connectors.
What Presto Does
- Executes ANSI SQL across heterogeneous data stores in one query.
- Runs interactive, low-latency queries on HDFS/S3 data lakes.
- Federates warehouses and operational databases using 20+ connectors.
- Supports CTAS/INSERT into lakehouse tables (Iceberg, Hudi, Delta via extensions).
- Exposes REST + JDBC/ODBC endpoints for BI tools like Tableau and Superset.
Architecture Overview
A Presto cluster has one coordinator and many workers. The coordinator parses SQL, plans distributed execution, and splits scans across workers that stream data pages through a pipelined execution model. Connectors adapt external sources to Presto's type system, while the discovery service tracks active workers. Memory is managed with per-node and query-level pools to keep the cluster stable under heavy concurrency.
Self-Hosting & Configuration
- Packaged as
presto-servertarball; deploy via Ansible, Kubernetes, or EMR. - Key config files:
config.properties,node.properties,jvm.config, and per-catalog*.properties. - Tune
query.max-memory,query.max-memory-per-node, andtask.concurrencyper workload. - Use
hive.metastore-urito point at a Hive Metastore, AWS Glue, or REST catalog. - Secure with Kerberos, TLS, LDAP/OAuth2, and system + Ranger authorizers.
Key Features
- Interactive SQL latency on lake-scale data (seconds to minutes).
- Rich connector ecosystem for federated analytics without ETL.
- Cost-based optimizer with statistics from Hive, Iceberg, and Glue.
- Materialized views and caching (with plugins like Presto cache and Alluxio).
- Strong SQL: window functions, lambdas, geospatial, and approximate aggregates.
Comparison with Similar Tools
- Trino — A community fork of Presto with faster feature iteration; PrestoDB stays aligned with Meta/LinkedIn.
- Apache Spark SQL — Batch-centric with streaming; Presto is tuned for interactive latency.
- Apache Drill — Schema-on-read file queries; Presto offers richer connectors and optimizer.
- ClickHouse — Columnar OLAP DB you own; Presto federates across many existing stores instead.
- BigQuery / Athena — Managed cloud analytics; Presto is the open engine you can self-host everywhere.
FAQ
Q: What is the difference between Presto and Trino? A: Same origin; Trino (formerly PrestoSQL) is an active community fork, while PrestoDB evolves at Meta, LinkedIn, and the Linux Foundation's Presto Foundation.
Q: Does Presto store data? A: No — it is a query engine only. Storage lives in S3, HDFS, Hive, Iceberg, RDBMS, and other catalogs.
Q: How many nodes do I need? A: A small cluster of 3–5 workers handles GB-scale analytics; large deployments run hundreds of workers.
Q: What clients can connect?
A: JDBC, ODBC, Python (presto-python-client), Node.js, Go drivers, and the presto CLI.