ScriptsApr 16, 2026·3 min read

Presto — Distributed SQL Engine for Interactive Analytics

Facebook-born distributed SQL engine for running fast, interactive queries against data lakes, warehouses, and federated sources.

Introduction

Presto (PrestoDB) is a distributed SQL query engine originally developed at Facebook for running fast, interactive analytical queries against petabyte-scale datasets. Unlike traditional warehouses, Presto separates compute from storage and federates many sources — Hive, Iceberg, MySQL, Postgres, Cassandra, Kafka — through pluggable connectors.

What Presto Does

  • Executes ANSI SQL across heterogeneous data stores in one query.
  • Runs interactive, low-latency queries on HDFS/S3 data lakes.
  • Federates warehouses and operational databases using 20+ connectors.
  • Supports CTAS/INSERT into lakehouse tables (Iceberg, Hudi, Delta via extensions).
  • Exposes REST + JDBC/ODBC endpoints for BI tools like Tableau and Superset.

Architecture Overview

A Presto cluster has one coordinator and many workers. The coordinator parses SQL, plans distributed execution, and splits scans across workers that stream data pages through a pipelined execution model. Connectors adapt external sources to Presto's type system, while the discovery service tracks active workers. Memory is managed with per-node and query-level pools to keep the cluster stable under heavy concurrency.

Self-Hosting & Configuration

  • Packaged as presto-server tarball; deploy via Ansible, Kubernetes, or EMR.
  • Key config files: config.properties, node.properties, jvm.config, and per-catalog *.properties.
  • Tune query.max-memory, query.max-memory-per-node, and task.concurrency per workload.
  • Use hive.metastore-uri to point at a Hive Metastore, AWS Glue, or REST catalog.
  • Secure with Kerberos, TLS, LDAP/OAuth2, and system + Ranger authorizers.

Key Features

  • Interactive SQL latency on lake-scale data (seconds to minutes).
  • Rich connector ecosystem for federated analytics without ETL.
  • Cost-based optimizer with statistics from Hive, Iceberg, and Glue.
  • Materialized views and caching (with plugins like Presto cache and Alluxio).
  • Strong SQL: window functions, lambdas, geospatial, and approximate aggregates.

Comparison with Similar Tools

  • Trino — A community fork of Presto with faster feature iteration; PrestoDB stays aligned with Meta/LinkedIn.
  • Apache Spark SQL — Batch-centric with streaming; Presto is tuned for interactive latency.
  • Apache Drill — Schema-on-read file queries; Presto offers richer connectors and optimizer.
  • ClickHouse — Columnar OLAP DB you own; Presto federates across many existing stores instead.
  • BigQuery / Athena — Managed cloud analytics; Presto is the open engine you can self-host everywhere.

FAQ

Q: What is the difference between Presto and Trino? A: Same origin; Trino (formerly PrestoSQL) is an active community fork, while PrestoDB evolves at Meta, LinkedIn, and the Linux Foundation's Presto Foundation.

Q: Does Presto store data? A: No — it is a query engine only. Storage lives in S3, HDFS, Hive, Iceberg, RDBMS, and other catalogs.

Q: How many nodes do I need? A: A small cluster of 3–5 workers handles GB-scale analytics; large deployments run hundreds of workers.

Q: What clients can connect? A: JDBC, ODBC, Python (presto-python-client), Node.js, Go drivers, and the presto CLI.

Sources

Discussion

Sign in to join the discussion.
No comments yet. Be the first to share your thoughts.

Related Assets