How do I install Presto — Distributed SQL Engine for Interactive Analytics?

Visit the asset page on TokRepo and click "Copy for agent" to get the installation instructions. Most assets can be installed with a single command.

Presto — Distributed SQL Engine for Interactive Analytics

Introduction

Presto (PrestoDB) is a distributed SQL query engine originally developed at Facebook for running fast, interactive analytical queries against petabyte-scale datasets. Unlike traditional warehouses, Presto separates compute from storage and federates many sources — Hive, Iceberg, MySQL, Postgres, Cassandra, Kafka — through pluggable connectors.

What Presto Does

Executes ANSI SQL across heterogeneous data stores in one query.
Runs interactive, low-latency queries on HDFS/S3 data lakes.
Federates warehouses and operational databases using 20+ connectors.
Supports CTAS/INSERT into lakehouse tables (Iceberg, Hudi, Delta via extensions).
Exposes REST + JDBC/ODBC endpoints for BI tools like Tableau and Superset.

Architecture Overview

A Presto cluster has one coordinator and many workers. The coordinator parses SQL, plans distributed execution, and splits scans across workers that stream data pages through a pipelined execution model. Connectors adapt external sources to Presto's type system, while the discovery service tracks active workers. Memory is managed with per-node and query-level pools to keep the cluster stable under heavy concurrency.

Self-Hosting & Configuration

Packaged as presto-server tarball; deploy via Ansible, Kubernetes, or EMR.
Key config files: config.properties, node.properties, jvm.config, and per-catalog *.properties.
Tune query.max-memory, query.max-memory-per-node, and task.concurrency per workload.
Use hive.metastore-uri to point at a Hive Metastore, AWS Glue, or REST catalog.
Secure with Kerberos, TLS, LDAP/OAuth2, and system + Ranger authorizers.

Key Features

Interactive SQL latency on lake-scale data (seconds to minutes).
Rich connector ecosystem for federated analytics without ETL.
Cost-based optimizer with statistics from Hive, Iceberg, and Glue.
Materialized views and caching (with plugins like Presto cache and Alluxio).
Strong SQL: window functions, lambdas, geospatial, and approximate aggregates.

Comparison with Similar Tools

Trino — A community fork of Presto with faster feature iteration; PrestoDB stays aligned with Meta/LinkedIn.
Apache Spark SQL — Batch-centric with streaming; Presto is tuned for interactive latency.
Apache Drill — Schema-on-read file queries; Presto offers richer connectors and optimizer.
ClickHouse — Columnar OLAP DB you own; Presto federates across many existing stores instead.
BigQuery / Athena — Managed cloud analytics; Presto is the open engine you can self-host everywhere.

FAQ

Q: What is the difference between Presto and Trino? A: Same origin; Trino (formerly PrestoSQL) is an active community fork, while PrestoDB evolves at Meta, LinkedIn, and the Linux Foundation's Presto Foundation.

Q: Does Presto store data? A: No — it is a query engine only. Storage lives in S3, HDFS, Hive, Iceberg, RDBMS, and other catalogs.

Q: How many nodes do I need? A: A small cluster of 3–5 workers handles GB-scale analytics; large deployments run hundreds of workers.

Q: What clients can connect? A: JDBC, ODBC, Python (presto-python-client), Node.js, Go drivers, and the presto CLI.

Presto — Distributed SQL Engine for Interactive Analytics

Introduction

What Presto Does

Architecture Overview

Self-Hosting & Configuration

Key Features

Comparison with Similar Tools

FAQ

Sources

Discussion

Related Assets

Cortex — Horizontally Scalable Long-Term Storage for Prometheus

CUE — Validate, Define, and Generate Configuration with Types

Prometheus Operator — Kubernetes-Native Monitoring Stack Management