Apr 16, 2026 · 3 min read

Presto — Distributed SQL Engine for Interactive Analytics

Facebook-born distributed SQL engine for running fast, interactive queries against data lakes, warehouses, and federated sources.

TL;DR
Presto runs fast interactive SQL queries against data lakes, warehouses, and federated sources at petabyte scale.
§01

What it is

Presto is an open-source distributed SQL query engine originally developed at Facebook (now Meta). It runs interactive analytical queries against data sources ranging from HDFS and S3 data lakes to relational databases, Kafka streams, and Elasticsearch indexes. A single Presto query can join data across multiple sources without moving it.

Presto targets data engineers, analysts, and platform teams who need fast ad-hoc queries on large datasets. Because it separates compute from storage, you do not have to load data into a separate analytical database before querying it.

§02

How it saves time or tokens

Without Presto, querying across multiple data sources requires ETL pipelines to move data into a single warehouse. Presto federates queries directly against the source systems, eliminating data movement. Queries that would take minutes in batch-oriented engines like Hive return results in seconds through Presto's in-memory execution model.

§03

How to use

  1. Deploy Presto with a coordinator and one or more worker nodes.
  2. Configure connectors for your data sources (S3, MySQL, PostgreSQL, etc.).
  3. Run SQL queries via the Presto CLI, JDBC, or any SQL client.
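Connectors are configured as catalog property files on the coordinator and workers. A minimal sketch for a MySQL catalog, assuming a server at db.example.internal (the file name, host, and credentials are placeholders):

```properties
# etc/catalog/mysql.properties
connector.name=mysql
connection-url=jdbc:mysql://db.example.internal:3306
connection-user=presto
connection-password=secret
```

After restarting the cluster, the tables become queryable as mysql.&lt;schema&gt;.&lt;table&gt;.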
# Start Presto CLI
presto --server localhost:8080 --catalog hive --schema default

-- Query data in S3 via Hive connector
SELECT date, count(*) as events
FROM hive.analytics.events
WHERE date >= DATE '2026-04-01'
GROUP BY date
ORDER BY date;

-- Federated query: join S3 data with MySQL
SELECT u.name, count(e.event_id)
FROM hive.analytics.events e
JOIN mysql.app.users u ON e.user_id = u.id
GROUP BY u.name
ORDER BY count(e.event_id) DESC
LIMIT 10;
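To see how Presto will distribute a query before running it, you can prefix it with EXPLAIN. A sketch against the same hypothetical hive.analytics.events table used above:

```sql
-- Show the distributed plan: stages, exchanges, and
-- which predicates are pushed down to the connector
EXPLAIN (TYPE DISTRIBUTED)
SELECT date, count(*) AS events
FROM hive.analytics.events
WHERE date >= DATE '2026-04-01'
GROUP BY date;
```

The output lists the plan fragments each worker executes, which is useful for checking that partition pruning and predicate pushdown are actually happening.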
§04

Example

-- Create a table backed by S3 Parquet files
CREATE TABLE hive.analytics.pageviews (
  page_url VARCHAR,
  user_id BIGINT,
  timestamp TIMESTAMP
)
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/pageviews/'
);

-- Query it immediately
SELECT page_url, count(*) as views
FROM hive.analytics.pageviews
WHERE timestamp >= TIMESTAMP '2026-04-01 00:00:00'
GROUP BY page_url
ORDER BY views DESC
LIMIT 20;
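The table above is unpartitioned; for large datasets, the Hive connector also supports partitioned tables, which let Presto skip irrelevant data entirely. A hedged variant of the same example (the ds partition column is an assumption, not part of the original schema):

```sql
-- Partition by date string so queries filtering on ds
-- only read the matching S3 prefixes
CREATE TABLE hive.analytics.pageviews_partitioned (
  page_url VARCHAR,
  user_id BIGINT,
  ds VARCHAR
)
WITH (
  format = 'PARQUET',
  partitioned_by = ARRAY['ds'],
  external_location = 's3://my-bucket/pageviews_partitioned/'
);
```

Note that the Hive connector requires partition columns to appear last in the column list.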
§05

Common pitfalls

  • Presto is not designed for ETL or long-running batch jobs. Queries that process terabytes of data may hit memory limits on workers. Use appropriate partitioning and predicate pushdown.
  • The Hive connector requires a Hive Metastore service for table metadata, even if you do not use Hive for processing. This is an extra component to deploy and maintain.
  • Federated queries across slow connectors (e.g., REST APIs) can bottleneck the entire query. Materialize slow sources into a fast store before joining.
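For the Hive Metastore dependency mentioned above, the catalog configuration is a small properties file. A minimal sketch, assuming a metastore reachable at metastore.internal:9083 (the host and file path are placeholders):

```properties
# etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.internal:9083
```

Depending on your Presto version, managed alternatives such as AWS Glue can also serve as the metastore.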

Frequently Asked Questions

What is the difference between Presto and Trino?

Trino is a fork of Presto created by the original Presto creators after they left Facebook. Both projects share the same core architecture. Trino has a more active open-source community and faster release cadence. Presto continues to be developed by Meta. Feature sets are similar but diverging over time.

Can Presto replace my data warehouse?

Presto can serve as a query layer on top of your data lake, reducing the need for a separate warehouse for ad-hoc analytics. However, it does not provide built-in storage, indexing, or data management features that warehouses like Snowflake or BigQuery offer. It complements rather than replaces a warehouse.

What data sources can Presto connect to?

Presto has connectors for HDFS, S3, MySQL, PostgreSQL, MongoDB, Elasticsearch, Kafka, Redis, Cassandra, Google Sheets, and many more. Custom connectors can be built using the Presto SPI (Service Provider Interface).

How does Presto handle large queries?

Presto distributes query execution across worker nodes. Each worker processes a partition of the data in parallel. For very large queries, you can add more workers to increase parallelism. Memory-intensive queries may require spill-to-disk configuration to avoid out-of-memory failures.
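The spill-to-disk behavior mentioned above is enabled through cluster configuration. A hedged sketch of the relevant config.properties entries (the path is a placeholder, and exact property names have varied across Presto versions, so check the documentation for yours):

```properties
# etc/config.properties
experimental.spill-enabled=true
experimental.spiller-spill-path=/mnt/presto-spill
experimental.max-spill-per-node=100GB
```

With spill enabled, memory-intensive operators such as aggregations and joins write intermediate state to local disk instead of failing with an out-of-memory error, at the cost of slower execution.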

Is Presto suitable for real-time queries?

Presto is designed for interactive (seconds-level) latency on analytical queries. It is not a real-time streaming engine. For sub-second latency, consider a streaming database. Presto excels at ad-hoc exploratory queries where response times of 1-30 seconds are acceptable.
