Trino — Fast Distributed SQL Query Engine for Data Lakes
The federated SQL engine formerly known as PrestoSQL. Query S3/HDFS/Iceberg/Delta/Hudi, MySQL, Postgres, Kafka, Cassandra and dozens more with ANSI SQL — in seconds, at petabyte scale.
Ready-to-run agent install
This asset can be installed after the agent chooses its runtime, checks the plan, and runs the matching command.
npx -y tokrepo@latest install 976e6a2f-3920-11f1-9bc6-00163e2b0d79 --target codex
Run this after a dry run confirms the install plan.
What it is
Trino (formerly PrestoSQL) is a fast distributed SQL query engine designed for interactive analytics on data lakes and federated data sources. It executes ANSI SQL queries across Amazon S3, Apache Iceberg, Delta Lake, Hudi, MySQL, PostgreSQL, Kafka, Cassandra, and dozens of other connectors.
Trino targets data engineers and analysts who need to query data wherever it lives without moving it into a single warehouse. A single Trino query can join tables from S3 and PostgreSQL in real time.
How it saves time or tokens
Trino reduces the need for ETL pipelines that copy data between systems. Querying data in place with standard SQL cuts data movement costs and latency. Interactive queries often return results in seconds rather than the minutes or hours typical of batch ETL.
For AI data pipelines, Trino provides a single SQL interface to all data sources. LLMs can generate standard SQL without needing to know the underlying storage format.
How to use
- Deploy Trino with Docker:
docker run -d -p 8080:8080 --name trino trinodb/trino
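- Connect with the Trino CLI (this assumes a catalog named hive has already been configured; a freshly started container only has sample and utility catalogs such as tpch and memory):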
trino --server localhost:8080 --catalog hive --schema default
- Query data across sources:
SELECT o.order_id, c.name, o.total
FROM hive.sales.orders o
JOIN postgresql.public.customers c ON o.customer_id = c.id
WHERE o.created_at > DATE '2026-01-01'
ORDER BY o.total DESC
LIMIT 10;
- Configure catalogs for each data source in the etc/catalog/ directory.
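As a rough sketch of the catalog step above, the snippet below writes a properties file for the PostgreSQL connector. The property names follow the connector's documented settings, but the host, database, user, password, and the TRINO_ETC location are placeholders assumed for illustration and must be replaced with real values.

```shell
set -eu

# Assumed location of the Trino configuration directory; adjust to your deployment.
TRINO_ETC="${TRINO_ETC:-./etc}"
mkdir -p "$TRINO_ETC/catalog"

# The file name (minus .properties) becomes the catalog name used in SQL,
# so this file defines the "postgresql" catalog referenced in the join above.
# All connection values below are hypothetical placeholders.
cat > "$TRINO_ETC/catalog/postgresql.properties" <<'EOF'
connector.name=postgresql
connection-url=jdbc:postgresql://db.example.internal:5432/shop
connection-user=trino_reader
connection-password=change-me
EOF

echo "wrote $TRINO_ETC/catalog/postgresql.properties"
```

Trino reads catalog files at startup, so the server generally needs a restart to pick up a new one. With the Docker image, the file would typically be mounted into the container's configuration directory (documented as /etc/trino for the official image) rather than written on the host alone.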
Example
-- Query Iceberg tables on S3
SELECT date_trunc('month', event_time) AS month,
count(*) AS events,
count(DISTINCT user_id) AS users
FROM iceberg.analytics.events
WHERE event_time >= DATE '2026-01-01'
GROUP BY 1
ORDER BY 1;
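To see how Trino plans a query like the one above, it can be prefixed with EXPLAIN. The sketch below saves the statement to a file and runs it through the CLI when one is available. The file name explain_events.sql is arbitrary, and the server address is assumed to match the Docker step earlier.

```shell
set -eu

# Save the EXPLAIN statement to a file so it can be rerun from any client.
cat > explain_events.sql <<'EOF'
EXPLAIN
SELECT date_trunc('month', event_time) AS month,
       count(*) AS events,
       count(DISTINCT user_id) AS users
FROM iceberg.analytics.events
WHERE event_time >= DATE '2026-01-01'
GROUP BY 1
ORDER BY 1;
EOF

# Run the file only if the trino CLI is on PATH; localhost:8080 is an assumption.
if command -v trino >/dev/null 2>&1; then
  trino --server localhost:8080 --file explain_events.sql \
    || echo "query failed; check that the server is up and the iceberg catalog exists"
else
  echo "trino CLI not found; run explain_events.sql from any Trino client"
fi
```

The plan output is a reasonable place to check whether the date filter is applied at the table scan or in a separate filter step inside Trino.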
Common pitfalls
- Not configuring memory limits properly. Trino queries can consume large amounts of memory for joins and aggregations. Set query memory limits and kill queries that exceed thresholds.
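- Using Trino for transactional workloads. Trino is designed for analytical queries, not OLTP. Support for UPDATE, DELETE, and MERGE depends on the connector: table formats such as Iceberg and Delta Lake support them, while many other connectors are read-only or allow only limited row-level changes.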
- Ignoring connector-specific optimizations. Each connector has different pushdown capabilities. Learn which filters and aggregations each connector can handle natively for better performance.
- Failing to review release notes and community discussions before upgrading. Trino releases are frequent and any of them can include breaking changes that disrupt existing workflows. Pin versions in production and test upgrades in staging first.
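For the memory-limit pitfall above, a minimal sketch of the relevant settings is shown below. The property names are Trino's query-level memory and time limits; the sizes are illustrative placeholders, not recommendations. The snippet writes an example fragment instead of editing a live config.properties, and both the TRINO_ETC path and the fragment's file name are assumptions.

```shell
set -eu

# Assumed location of the Trino configuration directory; adjust to your deployment.
TRINO_ETC="${TRINO_ETC:-./etc}"
mkdir -p "$TRINO_ETC"

# Example fragment to merge into config.properties by hand.
# Values are placeholders; size them to your cluster's heap and workload.
cat > "$TRINO_ETC/memory-limits.example.properties" <<'EOF'
# Maximum memory one query may use across the whole cluster
query.max-memory=20GB
# Maximum memory one query may use on any single worker
query.max-memory-per-node=2GB
# Queries running longer than this are terminated
query.max-execution-time=30m
EOF

echo "wrote $TRINO_ETC/memory-limits.example.properties"
```

A query that exceeds a memory limit fails with an error instead of starving other queries, which is usually the safer outcome on a shared cluster.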
Frequently Asked Questions
What is the difference between Trino and Presto?
Trino is the continuation of PrestoSQL, the fork started by the original Presto creators after they left Facebook. PrestoDB is the original project, now governed by the Presto Foundation. Both share the same origins but have diverged. Trino has a more active community and faster release cadence.
Can Trino replace a data warehouse?
Trino can serve as the query engine for lakehouse architectures, querying data on S3 with Iceberg or Delta Lake. For some workloads, this replaces a traditional data warehouse. However, Trino does not store data itself, and table layout is managed by the table format. Pair it with Iceberg for a complete lakehouse.
How does Trino query multiple data sources?
Trino uses a connector architecture. Each data source has a connector that translates SQL into source-native operations. A single query can reference tables from different connectors, and Trino performs the join across sources in memory on its workers.
Can Trino query streaming data?
Trino has connectors for Kafka and other streaming systems. You can query real-time data alongside historical data in the same SQL query. However, Trino is not a streaming engine; it executes point-in-time queries.
How does Trino scale?
Trino scales horizontally by adding worker nodes. Production deployments handle petabytes of data and thousands of concurrent queries. Companies like Netflix, LinkedIn, and Lyft use Trino for large-scale analytics.