Trino — Fast Distributed SQL Query Engine for Data Lakes

Introduction

Trino is a high-performance distributed SQL engine built for interactive analytics across heterogeneous data sources. It began in 2019 as PrestoSQL, a fork of Presto led by the original creators, and was renamed Trino in December 2020. It is used at Netflix, LinkedIn, Pinterest, Shopify and many others to query petabytes of data with ANSI SQL.

What Trino Does

Executes standard ANSI SQL across 60+ connectors (Hive, Iceberg, Delta Lake, Hudi, Kafka, Kudu, MySQL, PostgreSQL, Cassandra, Redis, Elasticsearch, Oracle, MongoDB...) and reads lake data stored on object stores such as S3 and GCS.
Pushes predicates, projections and aggregates into source systems when possible.
Runs distributed joins, window functions, CTEs, recursive queries and SQL-native JSON.
Supports row- and column-level access control with Ranger, OPA or file rules.
Provides a REST API and JDBC/ODBC/Python/Go clients for BI and apps.
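
As a rough illustration of the REST API mentioned above, the sketch below submits a statement to /v1/statement and then follows the nextUri link in each reply until none remains. The base URL, the user name and the injectable transport hook are assumptions made for illustration. Retries, authentication and cancellation are left out, and production code would normally use one of the official clients instead.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

# A transport takes (method, url, body, headers) and returns the decoded JSON reply.
# It is injectable so the polling loop can be exercised without a running cluster.
Transport = Callable[[str, str, Optional[bytes], dict], dict]


def http_transport(method: str, url: str, body: Optional[bytes], headers: dict) -> dict:
    """Send one HTTP request with the standard library and decode the JSON reply."""
    request = urllib.request.Request(url, data=body, headers=headers, method=method)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_query(sql: str, base_url: str, user: str,
              transport: Transport = http_transport) -> Iterator[list]:
    """Submit a statement, then follow nextUri links, yielding rows as they arrive."""
    headers = {"X-Trino-User": user}
    reply = transport("POST", base_url.rstrip("/") + "/v1/statement",
                      sql.encode("utf-8"), headers)
    while True:
        if "error" in reply:
            raise RuntimeError(reply["error"].get("message", "query failed"))
        # A reply may or may not carry a batch of rows.
        yield from reply.get("data", [])
        next_uri = reply.get("nextUri")
        if next_uri is None:
            return  # no nextUri means the query has finished
        reply = transport("GET", next_uri, None, headers)
```

Against a local cluster this might be called as list(run_query("SELECT 1", "http://localhost:8080", "alice")), where both the address and the user are placeholders.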

Architecture Overview

A coordinator parses and plans each query, breaking it into stages of tasks that run on a cluster of workers and exchange pages of columnar data with one another. Each connector provides metadata, splits and page sources so the engine can stream data in parallel. Trino is purely an execution engine: it stores nothing itself, so you can spin it up and down freely and scale horizontally on Kubernetes, EC2 or bare metal.
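
The division of labour described above can be pictured with a toy model. This is not Trino's implementation, only a sketch of the idea: a coordinator function cuts the input into splits, worker tasks process the splits in parallel and return partial aggregates, and a final step merges them. All names and the sample data are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def make_splits(rows, split_size):
    """Coordinator side: cut the input into independent splits."""
    return [rows[i:i + split_size] for i in range(0, len(rows), split_size)]


def worker_task(split):
    """Worker side: scan one split, filter it, and return a partial sum."""
    return sum(amount for region, amount in split if region == "EU")


def run(rows, split_size=2, workers=3):
    """Plan the splits, fan them out to workers, then merge the partial results."""
    splits = make_splits(rows, split_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(worker_task, splits))
    # Final stage: combine the partial aggregates into one answer.
    return sum(partials)


# Roughly equivalent to: SELECT sum(amount) FROM orders WHERE region = 'EU'
orders = [("EU", 10), ("US", 5), ("EU", 7), ("EU", 1), ("US", 9)]
total = run(orders)
```

Because each split is processed independently, adding workers raises throughput without changing the result, which is the property that lets a stateless engine scale horizontally.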

Self-Hosting & Configuration

Launch with Helm (trino/trino), Docker, or a tarball install, with one coordinator and N workers.
Declare connectors in etc/catalog/*.properties — e.g. iceberg.properties, postgres.properties.
Tune query.max-memory-per-node, task.writer-count and spill-to-disk for big joins.
Hook up auth: password file, LDAP, OAuth2, Kerberos or JWT; add impersonation rules.
Deploy Trino Gateway to route queries across multiple clusters or versions.
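
To make the catalog step concrete, here is a small sketch that renders two catalog files of the kind described above. The host names, database name and user are hypothetical placeholders, and the Iceberg example assumes a REST catalog. Check the property names against the connector documentation for the Trino version you run.

```python
from pathlib import Path


def render_properties(settings: dict) -> str:
    """Render a dict as key=value lines, the format used by catalog files."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


# Hypothetical catalogs: every host, database and user below is a placeholder.
catalogs = {
    "postgres": {
        "connector.name": "postgresql",
        "connection-url": "jdbc:postgresql://db.example.internal:5432/shop",
        "connection-user": "trino",
        # Resolved by Trino from an environment variable at startup.
        "connection-password": "${ENV:POSTGRES_PASSWORD}",
    },
    "iceberg": {
        "connector.name": "iceberg",
        "iceberg.catalog.type": "rest",
        "iceberg.rest-catalog.uri": "http://iceberg-rest.example.internal:8181",
    },
}


def write_catalogs(catalog_dir, catalogs):
    """Write one <name>.properties file per catalog and return the paths written."""
    directory = Path(catalog_dir)
    directory.mkdir(parents=True, exist_ok=True)
    written = []
    for name, settings in catalogs.items():
        path = directory / f"{name}.properties"
        path.write_text(render_properties(settings))
        written.append(path)
    return written
```

Calling write_catalogs("etc/catalog", catalogs) would produce etc/catalog/postgres.properties and etc/catalog/iceberg.properties. The file name, minus the extension, becomes the catalog name used in queries.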

Key Features

Interactive latency (seconds) over PB-scale lakes thanks to columnar pipelined execution.
First-class Apache Iceberg and Delta Lake support: time-travel, schema evolution, MERGE INTO.
Fault-tolerant execution with exchange-manager when queries run for hours.
Full ANSI SQL — window functions, CTEs, geospatial, JSON, array and map types.
Dynamic filtering, cost-based optimiser and adaptive join reordering.
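
Dynamic filtering is easier to see in miniature. The toy below is an illustration of the idea rather than Trino's code: the join keys collected from the small build side become a filter that is applied while scanning the large probe side, so rows that cannot match are dropped before the join. The table and column names are invented for the example.

```python
def join_with_dynamic_filter(build_rows, probe_rows, build_key, probe_key):
    """Hash join that prunes the probe side using keys seen on the build side."""
    # Build side: index the small table by its join key.
    build_index = {}
    for row in build_rows:
        build_index.setdefault(row[build_key], []).append(row)

    # Dynamic filter: the set of keys that can possibly match.
    dynamic_filter = set(build_index)

    # Probe side: skip rows whose key is not in the filter.
    scanned = [row for row in probe_rows if row[probe_key] in dynamic_filter]

    joined = [{**b, **p} for p in scanned for b in build_index[p[probe_key]]]
    return joined, len(scanned)


# Build side after a selective predicate such as country = 'DE'.
customers = [
    {"customer_id": 1, "country": "DE"},
    {"customer_id": 4, "country": "DE"},
]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 2},
    {"order_id": 102, "customer_id": 3},
    {"order_id": 103, "customer_id": 4},
    {"order_id": 104, "customer_id": 1},
]
joined, probed = join_with_dynamic_filter(customers, orders, "customer_id", "customer_id")
```

Here only three of the five order rows survive the filter. In a real engine the saving comes from pushing that filter into the table scan so the skipped data is never read.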

Comparison with Similar Tools

Presto (PrestoDB): the sibling project maintained at Meta; a similar engine under different governance.
Apache Spark SQL: strong for batch ETL; Trino typically gives lower latency for interactive BI.
Dremio: commercial lakehouse query engine with Arrow and reflections.
StarRocks / Doris: MPP databases; they store data, Trino does not.
BigQuery / Snowflake: managed warehouses; Trino is the open-source federated alternative.

FAQ

Q: Trino vs Presto?
A: Both descend from the same codebase. Trino is the original team's fork; Presto is Meta's continuation.

Q: Can Trino handle ETL as well as BI?
A: Yes, especially with fault-tolerant execution enabled for long queries.

Q: Do I need Hive Metastore?
A: For Hive and some Iceberg setups, yes. Alternatively, use an Iceberg REST catalog, Glue, Nessie or Unity.

Q: Is Trino free for commercial use?
A: Yes. It is licensed under Apache 2.0, and commercial support is available from Starburst and others.
