dlt — Data Load Tool for Python ELT Pipelines
What it is
dlt (data load tool) is an open-source Python library that simplifies building ELT pipelines. You define a data source as a Python generator or function, choose a destination (BigQuery, Snowflake, DuckDB, PostgreSQL, and others), and dlt handles schema inference, incremental loading, data normalization, and state management automatically.
dlt targets data engineers and analysts who want to build production data pipelines in pure Python without learning a new framework or YAML DSL. It is lightweight, embeddable, and works in scripts, notebooks, and orchestrators alike.
How it saves time or tokens
dlt eliminates the boilerplate of data loading: schema creation, type mapping, incremental state tracking, and nested JSON flattening. A pipeline that would take hundreds of lines with raw SQL and API calls becomes a few lines of Python. The automatic schema inference means you do not need to pre-define table schemas; dlt creates and evolves them based on the data it sees.
For AI workflows, dlt makes it easy to load API responses from LLM providers, vector databases, or analytics services into a data warehouse for analysis and reporting.
How to use
- Install dlt with your destination: `pip install "dlt[bigquery]"` (or `dlt[duckdb]`, `dlt[snowflake]`, etc.).
- Define a source function that yields data. Use the `@dlt.resource` decorator to mark it as a loadable resource.
- Create a pipeline, connect it to your destination, and run it. dlt infers the schema, creates tables, and loads the data.
Example
```python
import dlt
import requests

@dlt.resource(write_disposition='merge', primary_key='id')
def github_issues():
    response = requests.get(
        'https://api.github.com/repos/dlt-hub/dlt/issues',
        params={'state': 'open', 'per_page': 100}
    )
    yield response.json()

pipeline = dlt.pipeline(
    pipeline_name='github_pipeline',
    destination='duckdb',
    dataset_name='github_data'
)

load_info = pipeline.run(github_issues)
print(load_info)
```
This pipeline fetches GitHub issues, creates a DuckDB table with an inferred schema, and merges new data on subsequent runs using the `id` primary key.
Related on TokRepo
- AI tools for database — Data infrastructure and database tools
- Automation tools — Pipeline orchestration and automation
Common pitfalls
- Schema inference works well for consistent data shapes. Highly variable JSON structures may produce wide tables with many nullable columns. Use `@dlt.resource` column hints to control which columns are created.
- Incremental loading requires a cursor field (such as `updated_at`). Without one, dlt loads all data on each run. Pass `dlt.sources.incremental('updated_at')` as a resource argument to enable incremental behavior.
- dlt pipelines run in-process by default. For large-scale production workloads, run them inside an orchestrator (Dagster, Airflow, Prefect) for scheduling, monitoring, and retry handling.
Frequently Asked Questions
Which destinations does dlt support?
dlt supports BigQuery, Snowflake, DuckDB, PostgreSQL, Redshift, Databricks, MotherDuck, Synapse, filesystem (Parquet/CSV), and others. Each destination is installed as a separate Python package and handles connection management, schema creation, and data type mapping.
How does dlt handle schema changes?
dlt detects schema changes automatically. New columns are added to existing tables, and column type changes are handled according to configurable evolution policies. You can choose to discard new columns, evolve the schema, or raise an error on schema drift.
Does dlt support incremental loading?
Yes. Use the `incremental` parameter on a resource to specify a cursor field (e.g., `updated_at`). dlt tracks the last loaded value and only fetches new records on subsequent runs. This works with both API sources and database extractions.
How does dlt compare to Airbyte?
Airbyte is a platform with pre-built connectors and a web UI; dlt is a Python library where you write source logic in code. dlt is more flexible and lightweight but requires writing Python, while Airbyte provides 300+ ready-made connectors with no coding needed.
Can I use dlt in notebooks?
Yes. dlt is designed to work in notebooks: you can define sources, run pipelines, and inspect results interactively. The DuckDB destination is especially convenient for notebook workflows since it requires no external database setup.