Scripts · April 15, 2026 · 1 min read

Miller — Like awk, sed, cut, join, and sort for CSV, TSV, JSON

Miller (mlr) is a multi-purpose command-line tool for processing name-indexed data such as CSV, TSV, JSON, JSON Lines, and positionally-indexed records, blending awk-style expressions with pandas-like DataFrame operations.

Introduction

Miller (mlr) is a single-binary Go tool that lets you treat CSV/TSV/JSON as first-class structured data from the shell. Instead of juggling cut, awk, and jq, Miller provides a unified verb grammar — cat, head, tail, filter, put, stats1, join, reshape — that works across formats and converts between them with a flag change.
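
As a small illustration of converting "with a flag change", the sketch below runs the same verb on one CSV file and only swaps the output flag. The file example.csv and its columns are invented for this example; --icsv, --ojson, and --opprint are Miller's standard format flags.

```shell
# Hypothetical sample data, created only for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/example.csv" <<'EOF'
color,shape,quantity
red,square,3
blue,circle,5
red,circle,8
EOF

# Same verb (cat), different output format: only the flag changes.
mlr --icsv --ojson cat "$workdir/example.csv"     # CSV in, JSON out
mlr --icsv --opprint cat "$workdir/example.csv"   # CSV in, aligned table out
```

The verb and the input file stay the same in both commands, which is why a pipeline written against CSV can usually be pointed at another format without rewriting its logic.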

What Miller Does

  • Reads/writes CSV, TSV, JSON, JSON Lines, PPRINT, NIDX, DKVP, Markdown.
  • Provides verbs such as filter, put, stats1, join, sort, tac, and reshape (wide-to-long and long-to-wide).
  • Supports a DSL with variables, functions, control flow, and regex.
  • Streams data row-by-row — handles files larger than RAM.
  • Operates as a UNIX filter, so it composes naturally with pipes.
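
A few of the verbs and a little of the DSL from the list above can be sketched as follows. The file orders.csv and its column names are made up for illustration; the verbs and flags are standard Miller ones.

```shell
# Hypothetical sample data, created only for this illustration.
workdir=$(mktemp -d)
cat > "$workdir/orders.csv" <<'EOF'
region,item,units,price
east,pen,10,1.50
west,pen,4,1.50
east,ink,2,7.25
west,ink,6,7.25
EOF

# filter keeps only the records for which the expression is true.
mlr --icsv --opprint filter '$units > 3' "$workdir/orders.csv"

# put runs DSL statements per record; here it adds a field using control flow.
mlr --icsv --opprint put 'if ($units >= 5) { $size = "bulk" } else { $size = "small" }' "$workdir/orders.csv"

# stats1 aggregates one field (-f), grouped by another (-g).
mlr --icsv --opprint stats1 -a sum,mean -f units -g region "$workdir/orders.csv"

# sort -nr sorts numerically in descending order.
mlr --icsv --opprint sort -nr units "$workdir/orders.csv"
```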

Architecture Overview

Miller parses the input stream into record objects (ordered maps of field → value), passes them through a verb chain, and emits them in the chosen output format. Verbs are stackable; the DSL compiles once and runs per record. There is no intermediate DataFrame — memory is constant for most operations except sort/join.
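
The record-through-verb-chain flow can be seen in a minimal sketch using Miller's default DKVP (key=value) format. The field names a, b, and c are arbitrary; verbs are chained with the keyword then.

```shell
# Each input line becomes one record: an ordered map of field name to value.
# put appends field c to each record; filter then drops records where c <= 3.
printf 'a=1,b=2\na=3,b=4\n' | mlr put '$c = $a + $b' then filter '$c > 3'
# prints: a=3,b=4,c=7
```

The new field appears last because records keep insertion order, and each record passes through the whole chain before the next one is read, so nothing accumulates in memory for verbs like these.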

Self-Hosting & Configuration

  • Install via Homebrew, apt, dnf, Chocolatey, or download a static Go binary from the GitHub releases page.
  • Zero config; behavior driven by flags: --icsv --ojson converts CSV→JSON.
  • Put default flags (for example, preferred input and output formats) in .mlrrc to shorten repeated commands.
  • Ships as a single binary, so it can be packaged as an AWS Lambda layer for data prep in serverless ETL.
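
A hedged sketch of the .mlrrc idea: the file holds default flags, one per line, and the MLRRC environment variable can point Miller at a specific file. The scratch paths and the scores.csv data below are invented for illustration, and the example deliberately avoids touching a real ~/.mlrrc.

```shell
# Written to a scratch directory so an existing ~/.mlrrc is left untouched.
workdir=$(mktemp -d)
cat > "$workdir/mlrrc" <<'EOF'
# Default flags, one per line; the leading dashes are optional.
icsv
opprint
EOF

printf 'name,score\nann,9\nbob,7\n' > "$workdir/scores.csv"

# With MLRRC set, Miller loads this file and the format flags can be omitted.
MLRRC="$workdir/mlrrc" mlr sort -nr score "$workdir/scores.csv"
```

Flags given on the command line still take precedence over the defaults in the file, so a one-off --ojson works without editing it.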

Key Features

  • One tool for CSV/TSV/JSON/DKVP/PPRINT — replaces 4–5 utilities.
  • Streaming architecture with constant memory for most verbs.
  • DSL rich enough for regex, dates, JSON paths, higher-order functions.
  • tac, nest, unsparsify, reshape cover edge-case transforms.
  • Written in Go: single static binary, no runtime dependencies.

Comparison with Similar Tools

  • csvkit: Python-based, with a separate command for each operation; slower on big files.
  • xsv: Rust CSV tool; very fast but CSV-only and no DSL.
  • jq: JSON-only; unmatched for JSON but has no native CSV reader.
  • q: runs SQL over CSV/TSV; great for SQL fans but no streaming reshape.
  • duckdb CLI: columnar SQL; heavier for small one-off pipelines.

FAQ

Q: Can Miller handle multi-GB files? A: Yes. Streaming verbs use constant memory, but sort buffers all records, and join in its default unsorted mode holds the left file in memory.
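
As a sketch of where buffering happens in a join: in Miller's default (unsorted) join mode, the file named with -f is the left file and is loaded into memory, while the main input is streamed. Passing the smaller file with -f therefore keeps memory use low. The files users.csv and payments.csv are hypothetical.

```shell
# Hypothetical sample data, created only for this illustration.
workdir=$(mktemp -d)
printf 'id,name\n1,ann\n2,bob\n' > "$workdir/users.csv"
printf 'id,amount\n2,40\n1,15\n2,5\n' > "$workdir/payments.csv"

# -j names the join field; -f names the left file (buffered in memory).
# The trailing file argument is the right input and is streamed.
mlr --icsv --opprint join -j id -f "$workdir/users.csv" "$workdir/payments.csv"
```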

Q: Is the DSL Turing-complete? A: Effectively yes — variables, functions, loops, conditionals.

Q: Will it infer column types? A: Yes, per value rather than per column: values that look like numbers are treated as int or float, and everything else is a string. Use asserting_int to enforce a type.
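
A minimal sketch of checking inferred types, using the DSL functions typeof and asserting_int. The column name x and the added fields are arbitrary choices for this example.

```shell
# typeof reports the type Miller inferred for each value of x.
printf 'x\n3\n3.5\nabc\n' | mlr --icsv --ojson put '$type = typeof($x)'

# asserting_int returns its argument when it is an int, so arithmetic proceeds.
printf 'x\n3\n4\n' | mlr --icsv --ojson put '$y = asserting_int($x) + 1'

# On a non-integer value the assertion fails and mlr exits with an error.
printf 'x\n3.5\n' | mlr --icsv --ojson put '$y = asserting_int($x)' \
  || echo "asserting_int rejected a non-integer value"
```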

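Q: Does Miller support Parquet? A: Not natively. Pair it with the duckdb CLI: convert Parquet to CSV before Miller reads it, or have Miller write CSV with --ocsv for another tool to turn into Parquet.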
