Configs · April 13, 2026 · 1 min read

xsv — Fast CSV Toolkit Written in Rust

xsv is a blazing-fast command-line toolkit for working with CSV data. It provides indexing, slicing, searching, joining, aggregation, and statistics — processing millions of rows per second for data analysis, ETL pipelines, and CSV manipulation.

AI Open Source · Community
Quick Start

Try it first, then decide whether to dig deeper

# Install xsv
brew install xsv
# Or: cargo install xsv

# View CSV structure
xsv headers data.csv

# Show first 10 rows
xsv slice -l 10 data.csv | xsv table

# Select specific columns
xsv select name,email,age data.csv

# Search/filter rows (regex match; anchored so "inactive" is not matched)
xsv search -s status '^active$' data.csv

# Statistics for all columns
xsv stats data.csv | xsv table

# Sort by a column (numeric, descending)
xsv sort -N -R -s revenue data.csv
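
To try the commands above without a dataset at hand, a small sample file can be created first. This is only a sketch: the rows below are invented, and the column names are assumptions chosen to match the examples.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"

# Hypothetical sample data; names and numbers are invented.
cat > data.csv <<'EOF'
name,email,age,status,revenue
Ada,ada@example.com,36,active,1200
Bo,bo@example.com,41,inactive,90
Cy,cy@example.com,29,active,350
EOF

if command -v xsv >/dev/null 2>&1; then
  xsv count data.csv                     # counts data rows; the header is not counted
  xsv search -s status '^active$' data.csv \
    | xsv select name,revenue \
    | xsv table
else
  echo "xsv not found on PATH; install it first" >&2
fi
```

The search pattern is a regex, so an unanchored "active" would also match the "inactive" row in this sample; the anchors restrict it to an exact match.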

Introduction

xsv is a command-line toolkit that covers the everyday CSV operations people often reach for pandas to do (selection, filtering, joining, sorting, and summary statistics) as a single fast Rust binary. Most subcommands stream rows instead of loading the whole file into memory, so CSV files with millions of rows are processed in seconds.

With over 11,000 GitHub stars, xsv was created by Andrew Gallant (also the creator of ripgrep). It is the go-to tool for anyone who works with CSV data in the terminal and needs performance that awk, cut, and Python scripts cannot match.

What xsv Does

xsv provides a suite of subcommands for CSV manipulation: headers (show column names), select (pick columns), search (filter rows by regex), sort, join (SQL-like joins between CSVs), stats (column statistics), frequency (value distributions), and more — all optimized for speed with streaming processing.

Architecture Overview

[CSV Input]
Stdin, file, or multiple files
        |
   [xsv Subcommands]
+-------+-------+-------+
|       |       |       |
[select] [search] [stats]
Pick     Filter   Min, max,
columns  by regex mean, stdev

[sort]   [join]   [frequency]
Order    SQL-like  Value
by column inner/   distributions
         outer join

[slice]  [split]  [fmt]
Row      Split    Reformat
ranges   into     delimiter
         chunks   alignment
        |
   [Streaming Processing]
   Processes rows without
   loading entire file
   into memory
        |
[CSV Output]
Stdout, file, or pipe
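
The flow in the diagram can be sketched with a pipeline that never touches a file on disk. The gen helper below is hypothetical and only produces throwaway data; each xsv stage reads from stdin because no input file is given.

```shell
# gen is a hypothetical helper that prints a 1,000-row CSV to stdout.
gen() {
  echo "id,value"
  seq 1 1000 | awk '{ print $1 "," ($1 * 2) }'
}

if command -v xsv >/dev/null 2>&1; then
  # No input file is given, so each xsv stage reads from stdin.
  gen | xsv search -s id '7$' | xsv select value | xsv slice -l 3
else
  echo "xsv not found on PATH; install it first" >&2
fi
```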

Self-Hosting & Configuration

# Data exploration workflow

# 1. Understand the data
xsv headers sales.csv
# date,product,category,revenue,quantity,region

xsv stats sales.csv | xsv table
# Shows type, min, max, mean, stddev for each column

# 2. Filter and select (the regex is anchored to match "US" exactly)
xsv search -s region '^US$' sales.csv | xsv select product,revenue,quantity > us_sales.csv

# 3. Sort and slice
xsv sort -N -R -s revenue sales.csv | xsv slice -l 20 | xsv table
# Top 20 rows by revenue (-N compares values as numbers)

# 4. Frequency analysis
xsv frequency -s category sales.csv | xsv table
# Shows value counts for category column

# 5. Join two CSVs
xsv join product sales.csv product_id products.csv > enriched.csv

# 6. Split large file
xsv split -s 10000 output_dir/ large_file.csv
# Creates chunks of 10,000 rows each

# 7. Count rows
xsv count sales.csv

# 8. Index for faster operations
xsv index sales.csv  # creates sales.csv.idx
xsv slice -s 1000000 -l 100 sales.csv  # fast random access: 100 rows from offset 1,000,000

# Pipeline example
xsv search -s status "completed" orders.csv \
  | xsv select customer_id,amount \
  | xsv sort -N -R -s amount \
  | xsv slice -l 10 \
  | xsv table
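
When ranking by a numeric column, note that xsv sort compares values as strings unless -N is given. A minimal sketch with made-up values shows the difference; the file name and columns are assumptions for illustration.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"
printf 'item,amount\na,9\nb,100\nc,25\n' > amounts.csv

if command -v xsv >/dev/null 2>&1; then
  xsv sort -s amount -R amounts.csv      # compared as strings: 9, 25, 100
  xsv sort -s amount -N -R amounts.csv   # compared as numbers: 100, 25, 9
else
  echo "xsv not found on PATH; install it first" >&2
fi
```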

Key Features

  • Blazing Fast: processes millions of rows per second in Rust
  • Streaming: most commands work on files larger than available RAM
  • Select: pick columns by name or index
  • Search: filter rows by regex on any column
  • Sort: sort by any column, lexicographically by default or numerically with -N
  • Join: inner, left, right, and full outer joins between CSV files
  • Stats: min, max, mean, and stddev for all columns, with median available via --median or --everything
  • Frequency: value distribution counts for categorical columns
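
As an illustration of the join feature, the sketch below contrasts the default inner join with --left. The two files and their contents are invented for the example.

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"
printf 'product,revenue\nP1,10\nP2,20\nP9,5\n' > sales.csv
printf 'product_id,title\nP1,Widget\nP2,Gadget\n' > products.csv

if command -v xsv >/dev/null 2>&1; then
  # Inner join (default): only P1 and P2 have a match in products.csv.
  xsv join product sales.csv product_id products.csv | xsv table
  # Left join: P9 is kept, with empty product_id and title fields.
  xsv join --left product sales.csv product_id products.csv | xsv table
else
  echo "xsv not found on PATH; install it first" >&2
fi
```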

Comparison with Similar Tools

| Feature | xsv | csvkit | Miller (mlr) | cut + awk | pandas (Python) |
|---|---|---|---|---|---|
| Language | Rust | Python | Go (C before v6) | C (coreutils) | Python |
| Speed | Very fast | Slow | Fast | Moderate | Moderate |
| Memory | Streaming (most commands) | In-memory | Streaming | Streaming | In-memory |
| Formats | CSV only | CSV + more | CSV + JSON | Text only | Any format |
| Statistics | Built-in | Via csvstat | Built-in | Manual | Built-in |
| Joins | Yes | Yes | Yes | No | Yes |
| Best for | Large CSV processing | Python users | Multi-format | Simple tasks | Full analysis |

FAQ

Q: xsv vs Miller (mlr) — which should I choose? A: xsv for pure CSV processing with maximum speed. Miller for multi-format support (CSV, JSON, JSONL) and more transformation capabilities. xsv is faster; Miller is more versatile.

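Q: Can xsv handle files larger than RAM? A: Mostly. Commands such as select, search, slice, and count stream rows and use little memory regardless of file size. Some commands must hold data in memory: sort reads the entire file, and stats needs extra memory for median, mode, and cardinality. An index created with "xsv index" does not change that, but it speeds up commands that benefit from random access, such as slice and count.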
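
A minimal sketch of the indexing workflow, using a generated file as a stand-in for a large dataset (the file name and contents are assumptions):

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"
{ echo "id,value"; seq 1 1000 | awk '{ print $1 "," ($1 * 2) }'; } > big.csv

if command -v xsv >/dev/null 2>&1; then
  xsv index big.csv              # writes big.csv.idx next to the data file
  xsv count big.csv              # can use the index instead of scanning every row
  xsv slice -s 900 -l 3 big.csv  # 3 records starting at zero-based offset 900
else
  echo "xsv not found on PATH; install it first" >&2
fi
```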

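Q: How do I change the delimiter? A: Use the -d flag to set the input delimiter, for example "xsv stats -d '\t' data.tsv" for tab-separated files. Output is comma-separated by default; to write a different delimiter, pipe the result through "xsv fmt -t '\t'".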
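
A short sketch of delimiter handling, assuming the -d input flag and the fmt subcommand's -t (--out-delimiter) flag; the sample file is invented:

```shell
# Work in a scratch directory so no existing files are overwritten.
cd "$(mktemp -d)"
printf 'name\tage\nAda\t36\nBo\t41\n' > people.tsv

if command -v xsv >/dev/null 2>&1; then
  # -d sets the input delimiter; the output is comma-separated.
  xsv select -d '\t' name,age people.tsv
  # fmt -t sets the output delimiter, here writing semicolons.
  xsv fmt -d '\t' -t ';' people.tsv
else
  echo "xsv not found on PATH; install it first" >&2
fi
```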

Q: Can xsv replace pandas for data analysis? A: For simple operations (filter, select, sort, join, stats), xsv is faster and uses less memory. For complex analysis (pivot tables, groupby with custom aggregations, plotting), pandas is more capable.
