Introduction
xsv is a command-line toolkit for CSV that covers many of the everyday operations people use pandas for, at the speed of Rust and, for most commands, without loading the whole file into memory. It processes CSV files with millions of rows in seconds, providing operations for selection, filtering, joining, frequency counts, and statistics.
xsv was created by Andrew Gallant (also the creator of ripgrep) and has over 11,000 GitHub stars. It is a go-to tool for anyone who works with CSV data in the terminal and needs performance that awk, cut, and Python scripts struggle to match.
What xsv Does
xsv provides a suite of subcommands for CSV manipulation: headers (show column names), select (pick columns), search (filter rows by regex), sort, join (SQL-like joins between CSVs), stats (column statistics), frequency (value distributions), and more. Most of them stream rows through rather than loading the whole file, which is where much of the speed comes from.
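To make the streaming idea concrete, here is a minimal Python sketch of a combined search and select step that holds only one row in memory at a time. It illustrates the general technique and is not xsv's actual implementation; the function name `search_select` and the sample data are hypothetical.

```python
import csv
import io
import re
from typing import Iterable, Iterator, List


def search_select(rows: Iterable[List[str]], search_col: str,
                  pattern: str, keep: List[str]) -> Iterator[List[str]]:
    """Yield the trimmed header, then each matching row, one row at a time."""
    rows = iter(rows)
    header = next(rows)
    search_idx = header.index(search_col)
    keep_idx = [header.index(name) for name in keep]
    regex = re.compile(pattern)

    yield [header[i] for i in keep_idx]
    for row in rows:
        # Only the current row is held; earlier rows are already emitted or dropped.
        if regex.search(row[search_idx]):
            yield [row[i] for i in keep_idx]


# Hypothetical sample data. For a real file, pass csv.reader(open(path, newline="")).
SAMPLE = "product,revenue,region\nwidget,100,US\ngadget,250,EU\ngizmo,75,US\n"

result = list(search_select(csv.reader(io.StringIO(SAMPLE)),
                            "region", "US", ["product", "revenue"]))
print(result)
```

Because the function is a generator, memory use depends on the widest row rather than on the number of rows, which is the property that lets a streaming tool handle files larger than RAM.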
Architecture Overview
[CSV Input]
Stdin, file, or multiple files
|
[xsv Subcommands]
- select: pick columns
- search: filter rows by regex
- stats: min, max, mean, stddev
- sort: order by column
- join: SQL-like inner/outer join
- frequency: value distributions
- slice: row ranges
- split: split into chunks
- fmt: reformat delimiters and quoting
|
[Streaming Processing]
Processes rows without loading the entire file into memory
|
[CSV Output]
Stdout, file, or pipe
Self-Hosting & Configuration
# Data exploration workflow
# 1. Understand the data
xsv headers sales.csv
# date,product,category,revenue,quantity,region
xsv stats sales.csv | xsv table
# Shows type, min, max, mean, stddev for each column
# 2. Filter and select
xsv search -s region "US" sales.csv | xsv select product,revenue,quantity > us_sales.csv
# 3. Sort and slice
xsv sort -N -R -s revenue sales.csv | xsv slice -l 20 | xsv table
# Top 20 rows by revenue (-N sorts numerically; the default order is lexicographic)
# 4. Frequency analysis
xsv frequency -s category sales.csv | xsv table
# Shows value counts for category column
# 5. Join two CSVs
xsv join product sales.csv product_id products.csv > enriched.csv
# 6. Split large file
xsv split -s 10000 output_dir/ large_file.csv
# Creates chunks of 10,000 rows each
# 7. Count rows
xsv count sales.csv
# 8. Index for faster operations
xsv index sales.csv # creates sales.csv.idx
xsv slice -s 1000000 -l 100 sales.csv # 100 rows starting at row 1,000,000, fast thanks to the index
# Pipeline example
xsv search -s status "completed" orders.csv \
| xsv select customer_id,amount \
| xsv sort -N -R -s amount \
| xsv slice -l 10 \
| xsv table
Key Features
- Fast: written in Rust and built to process multi-million-row files in seconds
- Streaming: most commands work on files larger than available RAM
- Select: pick columns by name or index
- Search: filter rows by regex on any column
- Sort: sort by any column, lexicographically by default or numerically with -N
- Join: inner, outer, left, right joins between CSV files
- Stats: min, max, mean, stddev for all columns, plus median on request (--median or --everything)
- Frequency: value distribution counts for categorical columns
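The join feature listed above can be pictured as a hash join: index one input by its key, then stream the other input past that index. The sketch below is an illustration under that assumption, not a description of xsv's internals; `hash_join`, its `keep_unmatched_left` parameter, and the sample data are hypothetical names.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterator, List


def hash_join(left_text: str, left_key: str, right_text: str, right_key: str,
              keep_unmatched_left: bool = False) -> Iterator[List[str]]:
    """Inner join by default; a left join when keep_unmatched_left is True."""
    # Build an in-memory index of the right-hand input, keyed by the join column.
    right_rows = csv.reader(io.StringIO(right_text))
    right_header = next(right_rows)
    r_idx = right_header.index(right_key)
    index: Dict[str, List[List[str]]] = defaultdict(list)
    for row in right_rows:
        index[row[r_idx]].append(row)

    # Stream the left-hand input and look each key up in the index.
    left_rows = csv.reader(io.StringIO(left_text))
    left_header = next(left_rows)
    l_idx = left_header.index(left_key)
    yield left_header + right_header
    padding = [""] * len(right_header)
    for row in left_rows:
        matches = index.get(row[l_idx], [])
        for match in matches:
            yield row + match
        if not matches and keep_unmatched_left:
            yield row + padding


# Hypothetical sample data.
SALES = "product,revenue\nP1,100\nP2,250\nP9,75\n"
PRODUCTS = "product_id,name\nP1,Widget\nP2,Gadget\n"

inner = list(hash_join(SALES, "product", PRODUCTS, "product_id"))
left = list(hash_join(SALES, "product", PRODUCTS, "product_id",
                      keep_unmatched_left=True))
print(inner)
print(left)
```

In this design only the indexed side has to fit in memory, so it usually pays to put the smaller file on that side.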
Comparison with Similar Tools
| Feature | xsv | csvkit | Miller (mlr) | cut + awk | pandas (Python) |
|---|---|---|---|---|---|
| Language | Rust | Python | Go (C before version 6) | C | Python |
| Speed | Very Fast | Slow | Fast | Moderate | Moderate |
| Memory | Streaming | In-memory | Streaming | Streaming | In-memory |
| Formats | CSV only | CSV + more | CSV + JSON | Text only | Any format |
| Statistics | Built-in | Via csvstat | Built-in | Manual | Built-in |
| Joins | Yes | Yes | Yes | No | Yes |
| Best For | Large CSV processing | Python users | Multi-format | Simple tasks | Full analysis |
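The Memory and Statistics rows are linked: min, max, mean, and standard deviation can all be computed in a single pass with constant memory. The sketch below uses Welford's online algorithm to show how; it is an illustration rather than xsv's code, the name `streaming_stats` is hypothetical, and it reports the population standard deviation, which may differ from the convention a given tool uses.

```python
import csv
import io
import math
from typing import Dict, Iterable, List


def streaming_stats(rows: Iterable[List[str]], column: str) -> Dict[str, float]:
    """One-pass min, max, mean, and population stddev for a numeric column."""
    rows = iter(rows)
    header = next(rows)
    idx = header.index(column)

    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    lo = math.inf
    hi = -math.inf
    for row in rows:
        x = float(row[idx])
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # Welford update
        lo = min(lo, x)
        hi = max(hi, x)

    stddev = math.sqrt(m2 / count) if count else math.nan
    return {"count": count, "min": lo, "max": hi, "mean": mean, "stddev": stddev}


# Hypothetical sample data.
SAMPLE = "revenue\n2\n4\n4\n4\n5\n5\n7\n9\n"
stats = streaming_stats(csv.reader(io.StringIO(SAMPLE)), "revenue")
print(stats)
```

A median cannot be computed exactly this way, because it needs all values to be available at once, which is why median-style statistics tend to cost more memory than the ones shown here.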
FAQ
Q: xsv vs Miller (mlr): which should I choose? A: Choose xsv for pure CSV processing with maximum speed, and Miller for multi-format support (CSV, JSON, JSONL) and more transformation capabilities. xsv is generally faster on plain CSV; Miller is more versatile.
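Q: Can xsv handle files larger than RAM? A: For most operations, yes, because xsv streams rows and memory use stays small. A few operations are exceptions: sort, for example, reads all of the data into memory. An index created with "xsv index" does not change that. What it does is give slice fast random access, make count near-instant, and help stats, frequency, and split run in parallel.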
Q: How do I change the delimiter? A: Use the -d flag for input: "xsv stats -d '\t' data.tsv" reads a tab-separated file. To change the output delimiter, pipe the result through fmt, for example "xsv fmt -t '\t'" (-t is short for --out-delimiter).
Q: Can xsv replace pandas for data analysis? A: For simple operations (filter, select, sort, join, stats), xsv is faster and uses less memory. For complex analysis (pivot tables, groupby with custom aggregations, plotting), pandas is more capable.
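As a closing illustration of the indexing answer above, the sketch below shows why an index makes slicing fast: once the byte offset of every record is known, a reader can seek straight to row N instead of scanning from the top. This is a conceptual sketch only. The format of xsv's .idx file is not described here, the function names are hypothetical, and the code assumes no quoted field contains a newline.

```python
import csv
import io
from typing import BinaryIO, List


def build_offsets(f: BinaryIO) -> List[int]:
    """Record the byte offset at which each data record starts."""
    offsets = []
    f.seek(0)
    f.readline()  # skip the header row
    while True:
        pos = f.tell()
        if not f.readline():
            break
        offsets.append(pos)
    return offsets


def read_slice(f: BinaryIO, offsets: List[int], start: int, length: int) -> List[List[str]]:
    """Seek directly to record `start` and parse `length` records from there."""
    if start >= len(offsets):
        return []
    f.seek(offsets[start])
    wanted = min(length, len(offsets) - start)
    lines = [f.readline().decode("utf-8") for _ in range(wanted)]
    return list(csv.reader(lines))


# Hypothetical sample data: 1,000 records held in memory instead of on disk.
DATA = b"id,amount\n" + b"".join(b"%d,%d\n" % (i, i * 10) for i in range(1000))
f = io.BytesIO(DATA)

offsets = build_offsets(f)           # one full pass, done once
rows = read_slice(f, offsets, 500, 3)  # later lookups jump straight to the row
print(len(offsets), rows)
```

Building the offsets costs one full pass, after which both the record count (the length of the offset list) and any slice are available without rescanning the file.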
Sources
- GitHub: https://github.com/BurntSushi/xsv
- Created by Andrew Gallant (BurntSushi, also created ripgrep)
- License: Unlicense / MIT