# xsv: Fast CSV Toolkit Written in Rust

> xsv is a fast command-line toolkit for working with CSV data. It provides indexing, slicing, searching, joining, frequency counts, and statistics, and can process millions of rows per second for data analysis, ETL pipelines, and CSV manipulation.

## Quick Use

```bash
# Install xsv
brew install xsv
# Or: cargo install xsv

# View CSV structure
xsv headers data.csv

# Show first 10 rows
xsv slice -l 10 data.csv | xsv table

# Select specific columns
xsv select name,email,age data.csv

# Search/filter rows (the pattern is a regex, so "active" also matches "inactive")
xsv search -s status "active" data.csv

# Statistics for all columns
xsv stats data.csv | xsv table

# Sort by a column: numeric (-N), descending (-R)
xsv sort -N -R -s revenue data.csv
```

## Introduction

xsv is a command-line toolkit for CSV that covers many of the everyday tasks people reach for pandas to do (selecting, filtering, joining, summarizing), at the speed of Rust and, for most commands, without loading the whole file into memory. It processes CSV files with millions of rows in seconds.

xsv was created by Andrew Gallant (also the creator of ripgrep) and has over 11,000 GitHub stars. It is a popular choice for working with CSV data in the terminal when awk, cut, or ad hoc Python scripts are too slow or too awkward.

## What xsv Does

xsv provides a suite of subcommands for CSV manipulation: `headers` (show column names), `select` (pick columns), `search` (filter rows by regex), `sort`, `join` (SQL-like joins between CSVs), `stats` (column statistics), `frequency` (value distributions), and more. Most of them stream rows from input to output, so they compose well in shell pipelines.
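To try the subcommands described above without a real dataset, the following minimal sketch writes a tiny CSV to a temporary directory and chains two commands on it. The file name `people.csv` and its contents are made up for illustration, and the xsv calls are skipped if xsv is not on the `PATH`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sample data, written to a throwaway directory.
workdir="$(mktemp -d)"
cat > "$workdir/people.csv" <<'EOF'
name,team,age
Ada,core,36
Grace,tools,45
Linus,core,28
EOF

if command -v xsv >/dev/null 2>&1; then
  # Count data rows (the header row is not counted).
  xsv count "$workdir/people.csv"

  # Keep rows whose team column matches the regex ^core$,
  # then keep only the name and age columns.
  xsv search -s team '^core$' "$workdir/people.csv" | xsv select name,age
else
  echo "xsv not found on PATH; skipping the xsv commands" >&2
fi
```

Anchoring the pattern with `^` and `$` is a deliberate choice here: `xsv search` takes a regular expression, so an unanchored pattern would also match longer values that merely contain it.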
## Architecture Overview

```
            [CSV Input]
   Stdin, file, or multiple files
                 |
         [xsv Subcommands]
    +------------+------------+
    |            |            |
[select]     [search]     [stats]
Pick         Filter       Min, max,
columns      by regex     mean, stddev

[sort]       [join]       [frequency]
Order        SQL-like     Value
by column    inner/       distributions
             outer join

[slice]      [split]      [fmt]
Row          Split into   Reformat delimiter
ranges       chunks       and quoting
                 |
      [Streaming Processing]
  Most commands process rows without
  loading the entire file into memory
                 |
           [CSV Output]
     Stdout, file, or pipe
```

## Self-Hosting & Configuration

```bash
# Data exploration workflow

# 1. Understand the data
xsv headers sales.csv
# date,product,category,revenue,quantity,region
xsv stats sales.csv | xsv table
# Shows type, min, max, mean, stddev for each column

# 2. Filter and select
xsv search -s region "US" sales.csv | xsv select product,revenue,quantity > us_sales.csv

# 3. Sort and slice (-N sorts numerically, -R reverses the order)
xsv sort -N -R -s revenue sales.csv | xsv slice -l 20 | xsv table
# Top 20 rows by revenue

# 4. Frequency analysis
xsv frequency -s category sales.csv | xsv table
# Shows value counts for category column

# 5. Join two CSVs (sales.product matches products.product_id)
xsv join product sales.csv product_id products.csv > enriched.csv

# 6. Split large file
xsv split -s 10000 output_dir/ large_file.csv
# Creates chunks of 10,000 rows each

# 7. Count rows
xsv count sales.csv

# 8. Index for faster operations
xsv index sales.csv   # creates sales.csv.idx
xsv slice -s 1000000 -l 100 sales.csv   # fast random access via the index

# Pipeline example: top 10 completed orders by amount
xsv search -s status "completed" orders.csv \
  | xsv select customer_id,amount \
  | xsv sort -N -R -s amount \
  | xsv slice -l 10 \
  | xsv table
```

## Key Features

- **Fast**: processes millions of rows per second; written in Rust
- **Streaming**: most commands work on files larger than available RAM (`sort` is an exception and reads all data into memory)
- **Select**: pick columns by name or index
- **Search**: filter rows by regex on any column
- **Sort**: sort by any column, lexicographically by default or numerically with `-N`
- **Join**: inner, left, right, and full outer joins between CSV files
- **Stats**: sum, min, max, mean, and stddev by default; median, mode, and cardinality with `--everything`
- **Frequency**: value distribution counts for categorical columns

## Comparison with Similar Tools

| Feature | xsv | csvkit | Miller (mlr) | cut + awk | pandas (Python) |
|---|---|---|---|---|---|
| Language | Rust | Python | Go (C before version 6) | C (coreutils) | Python |
| Speed | Very Fast | Slow | Fast | Moderate | Moderate |
| Memory | Streaming | In-memory | Streaming | Streaming | In-memory |
| Formats | CSV only | CSV + more | CSV + JSON | Text only | Any format |
| Statistics | Built-in | Via csvstat | Built-in | Manual | Built-in |
| Joins | Yes | Yes | Yes | No | Yes |
| Best For | Large CSV processing | Python users | Multi-format | Simple tasks | Full analysis |

## FAQ

**Q: xsv vs Miller (mlr): which should I choose?**
A: xsv for pure CSV processing with maximum speed. Miller for multi-format support (CSV, JSON, JSONL) and more transformation capabilities. xsv is faster; Miller is more versatile.

**Q: Can xsv handle files larger than RAM?**
A: Mostly. xsv streams rows for most commands, so memory use stays small regardless of file size. Some commands are exceptions: `sort` reads all of the data into memory, and `stats` does too when you ask for median, mode, or cardinality. Creating an index with `xsv index` does not change that, but it makes `slice` and `count` much faster and lets `stats`, `frequency`, and `split` use multiple threads.

**Q: How do I change the delimiter?**
A: Use the `-d` flag for input: `xsv stats -d '\t' data.tsv` reads a tab-separated file. Files with a `.tsv` or `.tab` extension are treated as tab-separated automatically. Commands write comma-separated output by default; to emit a different delimiter, pipe the result through `xsv fmt -t '\t'`.

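As a hedged illustration of the delimiter question above, this sketch reads a tab-separated file with `-d` and then converts comma-separated data back to tabs with `xsv fmt -t` (the short form of `--out-delimiter`). The file names and contents are hypothetical, and the xsv calls are skipped if xsv is not installed.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir="$(mktemp -d)"

# Hypothetical tab-separated sample; printf expands \t into real tab characters.
printf 'name\tcity\nAda\tLondon\nGrace\tNew York\n' > "$workdir/people.tsv"

if command -v xsv >/dev/null 2>&1; then
  # Read tab-separated input explicitly; the output is comma-separated.
  xsv select -d '\t' name,city "$workdir/people.tsv" > "$workdir/people.csv"

  # Convert the comma-separated file back to tab-separated output.
  xsv fmt -t '\t' "$workdir/people.csv"
else
  echo "xsv not found on PATH; skipping the xsv commands" >&2
fi
```
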
**Q: Can xsv replace pandas for data analysis?**
A: For simple operations (filter, select, sort, join, stats), xsv is faster and uses less memory. For complex analysis (pivot tables, groupby with custom aggregations, plotting), pandas is more capable.

## Sources

- GitHub: https://github.com/BurntSushi/xsv
- Created by Andrew Gallant (BurntSushi, also created ripgrep)
- License: Unlicense / MIT

---

Source: https://tokrepo.com/en/workflows/82f0e8a4-3745-11f1-9bc6-00163e2b0d79

Author: AI Open Source