This guide will help you get the best performance out of qsv for your data analysis workflows.
Think of indexing like creating a table of contents for your CSV files. It's the single most important thing you can do to improve performance. Here's why you want to use it:
- Makes slicing data nearly instant
- Gives you immediate row counts
- Enables parallel processing for commands like stats, frequency, and sample
- Adds random access capabilities for advanced features
- Takes very little time to create (example: a 520MB file with 1 million rows takes less than half a second)
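
As a rough illustration of the benefits listed above, the sketch below indexes a file and then runs two commands that can take advantage of the index. The file name data.csv, the helper name index_and_peek, and the row offsets are placeholders chosen for this example, not values from the guide.

```shell
#!/bin/sh
# Sketch only: "data.csv" is a placeholder, not a file from this guide.

# index_and_peek FILE: build an index for FILE, then run two commands
# that benefit from it. Returns 1 if qsv or FILE is missing.
index_and_peek() {
    file=$1
    command -v qsv >/dev/null 2>&1 || return 1
    [ -f "$file" ] || return 1

    qsv index "$file"                        # writes FILE.idx next to FILE
    qsv count "$file"                        # row count can come from the index
    qsv slice --start 1000 --len 5 "$file"   # seek to row 1000, print 5 rows
}

index_and_peek data.csv || echo "skipped or failed: check qsv and data.csv" >&2
```

If the file changes after indexing, the index should be rebuilt so that it matches the data again.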
Quick Setup:
# Automatically index files larger than 10MB
export QSV_AUTOINDEX_SIZE=10000000

The stats cache is qsv's secret weapon for fast data analysis. When you run the stats command, qsv saves detailed information about your data that other commands can use to work smarter:
- Makes frequency tables faster by knowing which columns have unique values
- Helps create accurate JSON and SQL schemas without repeated analysis
- Enables smart pivoting by automatically choosing the right aggregation functions
- Speeds up data sampling and comparison operations
Pro Tip: Always run stats with the --stats-jsonl option on your frequently used datasets to create this cache. Alternatively, set QSV_STATSCACHE_MODE to "force" or "auto".
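
A minimal sketch of the tip above, assuming a placeholder file named data.csv. The helper name build_stats_cache is hypothetical. The printed statistics are discarded because only the cache side effect is of interest here.

```shell
#!/bin/sh
# Sketch only: "data.csv" and build_stats_cache are illustrative names.

# build_stats_cache FILE: run stats once so later commands can reuse the result.
# Returns 1 if qsv or FILE is missing.
build_stats_cache() {
    file=$1
    command -v qsv >/dev/null 2>&1 || return 1
    [ -f "$file" ] || return 1

    # Discard the printed stats; the cache is written as a side effect.
    qsv stats --stats-jsonl "$file" >/dev/null
}

build_stats_cache data.csv || echo "skipped or failed: check qsv and data.csv" >&2

# Cache-aware commands can then be run as usual, for example:
#   qsv frequency data.csv
#   qsv schema data.csv
```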
qsv is designed to handle large files efficiently, but some operations need to load entire files into memory. Here's what you need to know:
Commands that need full memory loading (marked with 🤯):
- dedup (unless using --sorted)
- reverse
- sort
- stats (for advanced statistics)
- table
- transpose
Memory-intensive commands (marked with 😣):
- frequency
- schema
- tojsonl
qsv will automatically prevent out-of-memory crashes by checking your system's resources before running these commands.
Many commands automatically use parallel processing when possible:
- With index: count, stats, frequency, sample, schema, split, tojsonl
- Without index: apply, dedup, diff, sort, sqlp, and others
qsv automatically detects your CPU cores and uses them appropriately.
- Always index large files: it's fast and makes everything else faster
- Use the stats cache: run stats --stats-jsonl on your important datasets
- For very large files:
  - Use extsort instead of sort
  - Use extdedup instead of dedup
  - Consider using the --memcheck option for memory-intensive operations
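
The large-file advice above might look like the following in practice. This is a sketch under assumptions: big.csv, the output file names, and the helper sort_and_dedup_large are placeholders, and the commands are shown with their default options only.

```shell
#!/bin/sh
# Sketch only: file names and the helper name are illustrative.

# sort_and_dedup_large FILE: use the external (on-disk) variants instead of
# the in-memory sort and dedup. Returns 1 if qsv or FILE is missing.
sort_and_dedup_large() {
    src=$1
    command -v qsv >/dev/null 2>&1 || return 1
    [ -f "$src" ] || return 1

    qsv extsort "$src" "$src.sorted.csv"      # external sort, input then output
    qsv extdedup "$src" "$src.deduped.csv"    # external dedup, input then output
}

sort_and_dedup_large big.csv || echo "skipped or failed: check qsv and big.csv" >&2

# If the in-memory command is still preferred, --memcheck asks qsv to verify
# that the file fits in memory first, for example:
#   qsv sort --memcheck big.csv --output sorted.csv
```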
If you need to fine-tune performance further:
- Buffer sizes can be adjusted:
    # Adjust read/write buffer sizes (in bytes)
    export QSV_RDR_BUFFER_CAPACITY=131072  # Default 128KB
    export QSV_WTR_BUFFER_CAPACITY=524288  # Default 512KB
- Control parallel processing:
    # Cap the number of parallel jobs if you need the system for other
    # CPU-intensive tasks. Otherwise qsv uses ALL available CPU cores,
    # which is fine if you're only doing casual computing tasks.
    export QSV_MAX_JOBS=4
- Memory safety limits:
    # Adjust memory safety margin (10-90%, default 20)
    # Modern operating systems are very smart in dynamically allocating
    # memory, so this is just a safeguard
    export QSV_FREEMEMORY_HEADROOM_PCT=10
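
Rather than hard-coding QSV_MAX_JOBS, one option is to derive it from the machine's core count and leave a few cores free. The sketch below is one way to do that. The helper jobs_for and the choice to reserve 2 cores are assumptions made for this example, not qsv defaults.

```shell
#!/bin/sh
# Sketch only: jobs_for and the reserve of 2 cores are illustrative choices.

# jobs_for TOTAL RESERVED: print TOTAL minus RESERVED, but never less than 1.
jobs_for() {
    total=$1
    reserved=$2
    jobs=$((total - reserved))
    if [ "$jobs" -lt 1 ]; then
        jobs=1
    fi
    echo "$jobs"
}

# Number of online cores; fall back to 1 if getconf cannot report it.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

QSV_MAX_JOBS=$(jobs_for "$cores" 2)
export QSV_MAX_JOBS
echo "QSV_MAX_JOBS=$QSV_MAX_JOBS"
```

Putting this in a shell profile keeps the setting in step with the hardware when the same configuration is shared across machines.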
For most users, the default settings will work well. These advanced options are here if you need to optimize for specific scenarios.