Converting Arrow files has never been silkier...
Silk Chiffon is a blazingly fast, memory-efficient CLI tool for converting between the Apache Arrow IPC, Parquet, and Vortex columnar data formats. Written in Rust for maximum performance.
Like its namesake fabric -- light, flowing, and effortlessly elegant -- this tool makes data transformations silky smooth.
- ⚡ Lightning Fast: Built with Rust for native performance.
- 🤹🏻‍♂️ Multi-Format Support: Convert to/from Arrow IPC (file/stream), Parquet, and Vortex.
- 💪 Partitioning: Partition data into multiple files based on column values.
- 🔀 Merging: Merge data from multiple files into a single file.
- 🧠 Smart Processing: Sort, compress, filter with SQL, and optimize your data on-the-fly.
- 📦 Re-cast column types: Use SQL to convert input columns to right-sized output columns and even add new ones.
- 🤏🏻 Memory Efficient: Configurable batch processing for huge datasets (see the example after this list).
- ⚙️ Rich Configuration: Fine-tune many aspects of your conversions.
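For the batch processing mentioned above, the record batch size options from the reference below can bound memory use. A minimal sketch (file names and the batch size are illustrative):

```
# convert a large Parquet file to Arrow IPC in smaller record batches
silk-chiffon transform \
  --from huge.parquet \
  --to huge.arrow \
  --arrow-record-batch-size 10000
```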
Install from a local checkout of the repository:

```
cargo install --path .
```

Or directly from Git:

```
cargo install --git https://github.com/acuitymd/silk-chiffon
```

You can download prebuilt binaries from each of our releases.
> [!IMPORTANT]
> macOS will correctly detect that the downloaded binary is unsigned and will graciously offer to yeet the entire binary. To remove this roadblock, unquarantine the binary with `xattr -d com.apple.quarantine /path/to/silk-chiffon`.
Convert Arrow to Parquet with compression and sorting:
```
silk-chiffon transform --from input.arrow --to output.parquet --parquet-compression zstd --sort-by amount
```

Convert between Arrow formats:

```
silk-chiffon transform --from stream.arrows --to file.arrow --arrow-compression lz4
```

Merge Arrow files into a single Parquet file:

```
silk-chiffon transform --from-many file1.arrow --from-many file2.arrow --to merged.parquet
```

Using glob patterns:

```
silk-chiffon transform --from-many '*.arrow' --to merged.parquet
```

Partition data by column values:

```
silk-chiffon transform --from data.arrow --to-many '{{region}}/data.parquet' --by region
```

Multi-column partitioning:

```
silk-chiffon transform --from data.arrow --to-many '{{year}}/{{month}}/data.parquet' --by year,month
```

Filter data with SQL queries:

```
silk-chiffon transform --from data.arrow --to filtered.parquet \
  --query "SELECT * FROM data WHERE amount > 1000 AND status = 'active'"
```

Convert input columns to different types, and even add new ones:

```
silk-chiffon transform --from data.arrow --to date_casted.parquet \
  --query "SELECT * EXCEPT (created_at), arrow_cast(created_at, 'Date32') AS created_at FROM data"

silk-chiffon transform --from data.arrow --to id_casted.parquet \
  --query "SELECT * EXCEPT (id), arrow_cast(id, 'Int32') AS id FROM data"

silk-chiffon transform --from data.arrow --to added_creation_year.parquet \
  --query "SELECT *, extract(year FROM created_at) AS creation_year FROM data"
```

Merge, filter, sort, and partition in one command:
```
silk-chiffon transform \
  --from-many 'source/*.arrow' \
  --to-many 'output/{{region}}/{{year}}.parquet' \
  --by region,year \
  --query "SELECT * FROM data WHERE status = 'active'" \
  --sort-by date:desc \
  --parquet-compression zstd
```

The `transform` command is your one-stop shop for all data transformations:

```
silk-chiffon transform [OPTIONS]
```

- `--from <PATH>` - Single input file path
- `--from-many <PATH>` - Multiple input files (supports glob patterns, can be specified multiple times)
- `--to <PATH>` - Single output file path
- `--to-many <TEMPLATE>` - Output path template for partitioning (e.g., `{{column}}.parquet`)
- `--by <COLUMNS>` - Column(s) to partition by (comma-separated, requires `--to-many`)
- `--query <SQL>` - SQL query to filter/transform data
- `--dialect <DIALECT>` - SQL dialect (duckdb, postgres, mysql, sqlite, etc.)
- `--sort-by <SPEC>` - Sort by columns (e.g., `date,amount:desc`)
- `--exclude-columns <COLS>` - Columns to exclude from output
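A sketch combining these query options (the file, column, and dialect choices are illustrative):

```
# filter with PostgreSQL-flavored SQL and drop a column from the output
silk-chiffon transform \
  --from data.arrow \
  --to trimmed.parquet \
  --dialect postgres \
  --query "SELECT * FROM data WHERE status = 'active'" \
  --exclude-columns internal_notes
```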
- `--input-format <FORMAT>` - Override input format detection (arrow, parquet)
- `--output-format <FORMAT>` - Override output format detection (arrow, parquet)
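These overrides help when a file's extension doesn't reveal its format. A sketch, assuming the input is Arrow IPC data behind a generic extension:

```
# name the input format explicitly, since .bin says nothing about it
silk-chiffon transform \
  --from export.bin \
  --input-format arrow \
  --to export.parquet
```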
- `--arrow-compression <CODEC>` - Compression codec (zstd, lz4, none)
- `--arrow-format <FORMAT>` - IPC format (file, stream)
- `--arrow-record-batch-size <SIZE>` - Record batch size
- `--parquet-compression <CODEC>` - Compression codec (zstd, snappy, gzip, lz4, none)
- `--parquet-row-group-size <SIZE>` - Maximum rows per row group
- `--parquet-statistics <LEVEL>` - Statistics level (none, chunk, page)
- `--parquet-writer-version <VERSION>` - Writer version (v1, v2)
- `--parquet-dictionary-all-off` - Disable dictionary encoding
- `--parquet-sorted-metadata` - Embed sorted metadata (requires `--sort-by`)
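A sketch of a tuned Parquet write using these flags (the values are illustrative, not recommendations):

```
silk-chiffon transform \
  --from data.arrow \
  --to tuned.parquet \
  --parquet-compression snappy \
  --parquet-row-group-size 100000 \
  --parquet-statistics page \
  --parquet-writer-version v2
```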
- `--vortex-record-batch-size <SIZE>` - Vortex record batch size
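And a minimal Vortex output sketch, assuming a `.vortex` extension is picked up by output format detection:

```
silk-chiffon transform \
  --from data.arrow \
  --to data.vortex \
  --vortex-record-batch-size 8192
```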
Bloom filters are automatically enabled for columns that keep dictionary encoding (low-cardinality columns); the NDV (number of distinct values) is determined by cardinality analysis. Use `--parquet-bloom-all-off` to disable bloom filters globally, or `--parquet-bloom-column-off` to exclude specific columns.
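For example, to opt out of bloom filters entirely (a sketch with illustrative paths):

```
silk-chiffon transform \
  --from data.arrow \
  --to no_bloom.parquet \
  --parquet-bloom-all-off
```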
Specifying NDV explicitly forces bloom filters ON regardless of dictionary state:
```
--parquet-bloom-all "ndv=10000"            # Force bloom on all columns with NDV=10000
--parquet-bloom-all "fpp=0.001,ndv=10000"  # Custom FPP and NDV
```

Specifying only the FPP (false positive probability) keeps the automatic dictionary-based decision:

```
--parquet-bloom-all "fpp=0.001"  # Custom FPP, auto NDV from analysis
```

Per-column configuration (overrides defaults):

```
--parquet-bloom-column "user_id:ndv=50000"            # Force bloom on with NDV
--parquet-bloom-column "user_id:fpp=0.001,ndv=50000"  # Custom FPP and NDV
--parquet-bloom-column-off "high_cardinality_col"     # Disable for specific column
```

- `--create-dirs` - Create output directories as needed (default: true)
- `--overwrite` - Overwrite existing files
- `--list-outputs <FORMAT>` - List output files after creation (text, json)
Generate shell completions for your shell:
```
# To add completions for your current shell session only

## zsh
eval "$(silk-chiffon completions zsh)"

## bash
eval "$(silk-chiffon completions bash)"

## fish
silk-chiffon completions fish | source

# To persist completions across sessions

## zsh
echo 'eval "$(silk-chiffon completions zsh)"' >> ~/.zshrc

## bash
echo 'eval "$(silk-chiffon completions bash)"' >> ~/.bashrc

## fish
silk-chiffon completions fish > ~/.config/fish/completions/silk-chiffon.fish
```

Sort by multiple columns:

```
silk-chiffon transform --from data.arrow --to sorted.parquet --sort-by timestamp:desc,user_id
```

Partition into Hive-style directories and list the outputs:

```
silk-chiffon transform \
  --from events.arrow \
  --to-many 'partitioned/year={{year}}/month={{month}}/data.parquet' \
  --by year,month \
  --list-outputs text
```

Aggregate with SQL:

```
silk-chiffon transform \
  --from transactions.arrow \
  --to summary.parquet \
  --query "SELECT region, DATE_TRUNC('month', date) as month, SUM(amount) as total FROM data GROUP BY region, month"
```

Tune bloom filters for high-cardinality columns:

```
# force bloom filters on high-cardinality ID columns with explicit NDV
silk-chiffon transform \
  --from-many 'logs/*.arrow' \
  --to optimized.parquet \
  --parquet-compression zstd \
  --parquet-bloom-column "user_id:fpp=0.001,ndv=1000000" \
  --parquet-bloom-column "session_id:fpp=0.001,ndv=5000000"
```

Merge, filter, derive, partition, and sort in a single pipeline:

```
silk-chiffon transform \
  --from-many 'raw_data_*.arrow' \
  --to-many 'processed/{{status}}/data_{{hash}}.parquet' \
  --by status \
  --query "SELECT *, CASE WHEN amount > 1000 THEN 'high' ELSE 'low' END as tier FROM data WHERE timestamp > '2024-01-01'" \
  --sort-by timestamp:desc \
  --parquet-compression zstd \
  --parquet-sorted-metadata \
  --list-outputs json
```

Common development tasks are driven by `just`:

```
just build
just test
just lint
just fmt
```

Silk Chiffon is open source software, licensed under the terms in [LICENSE](LICENSE).
Made with 🦀 and ❤️ by AcuityMD for the data community