This repository provides tools for Substrait.
Create and activate a conda environment with Python 3.9:
conda create -y -n substrait_tools -c conda-forge python=3.9 openjdk
conda activate substrait_tools
pip install substrait-tools
prepare_tpch_data
Parquet data written to /Users/richardtia/substrait-tools/tpch_data
Example usage with SQL:
produce_substrait --producer IsthmusProducer --schema ./schema.sql --query "select * from lineitem" --validate --validator_overrides 1002
Validating plan: True
Validator override levels: [1002]
substrait plan written to: Isthmus_substrait.json
Example usage with Ibis:
produce_substrait --producer IbisProducer --schema ./tables.py --query_type ibis --ibis_expr expr=./expr.py --validate --validator_overrides 1002 2001 1
Validating plan: True
Validator override levels: [1002, 2001, 1]
substrait plan written to: Ibis_substrait.json
Arguments:
--producer: Which substrait producer to generate the plan with.
--schema: Schema file (SQL or Ibis). For SQL, each CREATE TABLE command should be on its own line.
--query: SQL query.
--ibis_expr: Ibis expression. The argument should be passed as a key-value pair giving the Ibis expression function name and the Python file containing the function definition (--ibis_expr test_expr=./expression.py).
--validate: If set, the plan is run against the Substrait validator. Defaults to false.
--validator_overrides: Bypass error codes found by the Substrait validator. One or more integers separated by spaces.
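The way --validate and --validator_overrides combine on the command line can be pictured with a small argparse sketch. This is an illustration only, assuming a boolean flag plus a space-separated integer list as described above; it is not the parser that substrait-tools actually uses.

```python
import argparse

# Illustrative only: mirrors two of the documented flags, not the tool's real parser.
parser = argparse.ArgumentParser(prog="produce_substrait")
# --validate takes no value; it is False unless the flag is present.
parser.add_argument("--validate", action="store_true")
# --validator_overrides accepts one or more space-separated integers.
parser.add_argument("--validator_overrides", nargs="+", type=int, default=[])

args = parser.parse_args(["--validate", "--validator_overrides", "1002", "2001", "1"])
print(args.validate, args.validator_overrides)  # True [1002, 2001, 1]
```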
Sample schemas:
SQL: schema.sql
CREATE TABLE lineitem(l_orderkey INTEGER NOT NULL, l_partkey INTEGER NOT NULL, l_suppkey INTEGER NOT NULL, l_linenumber INTEGER NOT NULL, l_quantity INTEGER NOT NULL, l_extendedprice DECIMAL(15,2) NOT NULL, l_discount DECIMAL(15,2) NOT NULL, l_tax DECIMAL(15,2) NOT NULL, l_returnflag VARCHAR NOT NULL, l_linestatus VARCHAR NOT NULL, l_shipdate DATE NOT NULL, l_commitdate DATE NOT NULL, l_receiptdate DATE NOT NULL, l_shipinstruct VARCHAR NOT NULL, l_shipmode VARCHAR NOT NULL, l_comment VARCHAR NOT NULL);
Note: Each CREATE TABLE command should be on its own line.
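Since a statement split across lines is easy to miss, a quick pre-check of the schema file can help. The helper below is a hypothetical sketch (find_bad_schema_lines is not part of substrait-tools) that flags any non-empty line that does not look like a complete one-line CREATE TABLE statement.

```python
import re

# Hypothetical helper, not part of substrait-tools.
# Matches a full CREATE TABLE statement contained on a single line.
CREATE_TABLE_LINE = re.compile(
    r"^\s*CREATE\s+TABLE\s+\w+\s*\(.*\)\s*;?\s*$", re.IGNORECASE
)


def find_bad_schema_lines(schema_text):
    """Return (line_number, line) pairs that are not one-line CREATE TABLE statements."""
    bad = []
    for number, line in enumerate(schema_text.splitlines(), start=1):
        # Blank lines are ignored; anything else must match the pattern.
        if line.strip() and not CREATE_TABLE_LINE.match(line):
            bad.append((number, line))
    return bad
```

Calling find_bad_schema_lines(open("schema.sql").read()) before running produce_substrait would return an empty list for a schema that follows the rule.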
Ibis: ibis_schema.py
import ibis
import ibis.expr.datatypes as dt

# Schema for the TPC-H lineitem table as an unbound Ibis table.
lineitem = ibis.table(
    [
        ("l_orderkey", dt.int64),
        ("l_partkey", dt.int64),
        ("l_suppkey", dt.int64),
        ("l_linenumber", dt.int64),
        ("l_quantity", dt.Decimal(15, 2)),
        ("l_extendedprice", dt.Decimal(15, 2)),
        ("l_discount", dt.Decimal(15, 2)),
        ("l_tax", dt.Decimal(15, 2)),
        ("l_returnflag", dt.string),
        ("l_linestatus", dt.string),
        ("l_shipdate", dt.date),
        ("l_commitdate", dt.date),
        ("l_receiptdate", dt.date),
        ("l_shipinstruct", dt.string),
        ("l_shipmode", dt.string),
        ("l_comment", dt.string),
    ],
    name="lineitem",
)
--schema ./ibis_schema.py
Sample ibis expression file:
expression.py
def expr(lineitem):
    new_col = lineitem.l_tax.acos().name("ACOS_TAX")
    return lineitem[new_col]
--query_type ibis --ibis_expr expr=./expression.py
Note: The key name (expr) should be the same as the function name.
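To see why the key must match the function name, it helps to picture how a name=path pair could be resolved. The sketch below is an assumption about the general approach, not the tool's actual code: load_expr_function is a hypothetical helper that imports the file and looks the function up by the key.

```python
import importlib.util


def load_expr_function(pair):
    """Split 'name=path' and load the function called `name` from the file at `path`.

    Hypothetical helper for illustration; not part of substrait-tools.
    """
    func_name, _, file_path = pair.partition("=")
    if not func_name or not file_path:
        raise ValueError(f"expected name=path, got {pair!r}")

    # Import the Python file as a throwaway module.
    spec = importlib.util.spec_from_file_location("ibis_expr_module", file_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    # The key is used as the attribute name, so it must match the function.
    try:
        return getattr(module, func_name)
    except AttributeError:
        raise ValueError(
            f"{file_path} does not define a function named {func_name!r}"
        ) from None
```

With the sample file above, load_expr_function("expr=./expression.py") would return the expr function, while a mismatched key such as "other=./expression.py" would raise a ValueError.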
Example usage:
consume_substrait --consumer DuckDBConsumer --substrait_plan ./Isthmus_substrait.json --table_file_pair lineitem=./lineitem.parquet region=./region.parquet
Arguments:
--consumer: Which substrait consumer to consume the plan with.
--substrait_plan: JSON-formatted Substrait plan.
--table_file_pair: One or more table name and file pairings. The table name and file should be separated by an equals (=) sign.
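The table=file format can be illustrated with a short parsing sketch. parse_table_file_pairs is a hypothetical helper written for this example, assuming each pair contains exactly one table name and one path; it is not taken from substrait-tools.

```python
def parse_table_file_pairs(pairs):
    """Turn ['table=path', ...] into a {table: path} dict.

    Hypothetical helper for illustration; not part of substrait-tools.
    """
    mapping = {}
    for pair in pairs:
        # Split on the first '=' only, so paths may themselves contain '='.
        table, sep, path = pair.partition("=")
        if not sep or not table or not path:
            raise ValueError(f"expected table=file, got {pair!r}")
        mapping[table] = path
    return mapping


pairs = parse_table_file_pairs(
    ["lineitem=./lineitem.parquet", "region=./region.parquet"]
)
print(pairs)
```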