All experiments can be run in a Docker container.

## Requirements

- Docker
- GPU/CUDA environment (for training)

## Setup

Dependencies are automatically installed while building the Docker image.
```sh
# on host
git clone https://github.com/OnizukaLab/Scardina.git
cd Scardina
docker build -t scardina .
docker run --rm --gpus all -v `pwd`:/workspaces/scardina -it scardina bash

# in container
poetry shell

# in poetry env in container
./scripts/dowload_imdb.sh
```
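Before training, it is worth confirming that the GPU is actually visible from inside the container. The check below assumes the Poetry environment provides PyTorch (the `.pt` checkpoints suggest this, though it is not stated here):

```sh
# in poetry env in container: should print True if CUDA is usable
python -c "import torch; print(torch.cuda.is_available())"
```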
## Training

Choose between hyperparameter search with Optuna and manually specified parameters.
```sh
# train w/ hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp --n-trials=10 -e=20

# train w/o hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp -e=20 --d-word=64 --d-ff=256 --n-ff=4 --lr=5e-4
```
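The same flow applies to the Transformer model type. A manually specified run might look like the sketch below; the flag values are illustrative, not recommended settings (see Options below):

```sh
# train a Transformer-based model w/o hyperparameter search (illustrative values)
python scardina/run.py --train -d=imdb -t=trm -e=20 --n-blocks=4 --n-heads=8 --d-word=64 --d-ff=256 --warmups=5
```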
## Evaluation

```sh
# evaluation
# Note: When default (-s=cin), model path should be like:
# "models/imdb/mlp-cin/yyyyMMddHHmmss/nar-mlp-imdb-{}-yyyyMMddHHmmss.pt".
# "{}" is literally "{}", a placeholder string used to specify multiple models
python scardina/run.py --eval -d=imdb -b=job-light -t=mlp -m={path/to/model.pt}
```
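Because `"{}"` stands in for a per-submodel identifier, the checkpoints produced by one training run can be listed by substituting a wildcard for the placeholder (the timestamp below is hypothetical):

```sh
# list per-submodel checkpoints matching the templated path (hypothetical timestamp)
ls models/imdb/mlp-cin/20230401123456/nar-mlp-imdb-*-20230401123456.pt
```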
You can find results in `results/<benchmark_name>` after the trial.
## Options

- `-d`/`--dataset`: Dataset name
- `-t`/`--model-type`: Internal model type (`mlp` for MLP or `trm` for Transformer)
- `-s`/`--schema-strategy`: Internal subschema type (`cin` for Closed In-neighborhood Partitioning (Scardina) or `ur` for Universal Relation)
- `--seed`: Random seed (default: `1234`)
- `--n-blocks`: The number of blocks (for Transformer)
- `--n-heads`: The number of heads (for Transformer)
- `--d-word`: Embedding dimension
- `--d-ff`: Width of feedforward networks
- `--n-ff`: The number of feedforward networks (for MLP)
- `--fact-threshold`: Column factorization threshold (default: `2000`)
- `-e`/`--epochs`: The number of training epochs
- `--batch-size`: Batch size (default: `1024`)
(w/ hyperparameter search)

- `--n-trials`: The number of trials for hyperparameter search
(w/ specified parameters)

- `--lr`: Learning rate
- `--warmups`: Warm-up epochs (for Transformer; `--lr` and `--warmups` are mutually exclusive)
(for evaluation)

- `-m`/`--model`: Path to model
- `-b`/`--benchmark`: Benchmark name
- `--eval-sample-size`: Sample size for evaluation
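For the evaluation-only options, a run that overrides the sample size might look like this (the value is illustrative, not a recommendation):

```sh
# evaluate with an explicit sample size (illustrative value)
python scardina/run.py --eval -d=imdb -b=job-light -t=mlp -m={path/to/model.pt} --eval-sample-size=1000
```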
- Datasets
  - IMDb
    - `imdb`: (almost) All data of IMDb
    - `imdb-job-light`: Subset of IMDb for the JOB-light benchmark
- Benchmarks
  - IMDb
    - `job-light`: JOB-light benchmark
- Models
  - `mlp`: MLP-based denoising autoencoder
  - `trm`: Transformer-based denoising autoencoder
## Citation

```bibtex
@article{scardina,
  author  = {Ito, Ryuichi and Sasaki, Yuya and Xiao, Chuan and Onizuka, Makoto},
  title   = {{Scardina: Scalable Join Cardinality Estimation by Multiple Density Estimators}},
  journal = {{arXiv preprint arXiv:2303.18042}},
  year    = {2023}
}
```