All experiments can be run in a Docker container.

## Requirements

- Docker
- GPU/CUDA environment (for training)

## Setup

Dependencies are automatically installed while building the Docker image.
```sh
# on host
git clone https://github.com/OnizukaLab/Scardina.git
cd Scardina
docker build -t scardina .
docker run --rm --gpus all -v `pwd`:/workspaces/scardina -it scardina bash

# in container
poetry shell

# in poetry env in container
./scripts/dowload_imdb.sh
```
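Before training, it is worth confirming that the GPU is actually visible from inside the container. The check below assumes the Poetry environment provides PyTorch (the `.pt` checkpoints suggest this, though it is not stated here):

```sh
# in poetry env in container: should print True if CUDA is usable
python -c "import torch; print(torch.cuda.is_available())"
```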
## Training

Choose between hyperparameter search with Optuna and manually specified parameters.
```sh
# train w/ hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp --n-trials=10 -e=20

# train w/o hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp -e=20 --d-word=64 --d-ff=256 --n-ff=4 --lr=5e-4
```
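The same flow applies to the Transformer model type. A manually specified run might look like the sketch below; the flag values are illustrative, not recommended settings (see Options below):

```sh
# train a Transformer-based model w/o hyperparameter search (illustrative values)
python scardina/run.py --train -d=imdb -t=trm -e=20 --n-blocks=4 --n-heads=8 --d-word=64 --d-ff=256 --warmups=5
```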
## Evaluation

```sh
# evaluation
# Note: When default (-s=cin), model path should be like:
# "models/imdb/mlp-cin/yyyyMMddHHmmss/nar-mlp-imdb-{}-yyyyMMddHHmmss.pt".
# "{}" is literally "{}", a placeholder string used to specify multiple models
python scardina/run.py --eval -d=imdb -b=job-light -t=mlp -m={path/to/model.pt}
```
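Because `"{}"` stands in for a per-submodel identifier, the checkpoints produced by one training run can be listed by substituting a wildcard for the placeholder (the timestamp below is hypothetical):

```sh
# list per-submodel checkpoints matching the templated path (hypothetical timestamp)
ls models/imdb/mlp-cin/20230401123456/nar-mlp-imdb-*-20230401123456.pt
```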
You can find results in `results/<benchmark_name>` after the trial.
## Options

- `-d`/`--dataset`: Dataset name
- `-t`/`--model-type`: Internal model type (`mlp` for MLP or `trm` for Transformer)
- `-s`/`--schema-strategy`: Internal subschema type (`cin` for Closed In-neighborhood Partitioning (Scardina) or `ur` for Universal Relation)
- `--seed`: Random seed (default: `1234`)
- `--n-blocks`: The number of blocks (for Transformer)
- `--n-heads`: The number of heads (for Transformer)
- `--d-word`: Embedding dimension
- `--d-ff`: Width of feedforward networks
- `--n-ff`: The number of feedforward networks (for MLP)
- `--fact-threshold`: Column factorization threshold (default: `2000`)
- `-e`/`--epochs`: The number of training epochs
- `--batch-size`: Batch size (default: `1024`)
(w/ hyperparameter search)

- `--n-trials`: The number of trials for hyperparameter search
(w/ specified parameters)

- `--lr`: Learning rate
- `--warmups`: Warm-up epochs (for Transformer; `--lr` and `--warmups` are mutually exclusive)
(for evaluation)

- `-m`/`--model`: Path to model
- `-b`/`--benchmark`: Benchmark name
- `--eval-sample-size`: Sample size for evaluation
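For the evaluation-only options, a run that overrides the sample size might look like this (the value is illustrative, not a recommendation):

```sh
# evaluate with an explicit sample size (illustrative value)
python scardina/run.py --eval -d=imdb -b=job-light -t=mlp -m={path/to/model.pt} --eval-sample-size=1000
```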
- Datasets
  - IMDb
    - `imdb`: (almost) All data of IMDb
    - `imdb-job-light`: Subset of IMDb for the JOB-light benchmark
- Benchmarks
  - IMDb
    - `job-light`: JOB-light benchmark
- Models
  - `mlp`: MLP-based denoising autoencoder
  - `trm`: Transformer-based denoising autoencoder
## Citation

```bibtex
@article{scardina,
  author  = {Ito, Ryuichi and Sasaki, Yuya and Xiao, Chuan and Onizuka, Makoto},
  title   = {{Scardina: Scalable Join Cardinality Estimation by Multiple Density Estimators}},
  journal = {{arXiv preprint arXiv:2303.18042}},
  year    = {2023}
}
```