- arXiv article: "Tabular GANs for uneven distribution"
- Medium post: GANs for tabular data
- Installation:
```
pip install tabgan
```
- To generate new training data by sampling and then filtering through adversarial training, call GANGenerator().generate_data_pipe:
```python
from tabgan.sampler import OriginalGenerator, GANGenerator
import pandas as pd
import numpy as np

# random input data
train = pd.DataFrame(np.random.randint(-10, 150, size=(150, 4)), columns=list("ABCD"))
target = pd.DataFrame(np.random.randint(0, 2, size=(150, 1)), columns=list("Y"))
test = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))

# generate data
new_train1, new_target1 = OriginalGenerator().generate_data_pipe(train, target, test)
new_train2, new_target2 = GANGenerator().generate_data_pipe(train, target, test)

# example with all params defined
new_train3, new_target3 = GANGenerator(
    gen_x_times=1.1,
    cat_cols=None,
    bot_filter_quantile=0.001,
    top_filter_quantile=0.999,
    is_post_process=True,
    adversarial_model_params={
        "metrics": "AUC",
        "max_depth": 2,
        "max_bin": 100,
        "learning_rate": 0.02,
        "random_state": 42,
        "n_estimators": 500,
    },
    pregeneration_frac=2,
    only_generated_data=False,
    gan_params={"batch_size": 500, "patience": 25, "epochs": 500},
).generate_data_pipe(train, target, test, deep_copy=True, only_adversarial=False, use_adversarial=True)
```
Both samplers, OriginalGenerator and GANGenerator, have the same input parameters:
- gen_x_times: float = 1.1 - how much data to generate; the output may be smaller because of post-processing and adversarial filtering
- cat_cols: list = None - categorical columns
- bot_filter_quantile: float = 0.001 - bottom quantile for post-processing filtering
- top_filter_quantile: float = 0.999 - top quantile for post-processing filtering
- is_post_process: bool = True - whether to perform post-filtering; if False, bot_filter_quantile and top_filter_quantile are ignored
- adversarial_model_params: dict - parameters for the adversarial filtering model; default values are set for a binary task
- pregeneration_frac: float = 2 - at the generation step, gen_x_times * pregeneration_frac times the original amount of data is generated; however, after post-processing roughly (1 + gen_x_times) times the original data is returned (see the sketch after this list)
- gan_params: dict - parameters for GAN training
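A quick back-of-the-envelope illustration of how pregeneration_frac and gen_x_times interact, based purely on the parameter descriptions above (the library internals may differ slightly):

```python
# Illustrative only: row counts implied by the parameter descriptions above,
# using the defaults gen_x_times=1.1 and pregeneration_frac=2.
n_train = 150                    # rows in the original train dataframe
gen_x_times = 1.1
pregeneration_frac = 2

# rows pre-generated before post-processing and adversarial filtering
pregenerated = int(n_train * gen_x_times * pregeneration_frac)  # 330

# rows returned after post-processing: roughly the original train plus gen_x_times extra
returned = int(n_train * (1 + gen_x_times))                     # 315

print(pregenerated, returned)
```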
Parameters of the generate_data_pipe method:
- train_df: pd.DataFrame - train dataframe, with the target kept separately
- target: pd.DataFrame - input target for the train dataset
- test_df: pd.DataFrame - test dataframe; the newly generated train dataframe should be close to it
- deep_copy: bool = True - whether to make a copy of the input dataframes; if False, the input dataframes will be overridden
- only_adversarial: bool = False - perform only adversarial filtering on the train dataframe (see the example after this list)
- use_adversarial: bool = True - whether to perform adversarial filtering
- only_generated_data: bool = False - after generation, return only the newly generated rows without concatenating the input train dataframe
- @return: Tuple[pd.DataFrame, pd.DataFrame] - the newly generated train dataframe and its target
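For example, combining the flags documented above, you can keep only the original train rows and filter them by their adversarial similarity to the test set. A minimal sketch reusing the train, target and test dataframes from the first example (per the flag descriptions above, only adversarial filtering of the train dataframe is performed in this mode):

```python
from tabgan.sampler import GANGenerator

# Only adversarial filtering of the train dataframe is performed
# (per the only_adversarial flag description above); inputs are copied first.
filtered_train, filtered_target = GANGenerator().generate_data_pipe(
    train, target, test,
    deep_copy=True,
    only_adversarial=True,
    use_adversarial=True,
)
```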
Thus, you may use this library to improve your dataset quality:
```python
import pandas as pd
import sklearn.datasets
import sklearn.ensemble
import sklearn.metrics
import sklearn.model_selection
from tabgan.sampler import OriginalGenerator, GANGenerator

def fit_predict(clf, X_train, y_train, X_test, y_test):
    clf.fit(X_train, y_train)
    return sklearn.metrics.roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

dataset = sklearn.datasets.load_breast_cancer()
clf = sklearn.ensemble.RandomForestClassifier(n_estimators=25, max_depth=6)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    pd.DataFrame(dataset.data), pd.DataFrame(dataset.target, columns=["target"]),
    test_size=0.33, random_state=42)
print("initial metric", fit_predict(clf, X_train, y_train, X_test, y_test))

new_train1, new_target1 = OriginalGenerator().generate_data_pipe(X_train, y_train, X_test)
print("OriginalGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))
new_train1, new_target1 = GANGenerator().generate_data_pipe(X_train, y_train, X_test)
print("GANGenerator metric", fit_predict(clf, new_train1, new_target1, X_test, y_test))
```
You can easily adjust the code to generate multidimensional time-series data. Basically, it extracts the day, month and year from a date column. A demo of how to use it is shown in the example below:
```python
import pandas as pd
import numpy as np
from tabgan.utils import get_year_mnth_dt_from_date, make_two_digit, collect_dates
from tabgan.sampler import OriginalGenerator, GANGenerator

train_size = 100
train = pd.DataFrame(
    np.random.randint(-10, 150, size=(train_size, 4)), columns=list("ABCD")
)

# add a random date column
min_date = pd.to_datetime('2019-01-01')
max_date = pd.to_datetime('2021-12-31')
d = (max_date - min_date).days + 1
train['Date'] = min_date + pd.to_timedelta(np.random.randint(d, size=train_size), unit='d')
train = get_year_mnth_dt_from_date(train, 'Date')

new_train, new_target = GANGenerator(
    gen_x_times=1.1, cat_cols=['year'], bot_filter_quantile=0.001,
    top_filter_quantile=0.999, is_post_process=True,
    pregeneration_frac=2, only_generated_data=False,
).generate_data_pipe(train.drop('Date', axis=1), None, train.drop('Date', axis=1))
new_train = collect_dates(new_train)
```
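For reference, here is a rough plain-pandas approximation of the date handling used above. It only mirrors the described behaviour (splitting a date into year, month and day columns and later reassembling them); it is not the library's implementation of get_year_mnth_dt_from_date or collect_dates:

```python
import pandas as pd

def split_date(df: pd.DataFrame, date_col: str) -> pd.DataFrame:
    # Approximation: expand a datetime column into year / month / day columns
    out = df.copy()
    out["year"] = out[date_col].dt.year
    out["month"] = out[date_col].dt.month
    out["day"] = out[date_col].dt.day
    return out

def join_date(df: pd.DataFrame) -> pd.DataFrame:
    # Approximation: rebuild a Date column from year / month / day columns;
    # generated values may form invalid dates, hence errors="coerce"
    out = df.copy()
    out["Date"] = pd.to_datetime(out[["year", "month", "day"]], errors="coerce")
    return out
```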
Running the experiment
To run the experiment, follow these steps:
- Clone the repository. All required datasets are stored in the ./Research/data folder.
- Install the requirements: pip install -r requirements.txt
- Run all experiments: python ./Research/run_experiment.py. You may add more datasets and adjust the validation type and categorical encoders.
- Observe the metrics across all experiments in the console or in ./Research/results/fit_predict_scores.txt
Task formalization
Let's say we have T_train and T_test (train and test sets respectively). We need to train the model on T_train and make predictions on T_test. However, we will enlarge the train set by generating new data with a GAN, somewhat similar to T_test, without using its ground truth labels.
Experiment design
In the case of a smaller T_train with a different data distribution, we can use CTGAN to generate additional data T_synth. First, we train CTGAN on T_train with ground truth labels (step 1) and generate additional data T_synth (step 2). Then we train boosting in an adversarial way on the concatenation of T_train and T_synth (target set to 0) with T_test (target set to 1) (steps 3 & 4). The goal is to apply the newly trained adversarial boosting to obtain rows that look more like T_test. Note that the initial ground truth labels are not used for adversarial training. Finally, we take the top rows from T_train and T_synth, sorted by their similarity to T_test (steps 5 & 6), train a new boosting model on them, and check the results on T_test.
Picture 1.1 Experiment design and workflow
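A minimal sketch of the adversarial filtering idea (steps 3-6), assuming numerical or already-encoded features and using scikit-learn's GradientBoostingClassifier as a stand-in for the boosting model; this illustrates the approach rather than the library's exact implementation:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def adversarial_filter(candidates: pd.DataFrame, test: pd.DataFrame,
                       keep_frac: float = 0.5) -> pd.DataFrame:
    """Keep the candidate rows (T_train + T_synth) that look most like T_test."""
    # Label candidates 0 and test rows 1; ground truth labels are not used here.
    X = pd.concat([candidates, test], ignore_index=True)
    y = np.r_[np.zeros(len(candidates)), np.ones(len(test))]

    clf = GradientBoostingClassifier(max_depth=2, n_estimators=100)
    clf.fit(X, y)

    # Probability of belonging to the test distribution, for candidate rows only
    p_test = clf.predict_proba(candidates)[:, 1]

    # Keep the top keep_frac of candidates most similar to T_test
    n_keep = int(len(candidates) * keep_frac)
    keep_idx = np.argsort(-p_test)[:n_keep]
    return candidates.iloc[keep_idx]
```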
Of course, for benchmark purposes we will also test ordinary training without these tricks, as well as the same pipeline but without CTGAN (in step 3 we won't use T_synth).
Datasets
All datasets come from different domains. They differ in the number of observations and in the number of categorical and numerical features. The objective for all datasets is binary classification. Preprocessing was simple: all time-based columns were removed; the remaining columns are either categorical or numerical.
Table 1.1 Used datasets
Name | Total points | Train points | Test points | Number of features | Number of categorical features | Short description |
---|---|---|---|---|---|---|
Telecom | 7.0k | 4.2k | 2.8k | 20 | 16 | Churn prediction for telecom data |
Adult | 48.8k | 29.3k | 19.5k | 15 | 8 | Predict whether a person's income exceeds 50k |
Employee | 32.7k | 19.6k | 13.1k | 10 | 9 | Predict an employee's access needs, given his/her job role |
Credit | 307.5k | 184.5k | 123k | 121 | 18 | Loan repayment |
Mortgages | 45.6k | 27.4k | 18.2k | 20 | 9 | Predict whether a house mortgage is funded |
Taxi | 892.5k | 535.5k | 357k | 8 | 5 | Predict the probability of an offer being accepted by a certain driver |
Poverty_A | 37.6k | 22.5k | 15.0k | 41 | 38 | Predict whether a given household in a given country is poor |
To determine the best sampling strategy, the ROC AUC scores for each dataset were scaled (min-max scaling) and then averaged across the datasets.
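A short sketch of that aggregation under a literal reading (per-dataset min-max scaling of the ROC AUC, then averaging per sampling strategy), with hypothetical numbers:

```python
import pandas as pd

# Hypothetical scores: one ROC AUC per dataset / sampling-strategy pair
scores = pd.DataFrame({
    "dataset": ["credit"] * 3 + ["adult"] * 3,
    "strategy": ["None", "gan", "sample_original"] * 2,
    "roc_auc": [0.780, 0.790, 0.780, 0.910, 0.880, 0.920],
})

# Min-max scale ROC AUC within each dataset, then average across datasets per strategy
def minmax(s: pd.Series) -> pd.Series:
    return (s - s.min()) / (s.max() - s.min())

scores["scaled"] = scores.groupby("dataset")["roc_auc"].transform(minmax)
print(scores.groupby("strategy")["scaled"].mean())
```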
Table 1.2 Different sampling results across the datasets, higher is better (100% corresponds to the maximum per-dataset ROC AUC)
dataset_name | None | gan | sample_original |
---|---|---|---|
credit | 0.997 | 0.998 | 0.997 |
employee | 0.986 | 0.966 | 0.972 |
mortgages | 0.984 | 0.964 | 0.988 |
poverty_A | 0.937 | 0.950 | 0.933 |
taxi | 0.966 | 0.938 | 0.987 |
adult | 0.995 | 0.967 | 0.998 |
telecom | 0.995 | 0.868 | 0.992 |
Table 1.3 Different sampling results: higher is better for the mean (ROC AUC), lower is better for the std (100% corresponds to the maximum per-dataset ROC AUC)
sample_type | mean | std |
---|---|---|
None | 0.980 | 0.036 |
gan | 0.969 | 0.06 |
sample_original | 0.981 | 0.032 |
Table 1.4 same_target_prop equals 1 when the target rates of train and test differ by no more than 5%. Higher is better.
sample_type | same_target_prop | prop_test_score |
---|---|---|
None | 0 | 0.964 |
None | 1 | 0.985 |
gan | 0 | 0.966 |
gan | 1 | 0.945 |
sample_original | 0 | 0.973 |
sample_original | 1 | 0.984 |
The author would like to thank the Open Data Science community [8] for many valuable discussions and educational help in the growing field of machine and deep learning.
If you use GAN-for-tabular-data in a scientific publication, we would appreciate references to the following BibTeX entries.
arXiv publication:
```
@misc{ashrapov2020tabular,
  title={Tabular GANs for uneven distribution},
  author={Insaf Ashrapov},
  year={2020},
  eprint={2010.00638},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```
library itself:
```
@misc{Diyago2020tabgan,
  author = {Ashrapov, Insaf},
  title = {GANs for tabular data},
  howpublished = {\url{https://github.com/Diyago/GAN-for-tabular-data}},
  year = {2020}
}
```
[1] Jonathan Hui. GAN — What is Generative Adversarial Networks GAN? (2018). Medium post.
[2] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio. Generative Adversarial Networks (2014). arXiv:1406.2661
[3] Lei Xu, Kalyan Veeramachaneni. Synthesizing Tabular Data using Generative Adversarial Networks (2018). arXiv:1811.11264v1 [cs.LG]
[4] Lei Xu, Maria Skoularidou, Alfredo Cuesta-Infante, Kalyan Veeramachaneni. Modeling Tabular Data using Conditional GAN (2019). arXiv:1907.00503v2 [cs.LG]
[5] Denis Vorotyntsev. Benchmarking Categorical Encoders. Medium post.
[6] Insaf Ashrapov. GAN-for-tabular-data. GitHub repository.
[7] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, Timo Aila. Analyzing and Improving the Image Quality of StyleGAN (2019). arXiv:1912.04958v2 [cs.CV]
[8] ODS.ai: Open Data Science, https://ods.ai/