A semantic join operator that attempts to maximize the number of valid value matches captured between the value sets of two columns
Matchmaker is a reinforcement learning-based semantic join operator designed to maximize the number of valid value matches between two column value sets. It uses Reinforcement Learning (RL) to train an agent that dynamically selects and combines multiple matching algorithms based on input features, balancing matching accuracy against execution cost.
This project employs Reinforcement Learning (the PPO algorithm, Proximal Policy Optimization) to train an agent that can dynamically select from the following matching algorithms:
- lexical: Lexical-based matching algorithm (lightweight)
- semantic: Semantic similarity-based matching algorithm (moderate cost)
- llm: Large Language Model-based reasoning matching algorithm (expensive)
- shingles: Character n-gram-based matching algorithm (lightweight)
- regex: Regular expression-based matching algorithm (moderate cost)
- identity: Exact matching algorithm (lightweight)
- accent_fold: Accent folding-based matching algorithm (lightweight)
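To make the lightweight end of this spectrum concrete, here is an illustrative sketch (not the project's actual code) of a character n-gram ("shingles") matcher, scored with Jaccard similarity; the function names and the choice of n=3 are assumptions for illustration:

```python
# Illustrative shingles matcher: compare two values by the overlap of
# their character n-gram sets (Jaccard similarity). Cheap to compute,
# no model calls required.

def shingles(value: str, n: int = 3) -> set:
    """Return the set of character n-grams of a lowercased string."""
    v = value.lower()
    return {v[i:i + n] for i in range(len(v) - n + 1)}

def shingle_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the n-gram sets of two values."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```

A threshold on this score (e.g. accept pairs above 0.5) would turn the similarity into a binary match decision.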
The agent observes features of the input values (such as edit distance and semantic similarity) and learns to select the combination of matching algorithms that maximizes matching accuracy within a given cost budget.
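As a minimal sketch of the budget-constrained selection the agent learns, the following stands in for the learned PPO policy with a fixed cheap-to-expensive cascade; the matcher costs and the `run_cascade` helper are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical relative costs for the matchers listed above
# (lightweight = 1, moderate = 5, LLM = 50; illustrative numbers only).
MATCHER_COSTS = {
    "identity": 1, "lexical": 1, "shingles": 1, "accent_fold": 1,
    "regex": 5, "semantic": 5, "llm": 50,
}

def run_cascade(left: str, right: str, matchers: dict, budget: int) -> bool:
    """Try matchers in ascending cost order until one reports a match
    or the remaining budget cannot cover the next matcher."""
    spent = 0
    for name in sorted(matchers, key=MATCHER_COSTS.get):
        cost = MATCHER_COSTS[name]
        if spent + cost > budget:
            break
        spent += cost
        if matchers[name](left, right):
            return True
    return False
```

In the actual system, the RL agent replaces this fixed ordering: it picks which matchers to invoke per value pair, conditioned on the observed features and the remaining budget.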
First, install the required dependencies:
```
pip install -r requirements.txt
```

The project supports four datasets: autofj, ss, wt, and kbwt.
To run:
```
python main.py <dataset_name>
```

Where <dataset_name> can be one of:
- autofj
- ss
- wt
- kbwt
Example:
```
# Train and evaluate on the autofj dataset
python main.py autofj

# Train and evaluate on the ss dataset
python main.py ss
```

The program will:
- Automatically read and format the specified dataset
- Train the reinforcement learning agent (if model checkpoint doesn't exist)
- Evaluate the agent's performance on the test set
- Output metrics including accuracy, precision, recall, and F1 score
Trained models are saved in the model_checkpoints/<dataset_name>/ directory. Evaluation results and metrics are saved in metrics_output.txt.
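The reported metrics can be computed from the binary match decisions in the usual way; this is a generic sketch (the exact format of metrics_output.txt is not specified here):

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 over boolean match decisions."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))           # true positives
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))     # false positives
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))     # false negatives
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))  # true negatives
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```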
We provide experimental results from two baseline methods, located at:
Results file: baseline/dtt_result/dtt_result.txt
This file contains performance metrics of the DTT (Deep Table Transformation) method on four datasets, including:
- Accuracy
- Precision
- Recall
- F1 Score
- Average edit distance
- Runtime
Results file: baseline/autofuzzy_result/autoFuzzy_result.txt
This file contains performance metrics of the AutoFuzzy method on four datasets, including:
- Accuracy
- Precision
- Recall
- F1 Score
- Number of datasets and result sizes
Both baseline results cover the four datasets (autofj, ss, wt, and kbwt) and provide overall average performance metrics, which can be used for comparison with our reinforcement learning approach.