Configuration

Datasets, matchers, tokenizers, evaluation metrics, logging, and general settings can be configured via src/main/resources/*.yaml. The project uses SnakeYAML to load the configuration into Java objects. You can query the configuration singleton via Configuration.getInstance(). Please see general.yaml for a list of general configuration parameters and their documentation.

Matching Steps

To run only specific steps, you can control their execution via these configuration parameters in general.yaml:

# Step 2: run first line matchers (i.e., matchers that use table data to match)
saveOutputFirstLineMatchers: True
evaluateFirstLineMatchers: True
readCacheFirstLineMatchers: False
writeCacheFirstLineMatchers: False
# Step 3: run similarity matrix boosting on the output of first line matchers
runSimMatrixBoostingOnFirstLineMatchers: True
saveOutputSimMatrixBoostingOnFirstLineMatchers: True
evaluateSimMatrixBoostingOnFirstLineMatchers: True
readCacheSimMatrixBoostingOnFirstLineMatchers: False
writeCacheSimMatrixBoostingOnFirstLineMatchers: False
# Step 4: run second line matchers (ensemble matchers and other matchers using output of first line matchers)
runSecondLineMatchers: True
saveOutputSecondLineMatchers: True
evaluateSecondLineMatchers: True
readCacheSecondLineMatchers: False
writeCacheSecondLineMatchers: False
# Step 5: run similarity matrix boosting on the output of second line matchers
runSimMatrixBoostingOnSecondLineMatchers: True
saveOutputSimMatrixBoostingOnSecondLineMatchers: True
evaluateSimMatrixBoostingOnSecondLineMatchers: True
readCacheSimMatrixBoostingOnSecondLineMatchers: False
writeCacheSimMatrixBoostingOnSecondLineMatchers: False

Note that the following global settings in general.yaml apply across all of these steps:

# evaluate performance for each attribute and attribute pair in ground truth
# applies to all matching steps for which evaluation is enabled (see below)
evaluateAttributes: True
# write outputs per table pair
# applies to all matching steps for which output saving is enabled (see below)
# WARNING: greatly increases size of results directory
saveOutputPerTablePair: False
# adds header and index with attribute names to output files
# applies to all matching steps for which output saving is enabled (see below)
saveOutputVerbose: True

Note that all other steps depend on the first-line matching step, and each similarity matrix boosting step depends on its respective matching step.
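The dependency rule can be sketched in a few lines of Java. This is only an illustration: StepPlan and effectiveSteps are hypothetical names, not part of the project, and the sketch assumes that first-line matching always runs (there is no run flag for it in the list above) and that a boosting step is skipped when the matching step it builds on is disabled. How the project actually reacts to such a combination is not documented here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of the project. */
final class StepPlan {

    /**
     * Derives which steps can effectively run from the three run* flags.
     * Assumption: first-line matching always runs; boosting on second-line
     * matchers additionally needs second-line matching to be enabled.
     */
    static Map<String, Boolean> effectiveSteps(boolean runBoostingOnFirstLine,
                                               boolean runSecondLine,
                                               boolean runBoostingOnSecondLine) {
        Map<String, Boolean> plan = new LinkedHashMap<>();
        plan.put("firstLineMatchers", true);
        plan.put("simMatrixBoostingOnFirstLineMatchers", runBoostingOnFirstLine);
        plan.put("secondLineMatchers", runSecondLine);
        // Boosting consumes the output of second-line matching, so both flags must be set.
        plan.put("simMatrixBoostingOnSecondLineMatchers",
                runSecondLine && runBoostingOnSecondLine);
        return plan;
    }

    public static void main(String[] args) {
        // Second-line matching disabled: its boosting step cannot run either.
        System.out.println(effectiveSteps(true, false, true));
    }
}
```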

Datasets

Datasets are configured as a list of dataset configurations separated by a line of three dashes (---). For example, you can add a new dataset by modifying datasets.yaml like this:

---
name: "Efes-bib"
path: "Efes-bib"
---
name: "myDataset"
path: "newData"
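
To make the multi-document layout concrete, here is a minimal sketch that splits such a file into one map per dataset. It is not how the project reads the file (the project uses snakeyaml for that); it uses plain string handling, only understands flat key: "value" lines, and MultiDocSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of the "---"-separated layout; not a YAML parser. */
final class MultiDocSketch {

    /** Splits the text on "---" lines and reads flat key: "value" pairs per document. */
    static List<Map<String, String>> parse(String text) {
        List<Map<String, String>> documents = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.equals("---")) {
                // Separator: close the current document if it has any entries.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new LinkedHashMap<>();
                }
                continue;
            }
            int colon = line.indexOf(':');
            if (line.isEmpty() || line.startsWith("#") || colon < 0) {
                continue; // blank lines, comments, and anything else are ignored here
            }
            String key = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            // Strip one pair of surrounding double quotes, if present.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            current.put(key, value);
        }
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }

    public static void main(String[] args) {
        String yaml = String.join("\n",
                "---", "name: \"Efes-bib\"", "path: \"Efes-bib\"",
                "---", "name: \"myDataset\"", "path: \"newData\"");
        for (Map<String, String> dataset : parse(yaml)) {
            System.out.println(dataset.get("name") + " -> " + dataset.get("path"));
        }
    }
}
```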

Matchers

Matchers are configured as a list of matcher configurations separated by a line of three dashes (---). To configure first-line and second-line matchers, see first_line_matchers.yaml and second_line_matchers.yaml, respectively. You can specify name, packageName, and params.

See for example this configuration:

name: "RandomMatcher"
packageName: "sota"
params:
  seed: [42, 2023]

The class is resolved as matching.<packageName>.<name>, so the matcher class in this example must be matching.sota.RandomMatcher.
params lets us specify either a single value or a list of values for each of the matcher's parameters. In this case, RandomMatcher is instantiated twice: once with seed=42 and once with seed=2023.
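
The expansion of params into instances can be sketched as follows. ParamGrid and expand are hypothetical names, not the project's code. The sketch treats a single value as a one-element list and, as an assumption that the text above does not confirm, combines several list-valued parameters as a Cartesian product.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of the project. */
final class ParamGrid {

    /** Expands a params mapping into one concrete parameter map per instantiation. */
    static List<Map<String, Object>> expand(Map<String, Object> params) {
        List<Map<String, Object>> combos = new ArrayList<>();
        combos.add(new LinkedHashMap<>()); // no parameters: a single instance
        for (Map.Entry<String, Object> entry : params.entrySet()) {
            // A single value is treated like a list with one element.
            List<?> values;
            if (entry.getValue() instanceof List) {
                values = (List<?>) entry.getValue();
            } else {
                values = List.of(entry.getValue());
            }
            // Assumption: multiple list-valued parameters combine as a Cartesian product.
            List<Map<String, Object>> next = new ArrayList<>();
            for (Map<String, Object> partial : combos) {
                for (Object value : values) {
                    Map<String, Object> extended = new LinkedHashMap<>(partial);
                    extended.put(entry.getKey(), value);
                    next.add(extended);
                }
            }
            combos = next;
        }
        return combos;
    }

    public static void main(String[] args) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("seed", List.of(42, 2023));
        System.out.println(expand(params)); // prints [{seed=42}, {seed=2023}]
    }
}
```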

If configuring a TokenizedTablePairMatcher (e.g., matching.similarity.tokenizedlabel.DiceLabelMatcher), we need to configure a list of tokenizers in first_line_tokenizers.yaml. Internally, this is just another parameter for the matcher; yet, by specifying it in a custom file, we only need to configure the list of tokenizers once for all matchers that use tokens.

Tokenizers

Tokenizers are configured as a list of tokenizer configurations separated by a line of three dashes (---). Internally, tokenizers are just another parameter for first-line matchers; yet, by specifying them in a separate file, we only need to configure the list of tokenizers once for all matchers that use tokens.

To configure tokenizers for first-line tokenized matchers, see first_line_tokenizers.yaml. You can specify name and params.

See for example this configuration:

"name": "nGramTokenizer"
"params":
  "n": [2,4]

name is resolved relative to the preprocessing.tokenization package, so the tokenizer class in this example must be preprocessing.tokenization.nGramTokenizer.
params lets us specify either a single value or a list of values for each of the tokenizer's parameters. In this case, nGramTokenizer is instantiated twice: once with n=2 and once with n=4.
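
The naming convention for matchers and tokenizers can be written down as a small sketch. ClassNameSketch and its methods are hypothetical names; the package prefixes are taken from the examples in this document, and resolving via Class.forName is an assumption about how such names could be turned into classes, not a description of the project's loader.

```java
/** Hypothetical helper, not part of the project. */
final class ClassNameSketch {

    /** Matchers: matching.<packageName>.<name>, as in the RandomMatcher example. */
    static String matcherClassName(String packageName, String name) {
        return "matching." + packageName + "." + name;
    }

    /** Tokenizers: preprocessing.tokenization.<name>, as in the nGramTokenizer example. */
    static String tokenizerClassName(String name) {
        return "preprocessing.tokenization." + name;
    }

    /** Looks the class up by name; fails with a readable message if it is not on the classpath. */
    static Class<?> resolve(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            throw new IllegalArgumentException(
                    "No class found for configured name: " + className, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(matcherClassName("sota", "RandomMatcher"));
        System.out.println(matcherClassName("similarity.tokenizedlabel", "DiceLabelMatcher"));
        System.out.println(tokenizerClassName("nGramTokenizer"));
        // resolve(...) only succeeds for these names when run on the project's classpath.
    }
}
```

A misspelled name or packageName in the .yaml file would surface in such a scheme as a class that cannot be found, so the spelling must match the Java class exactly, including case.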

Similarity Matrix Boosting

Note that similarity matrix boosting cannot be configured via .yaml files yet. Instead, edit the block at the beginning of Main.main() that specifies those steps:

SimMatrixBoosting firstLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
SimMatrixBoosting secondLineSimMatrixBoosting = new IdentitySimMatrixBoosting();

Metrics

Evaluation metrics are configured as a list of metric configurations separated by a line of three dashes (---) in metrics.yaml. For now, metrics only have a name parameter which resolves to the class path evaluation.metric.<name>.

All configured metrics are applied in every matching step for which evaluation is enabled, and also at the attribute level if evaluateAttributes is enabled in general.yaml.
