Configuration
Datasets, matchers, tokenizers, evaluation metrics, logging, and general settings can be configured via src/main/resources/*.yaml.
The project uses snakeyaml to load the configuration into Java objects. You can query a config singleton via Configuration.getInstance(). Please see general.yaml for a list of general configuration parameters and their documentation.
To run only specific steps, you can control their execution via these configuration parameters in general.yaml:
# Step 2: run first line matchers (i.e., matchers that use table data to match)
saveOutputFirstLineMatchers: True
evaluateFirstLineMatchers: True
readCacheFirstLineMatchers: False
writeCacheFirstLineMatchers: False
# Step 3: run similarity matrix boosting on the output of first line matchers
runSimMatrixBoostingOnFirstLineMatchers: True
saveOutputSimMatrixBoostingOnFirstLineMatchers: True
evaluateSimMatrixBoostingOnFirstLineMatchers: True
readCacheSimMatrixBoostingOnFirstLineMatchers: False
writeCacheSimMatrixBoostingOnFirstLineMatchers: False
# Step 4: run second line matchers (ensemble matchers and other matchers using output of first line matchers)
runSecondLineMatchers: True
saveOutputSecondLineMatchers: True
evaluateSecondLineMatchers: True
readCacheSecondLineMatchers: False
writeCacheSecondLineMatchers: False
# Step 5: run similarity matrix boosting on the output of second line matchers
runSimMatrixBoostingOnSecondLineMatchers: True
saveOutputSimMatrixBoostingOnSecondLineMatchers: True
evaluateSimMatrixBoostingOnSecondLineMatchers: True
readCacheSimMatrixBoostingOnSecondLineMatchers: False
writeCacheSimMatrixBoostingOnSecondLineMatchers: False
Note that the following global settings in general.yaml further refine the step settings above:
# evaluate performance for each attribute and attribute pair in ground truth
# applies to all matching steps for which evaluation is enabled (see below)
evaluateAttributes: True
# write outputs per table pair
# applies to all matching steps for which output saving is enabled (see below)
# WARNING: greatly increases size of results directory
saveOutputPerTablePair: False
# adds header and index with attribute names to output files
# applies to all matching steps for which output saving is enabled (see below)
saveOutputVerbose: True
Note that all other steps depend on the first-line matching step, and each similarity matrix boosting step depends on the line-matching step whose output it boosts.
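The dependency rule can be read as a small gating function. The following is a minimal sketch, not the project's actual code: the class StepPlan and the step labels are hypothetical, only the flag names are taken from general.yaml, and it assumes that the first-line matchers always run and that a step whose prerequisite is disabled is simply skipped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which matching steps effectively run.
final class StepPlan {
    static List<String> effectiveSteps(boolean runSimMatrixBoostingOnFirstLineMatchers,
                                       boolean runSecondLineMatchers,
                                       boolean runSimMatrixBoostingOnSecondLineMatchers) {
        List<String> steps = new ArrayList<>();
        // Assumption: first-line matchers always run, since every other step depends on them.
        steps.add("firstLineMatchers");
        if (runSimMatrixBoostingOnFirstLineMatchers) {
            steps.add("simMatrixBoostingOnFirstLineMatchers");
        }
        if (runSecondLineMatchers) {
            steps.add("secondLineMatchers");
            // Boosting second-line output requires that second-line matchers ran.
            if (runSimMatrixBoostingOnSecondLineMatchers) {
                steps.add("simMatrixBoostingOnSecondLineMatchers");
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        // Second-line matchers are disabled, so their boosting step is skipped too.
        System.out.println(effectiveSteps(true, false, true));
        // prints [firstLineMatchers, simMatrixBoostingOnFirstLineMatchers]
    }
}
```

Under this reading, enabling runSimMatrixBoostingOnSecondLineMatchers has no effect while runSecondLineMatchers is False.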
Datasets are configured as a list of dataset configurations separated by a line of three dashes (---). For example, you can add a new dataset by modifying datasets.yaml like this:
---
name: "Efes-bib"
path: "Efes-bib"
---
name: "myDataset"
path: "newData"
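To illustrate how a file with --- separators breaks into individual dataset configurations, here is a sketch that uses only the Java standard library. It is not a YAML parser and not how the project loads the file (that is done by snakeyaml); it only handles flat key: "value" lines like the ones above, and the names DatasetDocs and split are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of "---"-separated documents; handles flat key/value lines only.
final class DatasetDocs {
    static List<Map<String, String>> split(String yaml) {
        List<Map<String, String>> docs = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        for (String line : yaml.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.equals("---")) {
                // A separator closes the current document, if it has any entries.
                if (!current.isEmpty()) {
                    docs.add(current);
                    current = new LinkedHashMap<>();
                }
                continue;
            }
            int colon = trimmed.indexOf(':');
            if (trimmed.isEmpty() || trimmed.startsWith("#") || colon < 0) {
                continue; // skip blank lines, comments, and anything that is not key: value
            }
            String key = trimmed.substring(0, colon).trim();
            String value = trimmed.substring(colon + 1).trim().replace("\"", "");
            current.put(key, value);
        }
        if (!current.isEmpty()) {
            docs.add(current); // the last document has no trailing separator
        }
        return docs;
    }

    public static void main(String[] args) {
        String yaml = "---\nname: \"Efes-bib\"\npath: \"Efes-bib\"\n"
                    + "---\nname: \"myDataset\"\npath: \"newData\"\n";
        for (Map<String, String> doc : split(yaml)) {
            System.out.println(doc.get("name") + " -> " + doc.get("path"));
        }
    }
}
```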
Matchers are configured as a list of matcher configurations separated by a line of three dashes (---).
To configure first-line and second-line matchers, see first_line_matchers.yaml and second_line_matchers.yaml, respectively. You can specify name, packageName, and params.
See for example this configuration:
name: "RandomMatcher"
packageName: "sota"
params:
  seed: [42, 2023]
- packageName.name gives us the class path relative to the matching package, so the matcher class needs to be matching.sota.RandomMatcher.
- params allows us to specify a single value or a list of values for each of the matcher's parameters. In this case we instantiate the RandomMatcher twice: once with seed=42 and once with seed=2023.
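The expansion of list-valued params into several matcher instances can be sketched as follows. This is an illustration under assumptions, not the project's implementation: ParamGrid and expand are hypothetical names, and since the example above only shows a single parameter, combining several list-valued parameters as a Cartesian product is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns a params map into one parameter set per instance.
final class ParamGrid {
    static List<Map<String, Object>> expand(Map<String, Object> params) {
        List<Map<String, Object>> combos = new ArrayList<>();
        combos.add(new LinkedHashMap<>()); // start with one empty parameter set
        for (Map.Entry<String, Object> entry : params.entrySet()) {
            // A single value is treated like a one-element list.
            List<?> values = entry.getValue() instanceof List
                    ? (List<?>) entry.getValue()
                    : List.of(entry.getValue());
            List<Map<String, Object>> next = new ArrayList<>();
            for (Map<String, Object> combo : combos) {
                for (Object value : values) {
                    Map<String, Object> extended = new LinkedHashMap<>(combo);
                    extended.put(entry.getKey(), value);
                    next.add(extended);
                }
            }
            combos = next; // assumption: parameters combine as a Cartesian product
        }
        return combos;
    }

    public static void main(String[] args) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("seed", List.of(42, 2023));
        System.out.println(expand(params)); // prints [{seed=42}, {seed=2023}]
    }
}
```

Each map in the result corresponds to one instantiation, so the configuration above yields two RandomMatcher instances.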
If configuring a TokenizedTablePairMatcher (e.g., matching.similarity.tokenizedlabel.DiceLabelMatcher), we need to configure a list of tokenizers in first_line_tokenizers.yaml. Internally, this is just another parameter for the matcher; yet, by specifying it in a custom file, we only need to configure the list of tokenizers once for all matchers that use tokens.
Tokenizers are configured as a list of tokenizer configurations separated by a line of three dashes (---).
To configure tokenizers for first-line tokenized matchers, see first_line_tokenizers.yaml. You can specify name and params.
See for example this configuration:
"name": "nGramTokenizer"
"params":
  "n": [2,4]
- name gives us the class path relative to the preprocessing.tokenization package, so the tokenizer class needs to be preprocessing.tokenization.nGramTokenizer.
- params allows us to specify a single value or a list of values for each of the tokenizer's parameters. In this case we instantiate the nGramTokenizer twice: once with n=2 and once with n=4.
Note that similarity matrix boosting cannot be configured via .yaml files yet.
Instead, a block like this at the beginning of Main.main() specifies those steps:
SimMatrixBoosting firstLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
SimMatrixBoosting secondLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
Evaluation metrics are configured as a list of metric configurations separated by a line of three dashes (---) in metrics.yaml. For now, metrics only have a name parameter, which resolves to the class path evaluation.metric.<name>.
All configured metrics are applied in the configured evaluation steps, as well as at the attribute level if evaluateAttributes is enabled in general.yaml.
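The naming conventions for matchers, tokenizers, and metrics described on this page can be summarized in a short sketch. The package prefixes come from the examples above; the class ClassPaths, its method names, and the reflective lookup via Class.forName are assumptions for illustration and may differ from how the project actually resolves classes.

```java
// Hypothetical helper that builds fully qualified class names from config values.
final class ClassPaths {
    // matching.<packageName>.<name>, e.g. matching.sota.RandomMatcher
    static String matcher(String packageName, String name) {
        return "matching." + packageName + "." + name;
    }

    // preprocessing.tokenization.<name>, e.g. preprocessing.tokenization.nGramTokenizer
    static String tokenizer(String name) {
        return "preprocessing.tokenization." + name;
    }

    // evaluation.metric.<name>
    static String metric(String name) {
        return "evaluation.metric." + name;
    }

    // Assumption: classes are looked up reflectively; returns null if the class is not on the classpath.
    static Class<?> tryLoad(String className) {
        try {
            return Class.forName(className);
        } catch (ClassNotFoundException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(matcher("sota", "RandomMatcher")); // prints matching.sota.RandomMatcher
        System.out.println(tokenizer("nGramTokenizer"));      // prints preprocessing.tokenization.nGramTokenizer
        // Outside the project, the matcher class is not on the classpath, so this prints false.
        System.out.println(tryLoad(matcher("sota", "RandomMatcher")) != null);
    }
}
```

A misspelled name or packageName in a .yaml file would, under this scheme, surface as a class that cannot be found.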