Configuration
Datasets, first-line matchers, logging, and general settings are configured in src/main/resources/*.yaml.
The project uses snakeyaml to load the configuration into Java objects. You can query the configuration singleton via Configuration.getInstance(). Please see general.yaml for a list of general configuration parameters and their documentation.
To run only specific steps, you can control their execution via these configuration parameters in general.yaml:
# Step 2: run first line matchers (i.e., matchers that use table data to match)
saveOutputFirstLineMatchers: True
evaluateFirstLineMatchers: True
# Step 3: run similarity matrix boosting on the output of first line matchers
runSimMatrixBoostingOnFirstLineMatchers: True
saveOutputSimMatrixBoostingOnFirstLineMatchers: True
evaluateSimMatrixBoostingOnFirstLineMatchers: True
# Step 4: run second line matchers (ensemble matchers and other matchers using output of first line matchers)
runSecondLineMatchers: True
saveOutputSecondLineMatchers: True
evaluateSecondLineMatchers: True
# Step 5: run similarity matrix boosting on the output of second line matchers
runSimMatrixBoostingOnSecondLineMatchers: True
saveOutputSimMatrixBoostingOnSecondLineMatchers: True
evaluateSimMatrixBoostingOnSecondLineMatchers: True
We recommend saving outputs only when you actually need them, as they produce a lot of files. If, for example, you are only working on first-line matchers, you might also want to turn off evaluation for the other steps.
Note that all other steps depend on the first-line matching step, and each similarity matrix boosting step depends on its respective line-matching step.
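The dependency between steps 4 and 5 can be checked before a run. The following is a minimal sketch, not code from the project: the class StepDependencyCheck and its method are hypothetical, the flags are passed as a plain map rather than read from the Configuration singleton, and a missing flag is assumed to mean "off". How the project itself reacts to such a combination is not stated here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the project: reports flag combinations in
// which a step is enabled although the step it depends on is disabled.
class StepDependencyCheck {

    static List<String> findProblems(Map<String, Boolean> flags) {
        List<String> problems = new ArrayList<>();
        // Assumption: a flag that is absent from the map counts as false.
        boolean secondLine = flags.getOrDefault("runSecondLineMatchers", false);
        boolean boostSecondLine =
                flags.getOrDefault("runSimMatrixBoostingOnSecondLineMatchers", false);
        // Step 5 boosts the output of step 4, so it needs step 4 to run.
        if (boostSecondLine && !secondLine) {
            problems.add(
                    "runSimMatrixBoostingOnSecondLineMatchers requires runSecondLineMatchers");
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, Boolean> flags = Map.of(
                "runSimMatrixBoostingOnFirstLineMatchers", false,
                "runSecondLineMatchers", false,
                "runSimMatrixBoostingOnSecondLineMatchers", true);
        findProblems(flags).forEach(System.out::println);
    }
}
```

Steps 3 and 4 need no such check in this sketch because they depend only on the first-line matching step, which has no run flag in the listing above.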
For datasets, first-line matchers, and tokenizers, the configuration accepts multiple instances per file, separated by a line of three dashes (---).
For example, you can add a new dataset by modifying datasets.yaml like this:
---
name: "Efes"
path: "Efes"
---
name: "myDataset"
path: "newData"
To configure first-line matchers, see first_line_matchers.yaml. You can specify name, packageName, and params.
See, for example, this configuration:
name: "RandomMatcher"
packageName: "sota"
params:
  seed: [42, 2023]
packageName.name gives us the class name relative to the matching package, so the matcher class in this example needs to be matching.sota.RandomMatcher.
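As an illustration of this naming rule, here is a minimal sketch of how the fully qualified class name could be assembled from the two configuration values. The class MatcherClassName and its resolve method are hypothetical and not taken from the project. The root package "matching" is inferred from the example above.

```java
// Hypothetical helper, not part of the project: builds the fully qualified
// class name of a matcher from its packageName and name configuration values.
class MatcherClassName {

    // Assumed root package, inferred from the example matching.sota.RandomMatcher.
    static final String ROOT_PACKAGE = "matching";

    static String resolve(String packageName, String name) {
        return ROOT_PACKAGE + "." + packageName + "." + name;
    }

    public static void main(String[] args) {
        String className = resolve("sota", "RandomMatcher");
        System.out.println(className); // prints matching.sota.RandomMatcher
        // With the project on the class path, the class could then be loaded
        // reflectively, e.g. via Class.forName(className).
    }
}
```

Under the same assumption, a packageName of similarity.tokenizedlabel and a name of DiceLabelMatcher would resolve to matching.similarity.tokenizedlabel.DiceLabelMatcher.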
params allows us to specify a single value or a list of values for each of the matcher's parameters. In this case, we instantiate the RandomMatcher twice: once with seed=42 and once with seed=2023.
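The expansion of list-valued parameters into separate matcher instances can be sketched as follows. This is not the project's implementation: the class ParamGrid is hypothetical, and the sketch assumes that several list-valued parameters are combined as a Cartesian product, which the text above does not state explicitly.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the project: expands a params map in which
// each value is either a single value or a list of values into one map per
// matcher instance.
class ParamGrid {

    static List<Map<String, Object>> expand(Map<String, Object> params) {
        List<Map<String, Object>> result = new ArrayList<>();
        result.add(new LinkedHashMap<>()); // start with one empty assignment
        for (Map.Entry<String, Object> entry : params.entrySet()) {
            Object raw = entry.getValue();
            // Treat a single value as a one-element list.
            List<?> values = (raw instanceof List) ? (List<?>) raw : List.of(raw);
            List<Map<String, Object>> next = new ArrayList<>();
            for (Map<String, Object> partial : result) {
                for (Object value : values) {
                    // Copy the partial assignment and add this parameter's value.
                    Map<String, Object> extended = new LinkedHashMap<>(partial);
                    extended.put(entry.getKey(), value);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("seed", List.of(42, 2023));
        // Prints {seed=42} and then {seed=2023}: one line per matcher instance.
        for (Map<String, Object> assignment : expand(params)) {
            System.out.println(assignment);
        }
    }
}
```

Each resulting map corresponds to one instantiation, so the RandomMatcher configuration above yields two instances.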
If configuring a TokenizedMatcher (e.g., matching.similarity.tokenizedlabel.DiceLabelMatcher), we need to configure a list of tokenizers in first_line_tokenizers.yaml. Internally, this is just another parameter for the matcher; yet, by specifying it in a separate file, we only need to configure the list of tokenizers once for all matchers that use tokens.
Note that similarity matrix boosting cannot be configured via .yaml files yet.
Instead, there is a block like this at the beginning of Main.main() that specifies those steps:
SimMatrixBoosting firstLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
SimMatrixBoosting secondLineSimMatrixBoosting = new IdentitySimMatrixBoosting();