Configuration

Datasets, first-line matchers, logging, and general settings are configured in src/main/resources/*.yaml. The project uses SnakeYAML to load the configuration into Java objects. You can query the configuration singleton via Configuration.getInstance(). See general.yaml for a list of general configuration parameters and their documentation.

Matching Steps

To run only specific steps, you can control their execution via these configuration parameters in general.yaml:

# Step 2: run first line matchers (i.e., matchers that use table data to match)
saveOutputFirstLineMatchers: True
evaluateFirstLineMatchers: True
# Step 3: run similarity matrix boosting on the output of first line matchers
runSimMatrixBoostingOnFirstLineMatchers: True
saveOutputSimMatrixBoostingOnFirstLineMatchers: True
evaluateSimMatrixBoostingOnFirstLineMatchers: True
# Step 4: run second line matchers (ensemble matchers and other matchers using output of first line matchers)
runSecondLineMatchers: True
saveOutputSecondLineMatchers: True
evaluateSecondLineMatchers: True
# Step 5: run similarity matrix boosting on the output of second line matchers
runSimMatrixBoostingOnSecondLineMatchers: True
saveOutputSimMatrixBoostingOnSecondLineMatchers: True
evaluateSimMatrixBoostingOnSecondLineMatchers: True

We recommend saving outputs only when you actually need them, as they produce a lot of files. If, for example, you are only working on first-line matchers, you might also want to turn off evaluation for the other steps.

Note that all other steps depend on the first-line matching step, and each similarity matrix boosting step depends on its respective line-matching step.
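The dependency rule above can be sketched in a few lines of Java. This is only an illustration: the Step enum and the StepDependencies class are hypothetical names that do not exist in the project, which controls the steps through the flags in general.yaml.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical illustration; none of these names exist in the project.
final class StepDependencies {

    enum Step { FIRST_LINE, BOOST_FIRST_LINE, SECOND_LINE, BOOST_SECOND_LINE }

    // Every step needs first-line matching; each boosting step also needs
    // the line-matching step whose output it boosts.
    static Set<Step> prerequisites(Step step) {
        switch (step) {
            case BOOST_FIRST_LINE:
            case SECOND_LINE:
                return EnumSet.of(Step.FIRST_LINE);
            case BOOST_SECOND_LINE:
                return EnumSet.of(Step.FIRST_LINE, Step.SECOND_LINE);
            default:
                return EnumSet.noneOf(Step.class);
        }
    }

    // Returns the wanted steps together with everything they depend on.
    static Set<Step> required(Set<Step> wanted) {
        Set<Step> all = EnumSet.noneOf(Step.class);
        for (Step step : wanted) {
            all.add(step);
            all.addAll(prerequisites(step));
        }
        return all;
    }

    public static void main(String[] args) {
        // Boosting the second-line output pulls in both line-matching steps.
        System.out.println(required(EnumSet.of(Step.BOOST_SECOND_LINE)));
    }
}
```

In practice this means that when you work on one step, the steps it depends on still have to run, even if you turn off saving and evaluation for them.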

Datasets

The configuration files for datasets, first-line matchers, and tokenizers accept multiple instances, separated by a line of three dashes (---). For example, you can add a new dataset by modifying datasets.yaml like this:

---
name: "Efes"
path: "Efes"
---
name: "myDataset"
path: "newData"
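
To illustrate how the --- separator turns one file into several instances, here is a minimal, dependency-free Java sketch. It is not the project's loader (the project reads these files with SnakeYAML), and it only understands flat key: "value" lines. The class name MultiDocSketch is hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the real loader; handles only flat `key: "value"` lines.
final class MultiDocSketch {

    static List<Map<String, String>> parse(String yaml) {
        List<Map<String, String>> documents = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        for (String line : yaml.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.equals("---")) {
                // Separator: close the current document if it has any entries.
                if (!current.isEmpty()) {
                    documents.add(current);
                    current = new LinkedHashMap<>();
                }
                continue;
            }
            int colon = trimmed.indexOf(':');
            if (trimmed.isEmpty() || trimmed.startsWith("#") || colon < 0) {
                continue; // skip blanks, comments, and lines without a key
            }
            String key = trimmed.substring(0, colon).trim();
            String value = trimmed.substring(colon + 1).trim().replace("\"", "");
            current.put(key, value);
        }
        if (!current.isEmpty()) {
            documents.add(current);
        }
        return documents;
    }

    public static void main(String[] args) {
        String yaml = String.join("\n",
                "---", "name: \"Efes\"", "path: \"Efes\"",
                "---", "name: \"myDataset\"", "path: \"newData\"");
        // One map per dataset entry in the file.
        for (Map<String, String> dataset : parse(yaml)) {
            System.out.println(dataset.get("name") + " -> " + dataset.get("path"));
        }
    }
}
```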

Matchers

To configure first-line matchers, see first_line_matchers.yaml. You can specify name, packageName, and params. See for example this configuration:

name: "RandomMatcher"
packageName: "sota"
params:
  seed: [42, 2023]

Together, packageName and name determine the class to load, so the matcher class here needs to be matching.sota.RandomMatcher. params lets us specify a single value or a list of values for each of the matcher's parameters. In this case, RandomMatcher is instantiated twice: once with seed=42 and once with seed=2023.
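
The following Java sketch shows one way such a configuration could be expanded into concrete matcher instances. It is a hedged illustration, not the project's code: the ParamGrid class and its methods are hypothetical, the "matching." prefix is taken from the example above, and combining several list-valued parameters as a Cartesian product is an assumption.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the project.
final class ParamGrid {

    // Assumption: the root package "matching" is prepended, as in the example.
    static String qualifiedName(String packageName, String name) {
        return "matching." + packageName + "." + name;
    }

    // Assumption: a single value counts as a one-element list, and several
    // list-valued parameters are combined as a Cartesian product.
    static List<Map<String, Object>> expand(Map<String, Object> params) {
        List<Map<String, Object>> result = new ArrayList<>();
        result.add(new LinkedHashMap<>());
        for (Map.Entry<String, Object> entry : params.entrySet()) {
            List<?> values = entry.getValue() instanceof List
                    ? (List<?>) entry.getValue()
                    : List.of(entry.getValue());
            List<Map<String, Object>> next = new ArrayList<>();
            for (Map<String, Object> partial : result) {
                for (Object value : values) {
                    // Copy the partial assignment and add one more parameter.
                    Map<String, Object> extended = new LinkedHashMap<>(partial);
                    extended.put(entry.getKey(), value);
                    next.add(extended);
                }
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> params = new LinkedHashMap<>();
        params.put("seed", List.of(42, 2023));
        // One line per matcher instance that would be created.
        for (Map<String, Object> assignment : expand(params)) {
            System.out.println(qualifiedName("sota", "RandomMatcher") + " with " + assignment);
        }
    }
}
```

Under these assumptions, adding a second parameter with three values would yield six instances, so list-valued parameters are best kept short.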

When configuring a TokenizedMatcher (e.g., matching.similarity.tokenizedlabel.DiceLabelMatcher), we also need to configure a list of tokenizers in first_line_tokenizers.yaml. Internally, this is just another parameter for the matcher. By specifying it in a separate file, however, we only need to configure the list of tokenizers once for all matchers that use tokens.

TODO

Note that similarity matrix boosting cannot be configured via .yaml files yet. Instead, a block like this at the beginning of Main.main() specifies those steps:

SimMatrixBoosting firstLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
SimMatrixBoosting secondLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
