Home
When you work on a particular step (e.g., a first-line matcher or boosting first-line similarity matrices), it helps to tailor the configuration to your use case, as this can greatly reduce runtime (see Configuration).
When reading many CSV files, a CSV plugin for your IDE can be very helpful (e.g., CSV Editor in IntelliJ).
This project requires Java JDK 17 and Maven >=3.9.2. We suggest developing and running the project in IntelliJ IDEA. For setup, IntelliJ should guide you to install the correct Java JDK and download the Maven dependencies for you. In case of problems with Maven dependencies, try to reload the project.
You can run the project out of the box, as it comes with data and default configurations. For a successful run, you should see this log line at the end of your console:
[INFO ] <timestamp> [main] de.uni_marburg.schematch.Main - Ending Schematch
This project focuses on determining correspondences between attribute pairs of two different database schemata.
To do so, it requires two databases, each represented as a collection of `.csv` files (each file represents one table of its database). We call this problem a (matching) scenario, and Schematch creates a `MatchTask` for every such scenario.
We require ground truth attribute correspondences to evaluate the matching output of Schematch.
Given two example tables, `students(id,name,subject)` and `studierende(sid,full_name)`, our ground truth attribute correspondences could be `(students.id,studierende.sid)` and `(students.name,studierende.full_name)`.
In Schematch this information is represented as a ground truth matrix:
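| | studierende.sid | studierende.full_name |
|---|---|---|
| students.id | 1 | 0 |
| students.name | 0 | 1 |
| students.subject | 0 | 0 |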
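
To make the mapping from correspondences to matrix cells concrete, here is a minimal sketch that builds the ground truth matrix above from the two example correspondences. `GroundTruthSketch` and its `build` method are hypothetical names used only for illustration; they are not part of Schematch, and Schematch's own representation may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Schematch: builds a 0/1 ground truth matrix.
final class GroundTruthSketch {

    // Rows follow the source attributes, columns follow the target attributes.
    static int[][] build(List<String> source, List<String> target, List<String[]> correspondences) {
        int[][] matrix = new int[source.size()][target.size()];
        for (String[] pair : correspondences) {
            int row = source.indexOf(pair[0]);
            int col = target.indexOf(pair[1]);
            if (row < 0 || col < 0) {
                throw new IllegalArgumentException(
                        "Unknown attribute in correspondence: " + pair[0] + " -> " + pair[1]);
            }
            matrix[row][col] = 1; // mark the pair as a true correspondence
        }
        return matrix;
    }

    public static void main(String[] args) {
        int[][] matrix = build(
                List.of("students.id", "students.name", "students.subject"),
                List.of("studierende.sid", "studierende.full_name"),
                List.of(new String[] {"students.id", "studierende.sid"},
                        new String[] {"students.name", "studierende.full_name"}));
        for (int[] row : matrix) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each row belongs to an attribute of the source table and each column to an attribute of the target table, so `students.subject` yields a row of zeros because it has no counterpart.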
The output of Schematch is a similarity matrix that tries to approximate the ground truth matrix as closely as possible, such as:
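| | studierende.sid | studierende.full_name |
|---|---|---|
| students.id | 0.9 | 0.1 |
| students.name | 0.03 | 0.94 |
| students.subject | 0.02 | 0.04 |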
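
One simple way to see how close such a similarity matrix is to the ground truth is to binarize it with a threshold and compare the cells. The sketch below does this for the example values. `ThresholdSketch` and the threshold of 0.5 are assumptions chosen for illustration, not necessarily how Schematch evaluates its output.

```java
// Hypothetical helper, not part of Schematch: turns similarity scores into 0/1 decisions.
final class ThresholdSketch {

    // A cell becomes 1 if its similarity is at least the threshold, otherwise 0.
    static int[][] binarize(double[][] similarity, double threshold) {
        int[][] result = new int[similarity.length][];
        for (int i = 0; i < similarity.length; i++) {
            result[i] = new int[similarity[i].length];
            for (int j = 0; j < similarity[i].length; j++) {
                result[i][j] = similarity[i][j] >= threshold ? 1 : 0;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] similarity = {
            {0.9, 0.1},
            {0.03, 0.94},
            {0.02, 0.04}
        };
        for (int[] row : binarize(similarity, 0.5)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

With the example values and a threshold of 0.5, the binarized matrix has ones exactly where the example ground truth matrix does.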
At the moment, Schematch performs five steps sequentially to produce and improve these similarity matrices:
- Table Pair Generation: As a preprocessing step, candidate table pairs are generated. Schematch produces similarity matrices only for these table pairs.
  Note: At the moment, we require `NaiveTablePairsGenerator` as the table pair generator, i.e., all possible combinations of source and target tables.
- First-Line Matching: A selection of first-line schema matchers is applied to the candidate table pairs. First-line matchers operate on the input data and do not require any other matcher's output. Each matcher outputs a similarity matrix for the entire matching scenario.
- Similarity Matrix Boosting: The output of the first-line matchers is improved using metadata (e.g., data dependencies).
- Second-Line Matching: A selection of second-line schema matchers is applied to the candidate table pairs. Second-line matchers use the improved output of the previous step. A common second-line matcher is an ensemble matcher, i.e., a matcher combining different first-line matchers to produce a new similarity matrix.
- Similarity Matrix Boosting: Finally, the output of the second-line matchers is improved using metadata.
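
The naive table pair generation from the first step amounts to a cross product of source and target tables. The following sketch illustrates the idea with plain table names. `NaiveTablePairsSketch` and its `TablePair` record are hypothetical stand-ins, not the Schematch classes of similar names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of naive table pair generation, not Schematch code.
final class NaiveTablePairsSketch {

    // Stand-in for a table pair; here a pair only carries the two table names.
    record TablePair(String sourceTable, String targetTable) {}

    // Combines every source table with every target table.
    static List<TablePair> generate(List<String> sourceTables, List<String> targetTables) {
        List<TablePair> pairs = new ArrayList<>();
        for (String source : sourceTables) {
            for (String target : targetTables) {
                pairs.add(new TablePair(source, target));
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Table names other than students and studierende are made up for this example.
        generate(List.of("students", "courses"), List.of("studierende", "kurse"))
                .forEach(System.out::println);
    }
}
```

For m source tables and n target tables this produces m * n candidate pairs, which is one reason why restricting the configuration to your use case can reduce runtime.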
For each scenario, Schematch creates an instance of `MatchTask`. This instance holds all information about the current match process, such as the similarity matrices produced by matchers and similarity matrix boosting.
Data is stored in a hierarchical fashion in classes of the `data` package:
- A `Dataset` consists of one or more instances of `Scenario`
- A `Scenario` consists of two instances of `Database` (a source and a target database)
- A `Database` consists of one or more instances of `Table`
- A `Table` consists of one or more instances of `Column`
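
The containment hierarchy above can be pictured with a few Java records. This is only a sketch: the record components (`name`, `values`, and so on) are assumptions made for illustration, and the actual classes in the `data` package may have different fields and methods.

```java
import java.util.List;

// Hypothetical mirror of the data hierarchy, not the actual Schematch classes.
final class DataHierarchySketch {

    record Column(String name, List<String> values) {}
    record Table(String name, List<Column> columns) {}
    record Database(String name, List<Table> tables) {}
    record Scenario(String name, Database source, Database target) {}
    record Dataset(String name, List<Scenario> scenarios) {}

    // Walks the hierarchy downwards: database -> tables -> columns.
    static int countColumns(Database database) {
        return database.tables().stream()
                .mapToInt(table -> table.columns().size())
                .sum();
    }

    public static void main(String[] args) {
        Table students = new Table("students", List.of(
                new Column("id", List.of("1", "2")),
                new Column("name", List.of("Ada", "Alan")),
                new Column("subject", List.of("Math", "CS"))));
        Database source = new Database("source", List.of(students));
        System.out.println(countColumns(source));
    }
}
```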
Initially, those data objects only hold what they read from the input files (e.g., table names, column names, schema instance data). Whenever a matcher requires additional (meta)data, that information is added to the data objects on demand and cached for later use by other matchers. Examples are column data types, column value tokens, and multi-column data dependencies.
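
The on-demand computation and caching described above follows the usual lazy initialization pattern. Here is a minimal sketch for column value tokens. `CachedColumnSketch`, `getTokens`, and the whitespace tokenization are assumptions for illustration and do not reflect how Schematch tokenizes or stores its metadata.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical column that computes its tokens on first use and caches them.
final class CachedColumnSketch {

    private final List<String> values;
    private List<String> tokens;   // stays null until a matcher first asks for tokens
    int tokenizeCalls = 0;         // only here to show that tokenization runs once

    CachedColumnSketch(List<String> values) {
        this.values = List.copyOf(values);
    }

    // Splits every value on whitespace; the result is reused by later callers.
    List<String> getTokens() {
        if (tokens == null) {
            tokenizeCalls++;
            tokens = values.stream()
                    .flatMap(value -> Arrays.stream(value.trim().split("\\s+")))
                    .filter(token -> !token.isEmpty())
                    .toList();
        }
        return tokens;
    }
}
```

The first matcher that calls `getTokens()` pays for the tokenization, and every later matcher gets the cached list.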
Every matcher needs to extend `matching.Matcher`. It can do so directly by implementing `match(MatchTask, MatchingStep)`, or it can extend one of the subclasses `TablePairMatcher` or `TokenizedTablePairMatcher`. These subclasses require implementing `match(TablePair)` and call this method for each table pair to match an entire match task.
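
The relationship between a task-level `match` and a per-pair `match(TablePair)` can be sketched with simplified stand-in classes. Everything in the block below is hypothetical: the nested `Matcher`, `TablePairMatcher`, and `TablePair` types only imitate the structure described above, their signatures and return types are assumptions, and `EqualNameMatcher` is a made-up example matcher. Check the real classes in the `matching` package before implementing your own matcher.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins that imitate the structure only; not the Schematch API.
final class MatcherSketch {

    // Stand-in table pair that only knows the column names of both tables.
    record TablePair(List<String> sourceColumns, List<String> targetColumns) {}

    // Stand-in base class: matches a whole task (here simply a list of table pairs).
    abstract static class Matcher {
        abstract Map<TablePair, double[][]> match(List<TablePair> task);
    }

    // Stand-in subclass: subclasses implement one pair, the loop over all pairs is inherited.
    abstract static class TablePairMatcher extends Matcher {
        abstract double[][] match(TablePair pair);

        @Override
        Map<TablePair, double[][]> match(List<TablePair> task) {
            Map<TablePair, double[][]> result = new LinkedHashMap<>();
            for (TablePair pair : task) {
                result.put(pair, match(pair));
            }
            return result;
        }
    }

    // Made-up example matcher: similarity 1.0 for equal column names (ignoring case), else 0.0.
    static final class EqualNameMatcher extends TablePairMatcher {
        @Override
        double[][] match(TablePair pair) {
            List<String> source = pair.sourceColumns();
            List<String> target = pair.targetColumns();
            double[][] similarity = new double[source.size()][target.size()];
            for (int i = 0; i < source.size(); i++) {
                for (int j = 0; j < target.size(); j++) {
                    similarity[i][j] = source.get(i).equalsIgnoreCase(target.get(j)) ? 1.0 : 0.0;
                }
            }
            return similarity;
        }
    }
}
```

The design choice illustrated here is that a per-pair matcher only has to describe how one table pair is compared, while the iteration over all candidate pairs lives in the shared superclass.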
To add your matcher to the matching process, you need to extend the list of matcher configurations in `src/main/resources/first_line_matchers.yaml` or `src/main/resources/second_line_matchers.yaml` (see Configuration for further information).
Any similarity matrix boosting needs to implement `boosting.SimMatrixBoosting`. At the moment, to test your similarity matrix boosting, you need to find and adjust these lines in `Main.main()`:
SimMatrixBoosting firstLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
SimMatrixBoosting secondLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
See Results and Results#hints.
This project uses log4j for logging. You can add a logger to any class like this (make sure to replace `<Your-Class-Name>`):
final static Logger log = LogManager.getLogger(<Your-Class-Name>.class);
To log a message (to the console and to a log file created in `logs/<timestamp>.log`), write:
log.info("This is an important message everyone should see");
log.debug("This is a debug message, only necessary for checking details while debugging");
log.trace("This is a very specific message, only necessary for checking step-by-step");
log.warn("This is a warning message");
log.error("This is an error message");
You can change the log level in `src/main/resources/log4j2.yaml` in line 19: `level: INFO`.
Set it to `TRACE` to get all log messages produced by Schematch; to `DEBUG` to get all debug, error, warn, and info logs; or to `INFO` to only get error, warn, and info logs.