Home

General Information

When you work on a particular step (e.g., a first-line matcher or first-line similarity matrix boosting), tailor the configuration to your use case, as this can greatly reduce runtime (see Configuration).

If you read a lot of CSV files, a CSV plugin for your IDE (e.g., CSV Editor in IntelliJ) can be really helpful.

Setup

This project requires JDK 17 and Maven >= 3.9.2. We suggest developing and running the project in IntelliJ IDEA. During setup, IntelliJ should guide you through installing the correct JDK and download the Maven dependencies for you. If you run into problems with Maven dependencies, try reloading the project.

You can run the project out of the box, as it comes with data and default configurations. For a successful run, you should see this log line at the end of your console:

[INFO ] <timestamp> [main] de.uni_marburg.schematch.Main - Ending Schematch

Project Overview

This project focuses on determining correspondences between attribute pairs of two different database schemata. To do so, it requires two databases, each represented as a collection of .csv files (each file represents one table of its database). We call this problem a (matching) scenario, and Schematch creates a MatchTask for every such scenario.

We require ground truth attribute correspondences to evaluate the matching output of Schematch. Given two example tables, students(id,name,subject) and studierende(sid,full_name), our ground truth attribute correspondences could be (students.id,studierende.sid) and (students.name,studierende.full_name). In Schematch, this information is represented as a ground truth matrix with one row per attribute of students and one column per attribute of studierende:

1	0
0	1
0	0

The output of Schematch is a similarity matrix that tries to approximate the ground truth matrix as closely as possible, such as:

0.9	0.1
0.03	0.94
0.02	0.04
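To make the relation between the two matrices concrete, here is a minimal sketch that turns a similarity matrix into match decisions with a fixed threshold and compares them against the ground truth. The class name, the method name, and the threshold of 0.5 are assumptions made for illustration; this is not Schematch's own evaluation code (see Results for that).

```java
import java.util.Locale;

// Illustrative sketch only; names and threshold are assumptions, not Schematch's API.
final class ThresholdEvalSketch {

    /** Returns {precision, recall} after treating every score >= threshold as a match. */
    static double[] precisionRecall(double[][] sim, int[][] truth, double threshold) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < sim.length; i++) {
            for (int j = 0; j < sim[i].length; j++) {
                boolean predicted = sim[i][j] >= threshold;
                boolean actual = truth[i][j] == 1;
                if (predicted && actual) tp++;       // correct correspondence
                else if (predicted) fp++;            // predicted, but not in ground truth
                else if (actual) fn++;               // in ground truth, but missed
            }
        }
        double precision = (tp + fp) == 0 ? 0.0 : (double) tp / (tp + fp);
        double recall = (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        // Rows: students(id, name, subject); columns: studierende(sid, full_name).
        int[][] truth = {{1, 0}, {0, 1}, {0, 0}};
        double[][] sim = {{0.9, 0.1}, {0.03, 0.94}, {0.02, 0.04}};
        double[] pr = precisionRecall(sim, truth, 0.5);
        // prints "precision=1.00 recall=1.00"
        System.out.printf(Locale.ROOT, "precision=%.2f recall=%.2f%n", pr[0], pr[1]);
    }
}
```

With a much lower threshold (e.g., 0.05), the score 0.1 would also count as a match and precision would drop to 2/3, which shows why the choice of threshold matters when reading a similarity matrix.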

At the moment, Schematch performs five steps sequentially to produce and improve these similarity matrices:

Table Pair Generation: As a preprocessing step, candidate table pairs are generated. Schematch produces similarity matrices only for these table pairs. Note: At the moment, we require NaiveTablePairsGenerator as the table pair generator, i.e., all possible combinations of source and target tables are candidates.
First-Line Matching: A selection of different first-line schema matchers is applied to the candidate table pairs. First-line matchers operate on the input data and do not require any other matcher's output. Each matcher outputs a similarity matrix for the entire matching scenario.
Similarity Matrix Boosting: The output of the first-line matchers is improved using metadata (e.g., data dependencies).
Second-Line Matching: A selection of different second-line schema matchers is applied to the candidate table pairs. Second-line matchers use the improved output of the previous step. A common second-line matcher is an ensemble matcher, i.e., a matcher that combines the outputs of different first-line matchers to produce a new similarity matrix.
Similarity Matrix Boosting: Finally, the output of second-line matchers is improved using metadata.
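As an illustration of the ensemble idea mentioned in the second-line matching step, the following sketch combines several similarity matrices by taking their element-wise mean. The class and method names are hypothetical, and averaging is only one simple way to combine matrices; Schematch's actual ensemble matchers may work differently.

```java
import java.util.Arrays;

// Illustrative sketch only; not one of Schematch's second-line matchers.
final class AverageEnsembleSketch {

    /** Element-wise mean of equally sized similarity matrices. */
    static double[][] average(double[][]... matrices) {
        if (matrices.length == 0) {
            throw new IllegalArgumentException("need at least one matrix");
        }
        int rows = matrices[0].length;
        int cols = rows == 0 ? 0 : matrices[0][0].length;
        double[][] result = new double[rows][cols];
        for (double[][] m : matrices) {
            if (m.length != rows) {
                throw new IllegalArgumentException("row count differs");
            }
            for (int i = 0; i < rows; i++) {
                if (m[i].length != cols) {
                    throw new IllegalArgumentException("column count differs");
                }
                for (int j = 0; j < cols; j++) {
                    result[i][j] += m[i][j]; // accumulate the sum first
                }
            }
        }
        for (double[] row : result) {
            for (int j = 0; j < cols; j++) {
                row[j] /= matrices.length;   // then divide by the number of matrices
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two made-up first-line outputs for the same table pair.
        double[][] a = {{0.9, 0.1}, {0.2, 0.8}};
        double[][] b = {{0.7, 0.3}, {0.0, 1.0}};
        for (double[] row : average(a, b)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```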

For each scenario, Schematch creates an instance of MatchTask. This instance holds all information about the current match process, such as the similarity matrices produced by matchers and similarity matrix boosting.

Data is stored in a hierarchical fashion in classes of the data package:

A Dataset consists of one or more instances of Scenario
A Scenario consists of two instances of Database (a source and a target database)
A Database consists of one or more instances of Table
A Table consists of one or more instances of Column
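The containment hierarchy above can be pictured with a few Java records. This is a simplified sketch: the field names and the helper methods are assumptions made for illustration, and the real classes in the data package carry more state and behavior.

```java
import java.util.List;

// Simplified stand-ins for the data package; field names are assumptions.
final class DataHierarchySketch {

    record Column(String label, List<String> values) {}
    record Table(String name, List<Column> columns) {}
    record Database(String name, List<Table> tables) {}
    record Scenario(String name, Database source, Database target) {}
    record Dataset(String name, List<Scenario> scenarios) {}

    /** Total number of columns over all tables of a database. */
    static int countColumns(Database db) {
        return db.tables().stream().mapToInt(t -> t.columns().size()).sum();
    }

    /** Builds a toy scenario from the students/studierende example (no instance data). */
    static Scenario example() {
        Table students = new Table("students", List.of(
                new Column("id", List.of()),
                new Column("name", List.of()),
                new Column("subject", List.of())));
        Table studierende = new Table("studierende", List.of(
                new Column("sid", List.of()),
                new Column("full_name", List.of())));
        return new Scenario("toy",
                new Database("source", List.of(students)),
                new Database("target", List.of(studierende)));
    }

    public static void main(String[] args) {
        Dataset dataset = new Dataset("toy-dataset", List.of(example()));
        for (Scenario s : dataset.scenarios()) {
            System.out.println(s.name() + ": " + countColumns(s.source())
                    + " source columns, " + countColumns(s.target()) + " target columns");
        }
    }
}
```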

Initially, those data objects only hold what they read from the input files (e.g., table names, column names, schema instance data). Whenever a matcher requires additional (meta)data, that information is added to the data objects on demand and cached for later use by other matchers. Examples are column data types, column value tokens, and multi-column data dependencies.
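The "compute on demand, then cache" pattern described here could look roughly like the sketch below. The class, its whitespace tokenizer, and the call counter are hypothetical and exist only to demonstrate that the derived data is computed once; Schematch's Column class is structured differently.

```java
import java.util.List;

// Illustrative sketch of lazily computed, cached column metadata; not Schematch's Column.
final class LazyColumnSketch {
    private final String label;
    private final List<String> values;
    private List<List<String>> tokens; // stays null until a matcher asks for tokens
    private int tokenizations = 0;     // only here to show that caching works

    LazyColumnSketch(String label, List<String> values) {
        this.label = label;
        this.values = values;
    }

    String getLabel() {
        return label;
    }

    /** Tokenizes the column values on the first call and reuses the result afterwards. */
    List<List<String>> getTokens() {
        if (tokens == null) {
            tokenizations++;
            tokens = values.stream()
                    .map(v -> List.of(v.trim().split("\\s+"))) // assumed whitespace tokenizer
                    .toList();
        }
        return tokens;
    }

    int tokenizationCount() {
        return tokenizations;
    }

    public static void main(String[] args) {
        LazyColumnSketch column =
                new LazyColumnSketch("full_name", List.of("Ada Lovelace", "Alan Turing"));
        column.getTokens(); // first matcher triggers the computation
        column.getTokens(); // second matcher gets the cached result
        System.out.println(column.getLabel() + " tokenized "
                + column.tokenizationCount() + " time(s)");
    }
}
```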

Getting Started

Adding a new matcher

Any matcher needs to extend matching.Matcher. It can do so either by extending Matcher directly and implementing match(MatchTask, MatchingStep), or by extending one of its subclasses TablePairMatcher or TokenizedTablePairMatcher. These subclasses only require you to implement match(TablePair); they call this method for each table pair in order to match an entire match task.
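The per-table-pair variant can be sketched as follows. To keep the example self-contained, it uses stand-in types instead of Schematch's classes: TablePairStandIn and TablePairMatcherStandIn are hypothetical, and the float[][] return type of match is an assumption. Check the real TablePairMatcher and TablePair classes for the actual signatures before copying anything.

```java
import java.util.Arrays;
import java.util.List;

// Self-contained sketch with stand-in types; the real Schematch classes differ.
final class LabelEqualityMatcherSketch {

    // Hypothetical stand-in for Schematch's TablePair (column labels only).
    record TablePairStandIn(List<String> sourceColumns, List<String> targetColumns) {}

    // Hypothetical stand-in for TablePairMatcher; the return type is assumed.
    abstract static class TablePairMatcherStandIn {
        abstract float[][] match(TablePairStandIn tablePair);
    }

    // A toy matcher: similarity 1 if two column labels are equal ignoring case, else 0.
    static final class LabelEqualityMatcher extends TablePairMatcherStandIn {
        @Override
        float[][] match(TablePairStandIn tablePair) {
            List<String> source = tablePair.sourceColumns();
            List<String> target = tablePair.targetColumns();
            float[][] sim = new float[source.size()][target.size()];
            for (int i = 0; i < source.size(); i++) {
                for (int j = 0; j < target.size(); j++) {
                    sim[i][j] = source.get(i).equalsIgnoreCase(target.get(j)) ? 1.0f : 0.0f;
                }
            }
            return sim;
        }
    }

    public static void main(String[] args) {
        TablePairStandIn pair = new TablePairStandIn(
                List.of("id", "name", "subject"), // made-up source labels
                List.of("ID", "full_name"));      // made-up target labels
        for (float[] row : new LabelEqualityMatcher().match(pair)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```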

To add your matcher to the matching process, you need to extend the list of matcher configurations in src/main/resources/first_line_matchers.yaml or src/main/resources/second_line_matchers.yaml (see Configuration for further information).

Adding a new similarity matrix boosting

Any similarity matrix boosting needs to implement boosting.SimMatrixBoosting. At the moment, to test your similarity matrix boosting, you need to find and adjust these lines in Main.main():

SimMatrixBoosting firstLineSimMatrixBoosting = new IdentitySimMatrixBoosting();
SimMatrixBoosting secondLineSimMatrixBoosting = new IdentitySimMatrixBoosting();

Evaluating Matchers

See Results and Results#hints.

Logging

This project uses Log4j 2 for logging. You can add a logger to any class like this (make sure to replace <Your-Class-Name>):

static final Logger log = LogManager.getLogger(<Your-Class-Name>.class);

To log a message (to console and to a log file created in logs/<timestamp>.log), write:

log.info("This is an important message everyone should see");
log.debug("This is a debug message, only necessary for checking details while debugging");
log.trace("This is a very specific message, only necessary for checking step-by-step");
log.warn("This is a warning message");
log.error("This is an error message");

You can change the log level in src/main/resources/log4j2.yaml in line 19: level: INFO. Set it to TRACE to get all log messages produced by Schematch; to DEBUG to get all debug, error, warn, and info logs; or to INFO to only get error, warn, and info logs.
