Thesis Topic ‐ Developing an algorithm to assess the semantic similarity between two JSON Schemas

JSON is the most common format for data exchange on the web. JSON documents are both human-readable and machine-readable.

Example JSON Document:

{
  "name": "Alex",
  "age": 25
}

JSON Schema is a schema language for specifying what a JSON document must look like in order to be valid. It defines the required and optional structure, as well as constraints and conditions, for a JSON document.

Example JSON Schema:

{
  "type": "object",
  "properties": 
  {
    "name":
    {
      "type": "string"
    },
    "age":
    {
      "type": "integer",
      "minimum": 0
    }
  },
  "required": ["name"]
}
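To illustrate what "valid" means here, the following is a minimal sketch of a validator that covers only the keywords used in the example schema (type, properties, required, minimum). The function name `is_valid` is an assumption made for illustration; a real validator has to support far more of the JSON Schema vocabulary.

```python
def is_valid(doc, schema):
    """Toy validator for the example only: type (object/string/integer),
    properties, required, and minimum. Not a complete implementation."""
    expected = schema.get("type")
    if expected == "object" and not isinstance(doc, dict):
        return False
    if expected == "string" and not isinstance(doc, str):
        return False
    if expected == "integer":
        # bool is a subclass of int in Python but is not a JSON integer
        if isinstance(doc, bool) or not isinstance(doc, int):
            return False
    is_number = isinstance(doc, (int, float)) and not isinstance(doc, bool)
    if "minimum" in schema and is_number and doc < schema["minimum"]:
        return False
    if isinstance(doc, dict):
        # "required" lists mandatory keys
        if any(key not in doc for key in schema.get("required", [])):
            return False
        # "properties" constrains a key only if the key is present
        for key, subschema in schema.get("properties", {}).items():
            if key in doc and not is_valid(doc[key], subschema):
                return False
    return True


person_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name"],
}

print(is_valid({"name": "Alex", "age": 25}, person_schema))  # True
print(is_valid({"age": 25}, person_schema))                  # False: name is required
print(is_valid({"name": "Alex", "age": -1}, person_schema))  # False: age below minimum
```

Note that a document without an age is still valid, because age is optional and its constraints only apply when the property is present.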

For many JSON-Schema-related tasks (e.g., creating JSON Schemas via an LLM or converting schemas from other schema languages to JSON Schema), one major difficulty is evaluating whether the results are good. One approach is to prepare a "ground truth" schema for each task (i.e., the schema that we would expect as the result of the task) and then to assess how similar the resulting schema is to the ground truth schema. For example, if a method for generating schemas with an LLM is developed, an example task could be "Generate a JSON schema from this prompt: the schema is about a Person and models their name (one field) and age. The name is mandatory". The ground truth schema would then be the schema from above. The LLM might return an identical schema, but its response might also differ. For example, it could be the following:

{
  "title": "Person",
  "type": "object",
  "properties": 
  {
    "name":
    {
      "type": "string"
    },
    "age":
    {
      "type": "integer"
    }
  },
  "required": ["name"]
}

The differences here are that the generated schema additionally has a title and that the age property lacks the minimum constraint. A human could now judge the similarity of the two schemas, although even then a clearly structured procedure is needed to avoid personal bias (or at least to apply the same bias to every comparison). To the best of our knowledge, no existing solution automatically assesses the similarity of two JSON Schemas. For comparing two JSON documents, existing work introduces the JSON Edit Distance (JEDI) [1]. For JSON Schema, there is existing work on witness generation [2], which makes it possible to determine the semantic equivalence of schemas. Habib et al. also compare two schemas, although they do not handle all schema keywords [3]. Thanks to Stefan Klessinger for compiling the related work.

A naïve approach would be to compare two JSON Schemas based on their structure and (depending on the task) to ignore purely descriptive annotation keywords, such as title and description. For example, one could iterate over all leaves of the ground truth schema's document tree and check whether the same properties exist in the generated document with the same values. This would identify wrong or missing values in the generated document. Iterating over the generated document in turn could help identify false positives. However, this approach quickly falls apart if we consider the expressiveness of JSON Schema. The following schema is semantically identical to the ground truth schema above but structurally very different because it uses references:

{
  "type": "object",
  "$defs": 
  {
    "nameDefinition":
    {
      "type": "string"
    },
    "ageDefinition":
    {
      "type": "integer",
      "minimum": 0
    }
  },
  "properties": 
  {
    "name":
    {
      "$ref": "#/$defs/nameDefinition"
    },
    "age":
    {
      "$ref": "#/$defs/ageDefinition"
    }
  },
  "required": ["name"]
}
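The effect can be reproduced with a minimal sketch of the naïve leaf comparison described above. The helper names `leaves` and `naive_diff` and the set of ignored annotation keywords are assumptions made for illustration; they do not stem from any existing tool.

```python
def leaves(node, path=()):
    """Yield (path, value) for every leaf of a parsed JSON value."""
    if isinstance(node, dict) and node:
        for key, child in node.items():
            yield from leaves(child, path + (key,))
    elif isinstance(node, list) and node:
        for index, child in enumerate(node):
            yield from leaves(child, path + (index,))
    else:
        yield path, node


ANNOTATIONS = {"title", "description"}  # assumed set of keywords to ignore


def naive_diff(truth, generated, ignore=ANNOTATIONS):
    """Return (missing_or_wrong, unexpected) leaf paths of a purely structural comparison."""
    def relevant(schema):
        return {p: v for p, v in leaves(schema) if not (p and p[-1] in ignore)}

    t, g = relevant(truth), relevant(generated)
    missing_or_wrong = [p for p in t if p not in g or g[p] != t[p]]
    unexpected = [p for p in g if p not in t]
    return missing_or_wrong, unexpected


ground_truth = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name"],
}
llm_response = {
    "title": "Person",
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name"],
}
with_references = {
    "type": "object",
    "$defs": {
        "nameDefinition": {"type": "string"},
        "ageDefinition": {"type": "integer", "minimum": 0},
    },
    "properties": {
        "name": {"$ref": "#/$defs/nameDefinition"},
        "age": {"$ref": "#/$defs/ageDefinition"},
    },
    "required": ["name"],
}

print(naive_diff(ground_truth, llm_response))
missing, unexpected = naive_diff(ground_truth, with_references)
print(len(missing), len(unexpected))
```

For the LLM response, the sketch reports the single missing leaf properties/age/minimum and nothing unexpected. For the reference-based schema, it reports three missing and five unexpected leaves, although both schemas accept exactly the same documents.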

This particular problem could be solved by resolving the references (i.e., moving the content of each definition to the places where it is referenced). However, the problem goes deeper. The following schema is also semantically identical:

{
  "not":
  {
    "not":
    {
      "type": "object",
      "properties": 
      {
        "name":
        {
          "type": "string"
        },
        "age":
        {
          "type": "integer",
          "minimum": 0
        }
      },
      "required": ["name"]
    }
  }
}

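The same holds for the following schema, which handles the cases "age is absent" and "age is present" in two separate anyOf branches:

{
  "title": "Person",
  "anyOf": [
    {
      "type": "object",
      "properties": {
        "name": { "type": "string" }
      },
      "required": ["name"],
      "not": { "required": ["age"] }
    },
    {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "age": { "type": "integer", "minimum": 0 }
      },
      "required": ["name", "age"]
    }
  ]
}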

Hence, the challenge of this thesis is predominantly of a logical nature. To determine semantic similarity, it might be useful to first normalize the JSON Schemas, for example by bringing them into disjunctive normal form as done in [2] (related source code: https://github.com/sdbs-uni-p/JSONSchemaWitnessGeneration). The JEDI paper [1] might also be a useful reference.
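As a rough illustration of what normalization could look like, the following sketch applies two very simple rewrite rules: inlining local references and eliminating double negation. It is not the normalization of [2]. The function names are hypothetical, and the sketch assumes non-recursive, document-local references, a "$ref" that is the only keyword in its object, and no properties that are themselves named "$ref", "$defs", or "not".

```python
def resolve_pointer(root, ref):
    """Follow a document-local JSON Pointer such as '#/$defs/ageDefinition'."""
    if not ref.startswith("#"):
        raise ValueError("only local references are handled in this sketch")
    node = root
    for token in ref[1:].split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")  # JSON Pointer escapes
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def normalize(node, root=None):
    """Inline local $ref targets and rewrite not(not(S)) to S."""
    root = node if root is None else root
    if isinstance(node, list):
        return [normalize(item, root) for item in node]
    if not isinstance(node, dict):
        return node
    if "$ref" in node:
        if len(node) > 1:
            raise NotImplementedError("$ref with sibling keywords is out of scope")
        return normalize(resolve_pointer(root, node["$ref"]), root)
    inner = node.get("not")
    if len(node) == 1 and isinstance(inner, dict) and set(inner) == {"not"}:
        return normalize(inner["not"], root)
    # After inlining, the definitions themselves are no longer needed.
    return {k: normalize(v, root) for k, v in node.items() if k != "$defs"}


ground_truth = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
    },
    "required": ["name"],
}
with_references = {
    "type": "object",
    "$defs": {
        "nameDefinition": {"type": "string"},
        "ageDefinition": {"type": "integer", "minimum": 0},
    },
    "properties": {
        "name": {"$ref": "#/$defs/nameDefinition"},
        "age": {"$ref": "#/$defs/ageDefinition"},
    },
    "required": ["name"],
}
double_negation = {"not": {"not": ground_truth}}

print(normalize(with_references) == ground_truth)  # True
print(normalize(double_negation) == ground_truth)  # True
```

Schemas built from combinators such as anyOf are left untouched by these two rules. Handling them requires reasoning about the logical structure of the schema, which is where a full normalization such as the one in [2] comes in.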

While naïve similarity comparisons are used in current work to evaluate performance on JSON-Schema-related tasks, a new, more sophisticated algorithm for semantic similarity assessment would help make these evaluations much more reliable and trustworthy.

[1] Thomas Hütter, Nikolaus Augsten, Christoph M. Kirsch, Michael J. Carey, Chen Li. 2022. JEDI: These aren't the JSON documents you're looking for.... In Proc. SIGMOD. 1584-1597.
[2] Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, Stefanie Scherzinger. 2022. Witness Generation for JSON Schema. Proc. VLDB Endow. 15(13): 4002-4014.
[3] Andrew Habib, Avraham Shinnar, Martin Hirzel, Michael Pradel. 2021. Finding Data Compatibility Bugs with JSON Subschema Checking. In Proc. ISSTA. 620-632.
