Extending the Schema.org dataset descriptions for AI use cases #284
Replies: 7 comments
-
Thank you for bringing this up as a discussion. I've been curious about AI-ready data and ERDDAP, and what ERDDAP isn't doing to facilitate more AI use. I don't have a significant background in schema.org or how ERDDAP implements it, but I would like to share some relevant information on this topic.

I'm fairly certain ERDDAP uses the schema.org guidance from the ESIP Science On Schema.Org (SOSO) effort (https://github.com/ESIPFed/science-on-schema.org). Maybe, instead of having an ERDDAP-specific implementation using Croissant, this conversation could be brought to the SOSO group to see how it could be applied/recommended in the broader sense? Then ERDDAP could adopt from there?

Can you share the 3rd-party service you created using Python? Is there a GitHub link?

CC @ashepherd or @fils if you have insights on the schema.org part.
-
@MathewBiddle, thanks for the response and sorry for not getting back sooner. The code is not accessible in a public repository. Until we can make our repository public, I can copy the file here. Although it's a bit messy, it should provide an outline of what we built. While re-testing this work, I found that our code works with Python Croissant (mlcroissant) version 1.0.12. As you can see, it's built very much around our ERDDAP instance and uses OceanGliders data as the test case.

croissant-builder.py:

```python
import json
from urllib.request import urlopen

import mlcroissant as mlc
from erddapy import ERDDAP


class erddapCroissant:
    def __init__(self, erddap: ERDDAP):
        self.erddap = erddap
        self.info = self.erddap._get_variables_uncached()

    def buildMetaData(self):
        info = self.info
        if "NC_GLOBAL" in info:
            license = self._dictChecker(info, 'license')
            title = self._dictChecker(info, 'title')
            info_url = self._dictChecker(info, 'infoUrl')
            keywords = self._dictChecker(info, 'keywords')
            summary = self._dictChecker(info, 'summary')
            example_data = self.downloadExample()
            example_data_column = example_data["table"]["columnNames"]
            croissant = mlc.Metadata(
                name=title,
                description=summary,
                url=info_url,
                keywords=keywords,
                license=license,
                distribution=self.buildDistribution(),
                record_sets=self.buildRecordSet(
                    vars=info.keys(),
                    example_columns=example_data_column,
                    example_data=example_data,
                ),
            )
            return croissant

    def downloadExample(self) -> dict:
        print("Download example data")
        example_data = self.erddap.get_download_url()
        # orderByLimit("10") keeps the example download to ten rows
        with urlopen(example_data + '&orderByLimit(%2210%22)') as url:
            data = json.load(url)
        print("example data saved")
        return data

    @staticmethod
    def _dictChecker(dictionary: dict, search_for: str) -> str:
        """Return an NC_GLOBAL attribute, or a blank string if it is missing."""
        try:
            return dictionary['NC_GLOBAL'][search_for]
        except KeyError:
            return " "

    def buildDistribution(self) -> list:
        print("setting up file object(s)")
        file_object = mlc.FileObject(
            id=f"{self.erddap.dataset_id}",
            content_url=f"{self.erddap.get_download_url()}",
            encoding_format="application/json",
            ctx=mlc.Context(is_live_dataset=True),
        )
        return [file_object]

    def buildRecordSet(self, vars: list, example_columns: list, example_data: dict):
        print("Build Record Sets")
        vars_as_list = list(vars)
        vars_as_list = self._drop_Items(
            items=['NC_GLOBAL',
                   'PLATFORM_MODEL',
                   'DEPLOYMENT_LONGITUDE',
                   'DEPLOYMENT_LATITUDE',
                   'PLATFORM_TYPE',
                   ''],
            drop_from=vars_as_list,
        )
        variables = []
        for var in vars_as_list:
            print(f"building for {var}")
            details = self.info[var]
            name_index = example_columns.index(var)
            erddap_datatype = example_data["table"]["columnTypes"][name_index]
            if "time" in var:
                datatype = self._map_dataTypes("time")
            elif "latitude" in var or "longitude" in var or var.endswith("_GPS"):
                datatype = self._map_dataTypes("geo")
            else:
                datatype = self._map_dataTypes(erddap_datatype)
            variables.append(
                mlc.Field(
                    id=f"data/{var}",
                    name=details["long_name"],
                    description="",
                    data_types=datatype,
                    source=mlc.Source(
                        file_object=self.erddap.dataset_id,
                        extract=mlc.Extract(json_path=f"$.table.rows[*][{name_index}]"),
                    ),
                )
            )
        records = mlc.RecordSet(id="data", name="data", examples=example_data, fields=variables)
        return [records]

    @staticmethod
    def _drop_Items(items: list, drop_from: list) -> list:
        for item in items:
            try:
                drop_from.remove(item)
            except ValueError:
                continue
        return drop_from

    @staticmethod
    def _map_dataTypes(erddap_datatype: str) -> mlc.DataType:
        if erddap_datatype == "String":
            return mlc.DataType.TEXT
        elif erddap_datatype == "double":
            return mlc.DataType.FLOAT
        elif erddap_datatype == "int":
            return mlc.DataType.INTEGER
        elif erddap_datatype == "float":
            return mlc.DataType.FLOAT
        elif erddap_datatype == "time":
            return mlc.DataType.DATE
        elif erddap_datatype == 'geo':
            # TODO: Croissant is working on better geo support in GeoCroissant
            return mlc.DataType.TEXT
        return mlc.DataType.TEXT  # fall back to text for unmapped ERDDAP types


def main(datasetId: str):
    print("Lets Start")
    info_set = ERDDAP(server="https://linkedsystems.uk/erddap", protocol="tabledap")
    info_set.response = "json"
    info_set.dataset_id = f"{datasetId}"
    erddapML = erddapCroissant(info_set)
    croissant = erddapML.buildMetaData()
    print("Hello from croissant!")
    print(croissant.issues.report())
    with open(f"{datasetId}_croissant.json", "w") as f:
        content = croissant.to_json()
        f.write(json.dumps(content, indent=2))
        f.write("\n")  # terminate the file with a newline


if __name__ == "__main__":
    main('Ammonite_593_R')
```
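For context on the `json_path=f"$.table.rows[*][{name_index}]"` extraction used above: ERDDAP's tabledap `.json` responses wrap data in a `table` object with parallel `columnNames`, `columnTypes`, and row-major `rows` arrays, so each Croissant field simply picks one column index out of every row. A stdlib-only sketch (the sample payload here is made up, but follows ERDDAP's documented response shape):

```python
import json

# A made-up payload following ERDDAP's tabledap .json response shape.
sample = json.loads("""
{
  "table": {
    "columnNames": ["time", "latitude", "longitude", "TEMP"],
    "columnTypes": ["String", "double", "double", "float"],
    "rows": [
      ["2024-01-01T00:00:00Z", 50.1, -4.2, 12.3],
      ["2024-01-01T01:00:00Z", 50.2, -4.3, 12.1]
    ]
  }
}
""")

def column(data: dict, name: str) -> list:
    """Pure-Python equivalent of the Croissant extract $.table.rows[*][i] for column `name`."""
    i = data["table"]["columnNames"].index(name)
    return [row[i] for row in data["table"]["rows"]]

print(column(sample, "TEMP"))  # -> [12.3, 12.1]
```

This is why the builder above can describe a live dataset without copying it: the JSONPath tells a Croissant-aware consumer how to slice columns out of the same URL ERDDAP already serves.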
-
@thogar-computer @MathewBiddle Just wanted to jump in here. I've been talking with Kevin O'Brien (NOAA) @kevin-obrien about alignment with the SOSO guidance. We are looking to make a SHACL test for this. I have some placeholder work for this at https://github.com/iodepo/odis-in/tree/master/shapeGraphs#erddap (emphasis on "placeholder": it's not testing anything important yet).

@thogar-computer I have also been sitting in on the Geo-Croissant group, which is working on an extension to Croissant to scope in some spatial elements. I'm curious whether some of those elements might get scoped into your code. That would be very interesting.

Both SOSO and Croissant are profiles of schema.org, and there is a fair bit of overlap between the two profiles, though it would be a good exercise to see what that overlap is at the required/recommended levels. I'd love to flesh this out and coordinate some more between all the parties if that is of interest to people!
-
@thogar-computer @fils @MathewBiddle @ashepherd All of this is really interesting, and I hope this discussion as well as the work moves forward, though I have little to offer in the way of expertise. Please keep posting here. I can say one concern we have with moving forward in ERDDAP is not the desired end point, but rather whether there is really an agreed-upon standard for this, or, if not, whether the people here can agree on the best one to try going forward.
-
My quick take is that I do think it makes sense to improve the schema information ERDDAP provides. I haven't had a chance to dig into Croissant or any other proposed changes yet, so I'm not sure exactly what the changes would be.
-
@thogar-computer @fils I have a pull request adding the Croissant schema to ERDDAP if you want to take a look and give feedback: #316
-
Hi all,

This is a great direction; we've also been exploring ways to make more datasets truly AI-ready and easily ingestible by modern ML workflows. The work on extending Schema.org via croissant-ml is a step toward bridging structured web data with ML frameworks like TensorFlow and PyTorch. It's exciting to see how ERDDAP's existing metadata can be leveraged to generate valid croissant.json files that expose not just descriptive metadata, but also concrete access pathways to the data (e.g., download URLs, data type hints, structure definitions, etc.).

We recently prototyped a similar tool as a standalone Python microservice. It uses ERDDAP's catalog and output endpoints to auto-generate Croissant descriptors, making datasets more machine-interoperable by default. Given how extensible ERDDAP already is, this feels like a natural next step to include natively or as a pluggable module, possibly even replacing or complementing the current embedded Schema.org generation.

We'd be happy to collaborate or share feedback based on what we've learned so far.

Best regards,
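A catalog-walking service like the one described above has to enumerate dataset IDs before it can emit one descriptor per dataset. A stdlib-only sketch of that first step, assuming the service reads ERDDAP's `allDatasets` table (the server URL and catalog payload here are invented; a real instance serves this shape at `<server>/tabledap/allDatasets.json?datasetID`):

```python
import json

# Made-up payload following the shape of ERDDAP's allDatasets .json response.
catalog = json.loads("""
{"table": {"columnNames": ["datasetID"], "columnTypes": ["String"],
           "rows": [["allDatasets"], ["Ammonite_593_R"], ["another_glider"]]}}
""")

SERVER = "https://example.org/erddap"  # hypothetical ERDDAP instance

def croissant_targets(cat: dict, server: str) -> list:
    """List the tabledap .json download URLs a Croissant-generating service would walk."""
    i = cat["table"]["columnNames"].index("datasetID")
    # allDatasets lists itself as a dataset; skip that meta entry.
    ids = [row[i] for row in cat["table"]["rows"] if row[i] != "allDatasets"]
    return [f"{server}/tabledap/{dataset_id}.json" for dataset_id in ids]

print(croissant_targets(catalog, SERVER))
```

From each of these URLs, a service could then apply the per-dataset builder logic discussed earlier in the thread to produce one croissant.json per dataset.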
-
ERDDAP can produce Schema.org descriptions of the datasets it hosts. These descriptions follow the Schema.org ontologies, which allow machines (and search engines) to understand the dataset's content.

With the rise of AI and the need for more data to be AI-ready, there has been a move to extend Schema.org to give more information about the data a Schema.org Dataset describes. This effort can be seen in the croissant-ml schema.

This extension allows AI tools such as TensorFlow and PyTorch to interact with datasets directly. It is accomplished by Croissant detailing how to access the dataset in question, describing the data types and locations.

During a recent project, we looked at using the information ERDDAP already holds to generate a Croissant file. Given limited resources, we opted to build this as a 3rd-party service using Python. Given that ERDDAP can ingest data in many formats and output it in different formats, this allowed us to follow the Croissant example of accessing data via URLs as JSON.

I am raising this here as it makes sense for something like this to be a core feature of ERDDAP, potentially replacing the embedded Schema.org output. Valid Croissant is valid Schema.org.
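For readers unfamiliar with what such a file adds on top of plain Schema.org, here is a heavily trimmed sketch of a Croissant description for an ERDDAP-style dataset. The property names loosely follow the Croissant vocabulary (consult the Croissant specification for the exact `@context` and dataType terms), and the dataset name, URLs, and values are all invented for illustration:

```python
import json

# Invented example; a real Croissant file carries the full @context from the spec.
croissant = {
    "@context": {"@vocab": "https://schema.org/",
                 "cr": "http://mlcommons.org/croissant/"},
    "@type": "Dataset",
    "name": "example_glider_dataset",
    "description": "Example OceanGliders-style dataset served by ERDDAP.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # distribution: where and in what format the live data can be fetched
    "distribution": [{
        "@type": "cr:FileObject",
        "@id": "example_glider_dataset",
        "contentUrl": "https://example.org/erddap/tabledap/example_glider_dataset.json",
        "encodingFormat": "application/json",
    }],
    # recordSet: the column-level structure an ML loader needs
    "recordSet": [{
        "@type": "cr:RecordSet",
        "@id": "data",
        "field": [{
            "@type": "cr:Field",
            "@id": "data/TEMP",
            "dataType": "sc:Float",
            "source": {
                "fileObject": {"@id": "example_glider_dataset"},
                "extract": {"jsonPath": "$.table.rows[*][3]"},
            },
        }],
    }],
}

print(json.dumps(croissant, indent=2))
```

The `distribution` block carries the concrete access pathway, while each `field` pairs a data type hint with an extraction rule, which is exactly the information a plain Schema.org Dataset description lacks.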