Extending the Schema.org dataset descriptions for AI use cases #284
Replies: 7 comments
-
Thank you for bringing this up as a discussion. I've been curious about AI-ready data and ERDDAP, and what ERDDAP isn't doing to facilitate more AI use. I don't have a significant background in schema.org or how ERDDAP implements it, but I would like to share some relevant information on this topic.

I'm fairly certain ERDDAP uses the schema.org guidance from the ESIP Science On Schema.Org (SOSO) effort (https://github.com/ESIPFed/science-on-schema.org). Maybe, instead of having an ERDDAP-specific implementation using Croissant, this conversation could be brought to the SOSO group to see how it could be applied/recommended in the broader sense? Then ERDDAP could adopt from there?

Can you share the 3rd-party service you created using Python? Is there a GitHub link?

CC @ashepherd or @fils if you have insights on the schema.org part.
-
@MathewBiddle, thanks for the response and sorry for not getting back sooner. The code is not accessible in a public repository. Until we can make our repository public, I can copy the file here. Although it's a bit messy, it should provide an outline of what we built. While re-testing this work, I found that our code works with Python Croissant (mlcroissant) version 1.0.12. As you can see, it's built very much around our ERDDAP instance and uses OceanGliders data as the test case.

croissant-builder.py:

```python
import json
from urllib.request import urlopen

import mlcroissant as mlc
from erddapy import ERDDAP


class erddapCroissant:
    def __init__(self, erddap: ERDDAP):
        self.erddap = erddap
        self.info = self.erddap._get_variables_uncached()

    def buildMetaData(self):
        info = self.info
        if "NC_GLOBAL" in info:
            license = self._dictChecker(info, 'license')
            title = self._dictChecker(info, 'title')
            info_url = self._dictChecker(info, 'infoUrl')
            keywords = self._dictChecker(info, 'keywords')
            summary = self._dictChecker(info, 'summary')
            example_data = self.downloadExample()
            example_data_column = example_data["table"]["columnNames"]
            croissant = mlc.Metadata(
                name=title,
                description=summary,
                url=info_url,
                keywords=keywords,
                license=license,
                distribution=self.buildDistribution(),
                record_sets=self.buildRecordSet(
                    vars=info.keys(),
                    example_columns=example_data_column,
                    example_data=example_data,
                ),
            )
            return croissant

    def downloadExample(self) -> dict:
        print("Download example data")
        example_data = self.erddap.get_download_url()
        # orderByLimit("10") keeps the example download to ten rows
        with urlopen(example_data + '&orderByLimit(%2210%22)') as url:
            data = json.load(url)
        print("example data saved")
        return data

    @staticmethod
    def _dictChecker(dictionary: dict, search_for: str) -> str:
        """Return an NC_GLOBAL attribute, or a blank string if it is missing."""
        try:
            return dictionary['NC_GLOBAL'][search_for]
        except KeyError:
            return " "

    def buildDistribution(self) -> list:
        print("setting up file object(s)")
        file_object = mlc.FileObject(
            id=f"{self.erddap.dataset_id}",
            content_url=f"{self.erddap.get_download_url()}",
            encoding_format="application/json",
            ctx=mlc.Context(is_live_dataset=True),
        )
        return [file_object]

    def buildRecordSet(self, vars: list, example_columns: list, example_data: dict):
        print("Build Record Sets")
        vars_as_list = list(vars)
        vars_as_list = self._drop_Items(
            items=['NC_GLOBAL',
                   'PLATFORM_MODEL',
                   'DEPLOYMENT_LONGITUDE',
                   'DEPLOYMENT_LATITUDE',
                   'PLATFORM_TYPE',
                   ''],
            drop_from=vars_as_list,
        )
        variables = []
        for var in vars_as_list:
            print(f"building for {var}")
            details = self.info[var]
            name_index = example_columns.index(var)
            erddap_datatype = example_data["table"]["columnTypes"][name_index]
            if "time" in var:
                datatype = self._map_dataTypes("time")
            elif "latitude" in var or "longitude" in var or var.endswith("_GPS"):
                datatype = self._map_dataTypes("geo")
            else:
                datatype = self._map_dataTypes(erddap_datatype)
            variables.append(
                mlc.Field(
                    id=f"data/{var}",
                    name=details["long_name"],
                    description="",
                    data_types=datatype,
                    source=mlc.Source(
                        file_object=self.erddap.dataset_id,
                        extract=mlc.Extract(json_path=f"$.table.rows[*][{name_index}]"),
                    ),
                )
            )
        records = mlc.RecordSet(id="data", name="data", examples=example_data, fields=variables)
        return [records]

    @staticmethod
    def _drop_Items(items: list, drop_from: list) -> list:
        for item in items:
            try:
                drop_from.remove(item)
            except ValueError:
                continue
        return drop_from

    @staticmethod
    def _map_dataTypes(erddap_datatype: str) -> mlc.DataType:
        if erddap_datatype == "String":
            return mlc.DataType.TEXT
        elif erddap_datatype == "double":
            return mlc.DataType.FLOAT
        elif erddap_datatype == "int":
            return mlc.DataType.INTEGER
        elif erddap_datatype == "float":
            return mlc.DataType.FLOAT
        elif erddap_datatype == "time":
            return mlc.DataType.DATE
        elif erddap_datatype == 'geo':
            # TODO: Croissant is working on better geo support in GeoCroissant
            return mlc.DataType.TEXT
        return mlc.DataType.TEXT  # fall back to text for unmapped ERDDAP types


def main(datasetId: str):
    print("Lets Start")
    info_set = ERDDAP(server="https://linkedsystems.uk/erddap", protocol="tabledap")
    info_set.response = "json"
    info_set.dataset_id = f"{datasetId}"
    erddapML = erddapCroissant(info_set)
    croissant = erddapML.buildMetaData()
    print("Hello from croissant!")
    print(croissant.issues.report())
    with open(f"{datasetId}_croissant.json", "w") as f:
        content = croissant.to_json()
        f.write(json.dumps(content, indent=2))
        f.write("\n")  # terminate the file with a newline


if __name__ == "__main__":
    main('Ammonite_593_R')
```
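For context on the `json_path=f"$.table.rows[*][{name_index}]"` extraction used above: ERDDAP's tabledap `.json` responses wrap data in a `table` object with parallel `columnNames`, `columnTypes`, and row-major `rows` arrays, so each Croissant field simply picks one column index out of every row. A stdlib-only sketch (the sample payload here is made up, but follows ERDDAP's documented response shape):

```python
import json

# A made-up payload following ERDDAP's tabledap .json response shape.
sample = json.loads("""
{
  "table": {
    "columnNames": ["time", "latitude", "longitude", "TEMP"],
    "columnTypes": ["String", "double", "double", "float"],
    "rows": [
      ["2024-01-01T00:00:00Z", 50.1, -4.2, 12.3],
      ["2024-01-01T01:00:00Z", 50.2, -4.3, 12.1]
    ]
  }
}
""")

def column(data: dict, name: str) -> list:
    """Pure-Python equivalent of the Croissant extract $.table.rows[*][i] for column `name`."""
    i = data["table"]["columnNames"].index(name)
    return [row[i] for row in data["table"]["rows"]]

print(column(sample, "TEMP"))  # -> [12.3, 12.1]
```

This is why the builder above can describe a live dataset without copying it: the JSONPath tells a Croissant-aware consumer how to slice columns out of the same URL ERDDAP already serves.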
-
@thogar-computer @MathewBiddle Just wanted to jump in here. I've been talking with Kevin O'Brien (NOAA) @kevin-obrien about alignment with the SOSO guidance. We are looking to make a SHACL test for this. I have some placeholder work for this at https://github.com/iodepo/odis-in/tree/master/shapeGraphs#erddap (emphasis on "placeholder": it's not testing anything important yet).

@thogar-computer I have also been sitting in on the Geo-Croissant group, which is working on an extension to Croissant to scope in some spatial elements. I'm curious whether some of those elements might get scoped into your code. That would be very interesting.

Both SOSO and Croissant are profiles of schema.org, and there is a fair bit of overlap between the two profiles, though it would be a good exercise to see what that overlap is at the required/recommended levels. I'd love to flesh this out and coordinate some more between all the parties if that is of interest to people!
-
@thogar-computer @fils @MathewBiddle @ashepherd All of this is really interesting, and I hope this discussion as well as the work moves forward, though I have little to offer in the way of expertise. Please keep posting here. I can say one concern we have with moving forward in ERDDAP is not the desired end point, but rather whether there is really an agreed-upon standard for this, or, if not, whether the people here can agree on the best one to try going forward.
-
My quick take is that I do think it makes sense to improve the schema information ERDDAP provides. I haven't had a chance to dig into Croissant or any other proposed changes yet, so I'm not sure exactly what the changes would be.
-
@thogar-computer @fils I have a pull request adding the Croissant schema to ERDDAP if you want to take a look and give feedback: #316
-
Hi all,

This is a great direction; we've also been exploring ways to make more datasets truly AI-ready and easily ingestible by modern ML workflows. The work on extending Schema.org via croissant-ml is a step toward bridging structured web data with ML frameworks like TensorFlow and PyTorch. It's exciting to see how ERDDAP's existing metadata can be leveraged to generate valid croissant.json files that expose not just descriptive metadata, but also concrete access pathways to the data (e.g., download URLs, data type hints, structure definitions, etc.).

We recently prototyped a similar tool as a standalone Python microservice. It uses ERDDAP's catalog and output endpoints to auto-generate Croissant descriptors, making datasets more machine-interoperable by default. Given how extensible ERDDAP already is, this feels like a natural next step to include natively or as a pluggable module, possibly even replacing or complementing the current embedded Schema.org generation.

We'd be happy to collaborate or share feedback based on what we've learned so far.

Best regards,
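A catalog-walking service like the one described above has to enumerate dataset IDs before it can emit one descriptor per dataset. A stdlib-only sketch of that first step, assuming the service reads ERDDAP's `allDatasets` table (the server URL and catalog payload here are invented; a real instance serves this shape at `<server>/tabledap/allDatasets.json?datasetID`):

```python
import json

# Made-up payload following the shape of ERDDAP's allDatasets .json response.
catalog = json.loads("""
{"table": {"columnNames": ["datasetID"], "columnTypes": ["String"],
           "rows": [["allDatasets"], ["Ammonite_593_R"], ["another_glider"]]}}
""")

SERVER = "https://example.org/erddap"  # hypothetical ERDDAP instance

def croissant_targets(cat: dict, server: str) -> list:
    """List the tabledap .json download URLs a Croissant-generating service would walk."""
    i = cat["table"]["columnNames"].index("datasetID")
    # allDatasets lists itself as a dataset; skip that meta entry.
    ids = [row[i] for row in cat["table"]["rows"] if row[i] != "allDatasets"]
    return [f"{server}/tabledap/{dataset_id}.json" for dataset_id in ids]

print(croissant_targets(catalog, SERVER))
```

From each of these URLs, a service could then apply the per-dataset builder logic discussed earlier in the thread to produce one croissant.json per dataset.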
-
ERDDAP can produce Schema.org descriptions of the datasets it hosts. These descriptions follow the Schema.org ontologies, which allow machines (and search engines) to understand the dataset's content.

With the rise of AI and the need for more data to be AI-ready, there has been a move to extend Schema.org to give more information about the data a Schema.org Dataset describes. This effort can be seen in the croissant-ml schema.

This extension allows AI tools such as TensorFlow and PyTorch to interact with datasets directly. It is accomplished by Croissant detailing how to access the dataset in question, describing the data types and locations.

During a recent project, we looked at using the information ERDDAP already holds to generate a Croissant file. Given limited resources, we opted to build this as a 3rd-party service using Python. Given that ERDDAP can ingest data in many formats and output it in different formats, this allowed us to follow the Croissant example of accessing data via URLs as JSON.

I am raising this here as it makes sense for something like this to be a core feature of ERDDAP, potentially replacing the embedded Schema.org output. Valid Croissant is valid Schema.org.
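For readers unfamiliar with what such a file adds on top of plain Schema.org, here is a heavily trimmed sketch of a Croissant description for an ERDDAP-style dataset. The property names loosely follow the Croissant vocabulary (consult the Croissant specification for the exact `@context` and dataType terms), and the dataset name, URLs, and values are all invented for illustration:

```python
import json

# Invented example; a real Croissant file carries the full @context from the spec.
croissant = {
    "@context": {"@vocab": "https://schema.org/",
                 "cr": "http://mlcommons.org/croissant/"},
    "@type": "Dataset",
    "name": "example_glider_dataset",
    "description": "Example OceanGliders-style dataset served by ERDDAP.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    # distribution: where and in what format the live data can be fetched
    "distribution": [{
        "@type": "cr:FileObject",
        "@id": "example_glider_dataset",
        "contentUrl": "https://example.org/erddap/tabledap/example_glider_dataset.json",
        "encodingFormat": "application/json",
    }],
    # recordSet: the column-level structure an ML loader needs
    "recordSet": [{
        "@type": "cr:RecordSet",
        "@id": "data",
        "field": [{
            "@type": "cr:Field",
            "@id": "data/TEMP",
            "dataType": "sc:Float",
            "source": {
                "fileObject": {"@id": "example_glider_dataset"},
                "extract": {"jsonPath": "$.table.rows[*][3]"},
            },
        }],
    }],
}

print(json.dumps(croissant, indent=2))
```

The `distribution` block carries the concrete access pathway, while each `field` pairs a data type hint with an extraction rule, which is exactly the information a plain Schema.org Dataset description lacks.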