Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Include isArray and arrayShape #807

Open
wants to merge 7 commits into
base: main
Choose a base branch
from
Open

Include isArray and arrayShape #807

wants to merge 7 commits into from

Conversation

ccl-core
Copy link
Contributor

@ccl-core ccl-core commented Feb 12, 2025

Addresses: #649

Note that this is a bit different than the original proposal in #649 :

{
          "@type": "cr:Field",
          "@id": "recordset/fieldname",
          "isArray": true,
          "arrayShape": [
            1,
            2,
            1
          ]
       ...
}

This is a bit difficult for us to process using mlcroissant... When we create the graph from the jsonld data, we would loose the order information; moreover, as we create a triple for each element in the arrayShape, and triples need to be unique, we would loose dimensions which have the same size.

As an alternative I propose to use a @list, so we keep the order info more nicely and we don't loose data:

{
          "@type": "cr:Field",
          "@id": "recordset/fieldname",
          "isArray": true,
          "arrayShape": {
            "@list": [
              1,
              2,
              1
            ]
          },
       ...
}

WDYT?

In this PR:

Still TODO: Remove repeated from all datasets and use isArray and arrayShape instead.

@ccl-core ccl-core requested a review from a team as a code owner February 12, 2025 13:01
Copy link

github-actions bot commented Feb 12, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@@ -106,6 +113,11 @@
"name": "record_set_plain_text/split",
"description": "Split to which the example belongs to.",
"dataType": "sc:Text",
"references": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: references coming after source was more readable.

@@ -27,6 +27,7 @@ def ML_COMMONS(ctx) -> rdflib.Namespace:
return ML_COMMONS_V_1_0


ML_COMMONS_ARRAY_SHAPE = lambda ctx: ML_COMMONS(ctx).arrayShape
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we write a test that the verifier fails if isArray is specified with no arrayShape? Or is this legit in the specs?

If there's an implicit behaviour that arrayShape defaults to (-1,), we could probably document this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After the discussion with Omar:

  • isArray without arrayShape is valid, and in that case arrayShape defaults to (-1,). The documentation should specify this behaviour.
  • arrayShape without isArray should make the verifier fail.

@benjelloun
Copy link
Contributor

As we discussed offline, arrayShape can be represented more compactly as a string that contains space separated dimensions, e.g., "1 2 -1 3", where -1 corresponds to a dimension where the length is not specified. Even though this approach is encoding structured information in a string, it's unambiguous and easy to parse, so I think it's acceptable. We already do accept such input in other places, e.g., for bounding boxes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants