records: CMS 2016 SIM record skeletons #3692

Merged


tiborsimko
Member

@tiborsimko tiborsimko commented Oct 25, 2024

  • feat(skeletons): add CMS 2016 SIM record skeletons

    This commit introduces CMS 2016 SIM record skeletons containing only
    persistent identifiers (title, record ID, DOI). The full record content
    is not stored in this Git repository due to its size (2.3 GB). The
    records are available in a separate tarball located at
    /eos/opendata/cms/upload/tibor/cms-2016-sim-20241025.zip.

  • ci(check-fixtures): check also record skeletons for persistent IDs

    Check also record skeletons with respect to the record ID and the DOI
    uniqueness.

  • ci(check-fixtures): parallelise fixture checking commands

    Introduces several independent run-tests.sh fixture-checking commands
    in order to speed up fixture checking by parallelisation.

    Renames run-tests.sh script options and CI rules to better separate
    data checks, formatting checks and linting checks.

    Adds data formatting checks and fixes several JSON data files.

    Adds shfmt formatting checks, commitlint, flake8 and yamllint
    linting checks.

    Removes pydocstyle formatting checks since we moved to black code
    formatter.

    Introduces ./run-tests.sh --help explaining all the checking options.

    Updates CI environment to Ubuntu 24.04 and latest actions
    (actions/checkout@v4, actions/setup-node@v4,
    actions/setup-python@v5).

    Amends .editorconfig to add rules for shell scripts and remove rules
    for ReST files that are no longer needed after switch to Markdown.

    BREAKING CHANGE: Refactors run-tests.sh script options.

Closes #3667
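
For readers who want a feel for what a record ID and DOI uniqueness check involves, here is a minimal sketch. It is not the repository's actual check script. It assumes that each skeleton file holds a JSON array of objects and that the identifiers live under the keys `recid` and `doi`; both key names and all function names are assumptions made for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def find_duplicates(records, key):
    """Return the values of `key` that occur in more than one record."""
    counts = Counter(r[key] for r in records if r.get(key) is not None)
    return sorted((value for value, n in counts.items() if n > 1), key=str)


def load_records(paths):
    """Load and concatenate the JSON arrays stored in the given files."""
    records = []
    for path in paths:
        records.extend(json.loads(Path(path).read_text()))
    return records


def check_persistent_ids(paths):
    """Map each checked field (assumed names) to its duplicated values."""
    records = load_records(paths)
    return {key: find_duplicates(records, key) for key in ("recid", "doi")}
```

Passing both the full record files and the skeleton files to `check_persistent_ids` in one call would catch an identifier that is reserved in a skeleton but reused by a regular record; an empty list for each key means no clashes were found.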

@tiborsimko tiborsimko force-pushed the cms-2016-sim-record-skeletons branch 2 times, most recently from b25cc21 to 952b104 Compare January 10, 2025 08:54
@tiborsimko tiborsimko self-assigned this Jan 10, 2025
@tiborsimko tiborsimko force-pushed the cms-2016-sim-record-skeletons branch 13 times, most recently from 68bea91 to e9c16fc Compare January 10, 2025 13:45
@tiborsimko tiborsimko force-pushed the cms-2016-sim-record-skeletons branch 4 times, most recently from 7a66f99 to 83b22ec Compare January 21, 2025 17:13
@psaiz
Contributor

psaiz commented Jan 27, 2025

Thanks for the ticket.
I'm very concerned about the implications of this approach. This would be the second set of files that are not stored on GitHub, and the approach for each of them seems to be different. Putting them in a directory where they are mixed with other files does not seem the best strategy.

What about putting the content of those entries on a different github repo, something like opendata_cms_2016? And we could do the same thing for the 40k entries that are not on any repo.

Then, on top of that, the current skeleton file is very big (220k lines). What about splitting it into multiple files?

@tiborsimko tiborsimko force-pushed the cms-2016-sim-record-skeletons branch from 83b22ec to e778a5b Compare February 3, 2025 10:44
@tiborsimko
Member Author

> This would be the second set of files that are not stored on github, and the approach for each of them seems to be different.

Yes, this is akin to using git lfs to store large files outside of the repository, but without actually using git lfs itself. Since more large open data releases are coming, we may want to rediscuss whether we would like to use git repositories per se, or move to relying on the database as the single source of truth for storing records.

> Putting them in a directory where they are mixed with other files does not seem the best strategy.

They are isolated from the rest, so no mixing is intended. I have now moved the zip file to /eos/opendata/cms/upload/opendata.cern.ch/data/records/cms-2016-sim-20241025.zip to make the separation clearer.

> What about putting the content of those entries on a different github repo, something like opendata_cms_2016? And we could do the same thing for the 40k entries that are not on any repo.

Let's discuss IRL to see whether the time has come to start using the DB as the SSOT, considering that we might soon have a use case for dynamic submission of LHCb ntuple records.

> Then, on top of that, the current skeleton file is very big (220k lines). What about splitting it in multiple files

The skeleton file is used only for "reserving" record IDs and DOIs, so it is not really consulted by humans. I could split it using the same structure as the zip file has, if desired:

cms-simulated-datasets-2016-part_01.json
cms-simulated-datasets-2016-part_02.json
...
cms-simulated-datasets-2016-part_87.json

This could provide a better one-to-one correspondence between "skeleton" files and "real record" files.
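
As a rough illustration of the split proposed above, the sketch below writes a list of skeleton records into numbered part files that follow the `cms-simulated-datasets-2016-part_NN.json` naming. The fixed `chunk_size` is purely an assumption for the example: the real split would follow the layout of the zip file, which is not shown in this thread, and the function names are hypothetical.

```python
import json
from pathlib import Path


def split_into_parts(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def part_filename(index, prefix="cms-simulated-datasets-2016"):
    """Build a zero-padded part file name such as <prefix>-part_01.json."""
    return f"{prefix}-part_{index:02d}.json"


def write_parts(records, chunk_size, outdir):
    """Write each chunk to its own part file and return the paths written."""
    outdir = Path(outdir)
    outdir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, chunk in enumerate(split_into_parts(records, chunk_size), start=1):
        path = outdir / part_filename(index)
        path.write_text(json.dumps(chunk, indent=2) + "\n")
        written.append(path)
    return written
```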

OTOH, some of the "reserved" record IDs were not released as open data during the curation process, due to being found invalid; so skeletons currently contain more than what was released. I can try to separate them...

@tiborsimko tiborsimko force-pushed the cms-2016-sim-record-skeletons branch 2 times, most recently from 7bcc339 to 3a2c10b Compare February 3, 2025 13:30
This commit introduces CMS 2016 SIM record skeletons containing only
persistent identifiers (title, record ID, DOI). The full record content
is not stored in this Git repository due to its size (2.3 GB). The
records are available in a separate tarball located at
`root://eospublic.cern.ch//eos/opendata/cms/upload/opendata.cern.ch/data/records/cms-2016-sim-20241025.zip`.

Note that some of the CMS 2016 SIM records were not released, e.g. due
to the dataset being found invalid. The skeleton file
`cms-simulated-datasets-2016-unreleased.json` keeps their
originally-designed record IDs and DOIs just to keep track, even though
they were not used and/or registered. This helps avoid any possible
mis-attribution.

Closes cernopendata#3667

Check also record skeletons with respect to the record ID and the DOI
uniqueness.

Closes cernopendata#3667

Introduces several independent `run-tests.sh` fixture-checking commands
in order to speed up fixture checking by parallelisation.

Renames `run-tests.sh` script options and CI rules to better separate
data checks, formatting checks and linting checks.

Adds data formatting checks and fixes several JSON data files.

Adds `shfmt` formatting checks, `commitlint`, `flake8` and `yamllint`
linting checks.

Removes `pydocstyle` formatting checks since we moved to `black` code
formatter.

Introduces `./run-tests.sh --help` explaining all the checking options.

Updates CI environment to Ubuntu 24.04 and latest actions
(`actions/checkout@v4`, `actions/setup-node@v4`,
`actions/setup-python@v5`).

Amends `.editorconfig` to add rules for shell scripts and remove rules
for ReST files that are no longer needed after switch to Markdown.

BREAKING CHANGE: Refactors `run-tests.sh` script options.
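
The idea behind running independent checks in parallel can be sketched as follows. This is only an illustration: the actual parallelisation in this commit comes from separate `run-tests.sh` commands and CI rules, whose option names are not listed in this thread. The function name and the shape of the `checks` mapping are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_checks_in_parallel(checks, max_workers=4):
    """Run independent zero-argument checks concurrently.

    `checks` maps a check name to a callable returning a truthy value on
    success. Returns the sorted names of checks that failed or raised.
    """

    def run_one(item):
        name, check = item
        try:
            return name, bool(check())
        except Exception:
            # A crashing check counts as a failure rather than aborting the run.
            return name, False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_one, checks.items()))
    return sorted(name for name, ok in results if not ok)
```

Because the checks do not depend on each other, the total wall-clock time approaches that of the slowest check rather than the sum of all of them.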
@tiborsimko tiborsimko force-pushed the cms-2016-sim-record-skeletons branch from 3a2c10b to 5e68444 Compare February 3, 2025 13:31
@tiborsimko
Member Author

> OTOH, some of the "reserved" record IDs were not released as open data during the curation process, due to being found invalid; so skeletons currently contain more than what was released. I can try to separate them...

Done. The record part files cms-simulated-datasets-2016-part_NN.json have been created, and the unreleased ones are kept separate in the cms-simulated-datasets-2016-unreleased.json file.
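
A separation of this kind could look roughly like the sketch below. It is a guess at the logic, not the script that was actually used. It assumes that each skeleton is a dict with a `recid` key and that the set of released record IDs is known from elsewhere; both are assumptions, as is the function name.

```python
def partition_skeletons(skeletons, released_recids):
    """Split skeleton records into (released, unreleased) lists.

    A skeleton counts as released when its `recid` (assumed key name)
    appears in `released_recids`; input order is preserved in both lists.
    """
    released_recids = set(released_recids)
    released, unreleased = [], []
    for record in skeletons:
        target = released if record["recid"] in released_recids else unreleased
        target.append(record)
    return released, unreleased
```

The `released` list would then feed the part files, while the `unreleased` list would go to the single unreleased skeleton file so that the reserved identifiers stay on record.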

@tiborsimko tiborsimko merged commit 5e68444 into cernopendata:master Feb 3, 2025
16 checks passed
@tiborsimko tiborsimko deleted the cms-2016-sim-record-skeletons branch February 3, 2025 13:58
Development

Successfully merging this pull request may close these issues.

CMS: add DOIs for 2016 MC
2 participants