How to keep all versions of the standard name table available on the web #296
-
@japamment @feggleton Related to the standard name repository. As discussed at the meeting today, the standard names data is large. GitHub has a limit of 1GB for the published size of the website, distinct from the repository size, which is actually quite reasonable for the CF website. The last build of the cf conventions website was 858MB. At the current size of each standard name publication, I suspect there is space remaining for fewer than 10 standard name table publications before we hit this limit and are unable to build/update the website.

As a possible solution, we discussed using a single git repository and utilizing tags for the versioned tables. I have prepared an example of how this could work: https://github.com/DocOtak/cf_standard_names If you browse the code, you will notice that only the "current" version is browsable; the checked-out version of this will never be larger than the current size of the most recent name table (currently approx. 18MB). The different versions have all been tagged: https://github.com/DocOtak/cf_standard_names/tags though I was very sparse on the tag messages. If, for example, you wanted to see the name table in version 78, you can browse the code: https://github.com/DocOtak/cf_standard_names/tree/78 You can also link directly to the name table xml: https://github.com/DocOtak/cf_standard_names/raw/78/cf_standard_names/src/cf-standard-name-table.xml

The different versions were loaded into the repository sequentially; git itself uses delta compression for storage. The compressed version of the entire history of the standard names is only 2.2MB. A potential downside of this method is that only the current version of the name table can be hosted directly on GitHub Pages, so only the "current" version would have the html interface on the website, though all the previous ones continue to exist within the git history.
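The tag-per-release workflow described above can be sketched with plain git commands. This is a minimal, self-contained illustration (the repo layout, file name, and version contents below are made up for the demo, not the actual CF repository); the point is that any past version remains retrievable from its tag without a separate stored copy:

```shell
# Sketch: one repo, one file per table, tags marking releases.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo

# "Release" version 77 and tag it.
echo '<version>77</version>' > cf-standard-name-table.xml
git add . && git commit -qm "standard name table v77"
git tag 77

# "Release" version 78 on top of it; git stores only the delta.
echo '<version>78</version>' > cf-standard-name-table.xml
git commit -qam "standard name table v78"
git tag 78

# Any past version is reconstructed on the fly from its tag:
git show 77:cf-standard-name-table.xml
```

On GitHub, the same `tag:path` lookup is what the `.../raw/78/...` URL above performs server-side.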
-
With our current system, each release of the standard name table includes an xml file of 4 Mbyte and html of 5 Mbyte, so that's O(10) Mbyte per release; 100 releases is 1 Gbyte. GitHub says, "We recommend repositories remain small, ideally less than 1 GB, and less than 5 GB is strongly recommended." The limit we are reaching is probably (as @ethanrd said in the meeting and @DocOtak said above) from the website. https://docs.github.com/en/pages/getting-started-with-github-pages/about-github-pages says, "GitHub Pages source repositories have a recommended limit of 1 GB. Published GitHub Pages sites may be no larger than 1 GB." That sounds like a hard limit.

It could mean that putting the standard names in their own repo would help. Any repo can be published with Pages, and I believe they count as separate websites. Moreover, https://docs.github.com/en/organizations/collaborating-with-groups-in-organizations/about-organizations says, "All organizations can own an unlimited number of public and private repositories." So if the standard names get too big for one repo, we can have more repos for the old ones. We can publish them all with Pages and still link them from our main website, I believe. If that's right, perhaps we don't need to change anything fundamentally.
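A quick back-of-envelope check of the arithmetic above, using the per-release sizes quoted in the comment:

```shell
# Sizes quoted above, in Mbyte per release.
xml_mb=4
html_mb=5
releases=100

# 100 releases at ~9 Mbyte each approaches the 1 Gbyte Pages limit.
echo "$(( (xml_mb + html_mb) * releases )) Mbyte"
```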
-
I think there are several alternative ways to deal with this or, rather, several smaller changes that will contribute towards a solution:
-
There is a third-party service to work around this using a Content Delivery Network: https://githubraw.com

I have created an example index linking to all files on the existing tags that @DocOtak created: https://cofinoa.github.io/cf-standard-names/index-cdn-githubraw.html Personally, I don't much like this version because it depends on a third-party service.

I have also created a GitHub Pages branch, where all existing tags/versions of the XML, HTML and KWIC (where they exist) files are dumped, and linked as: The raw size of the site is ~800MB, but the uploaded artifact is ~80MB. However, I can't find anywhere in my GitHub panel that shows how much storage the GitHub Pages site of my repo is using.
-
Dear Antonio @cofinoa,

Thanks for these experiments. What is the artifact? The HTML index page you've created is much smaller than 80 Mbyte, but the files it links could be 800 Mbyte. The "artifact" is something intermediate in size, it seems.

Best wishes,
Jonathan
-
Thanks, Antonio @cofinoa, but do you understand why the artifact is 10 times smaller than the files in the repo (as discussed in https://github.com/orgs/cf-convention/discussions/296#discussioncomment-8756870)?
-
Yes, I see. If I click on https://github.com/cofinoa/cf-standard-names/actions/runs/8851239187/artifacts/1451516880 it gets downloaded, and I can inspect the contents using unzip and tar.

If I understand correctly, what we'd like to devise is a way to store all versions of the standard name table in git, where they would be compressed, and still be able to view them in a browser as we can do now. That would be much more space-efficient than storing each version as independent documents. It works for xml files from @DocOtak's tagged repo, e.g. https://github.com/DocOtak/cf_standard_names/raw/78/cf_standard_names/src/cf-standard-name-table.xml, but it does not work for the corresponding html files, e.g. https://github.com/DocOtak/cf_standard_names/raw/83/cf_standard_names/build/cf-standard-name-table.html. Despite the suffix, the file is interpreted by the browser as text/plain. Do you have any idea of a way round that, @cofinoa? Thanks. J
-
In conventions issue 127, Lars comments that we might wish to assign a DOI for each version of the standard name table and each version of the other tables of controlled vocabulary, including DOIs which point at all times to the current versions. This is not the same issue as how to keep them on the web (the subject of this discussion), but I guess it may be related, so this is a good place to talk about it. Do we want to do it?
-
If we are broadening the discussion, which I think is useful, I suggest there are three more or less related issues that we should consider. In addition to the two already mentioned:
There is a third one:
While the first one is of a more general nature, the latter two are connected, in particular if we want to publish zipped files, in which case it might be more useful to bundle the xml, html and kwic versions together in one zip file.

Without going into too much technical detail, I just want to outline what 3 is about. Currently the directory for each standard name table contains a hierarchy of several subdirectories where the different formats end up in different places, together with supporting files that are the same across versions (or indeed not the same despite having the same filename). This is not helpful and can easily cause confusion, as was seen with respect to the XML schema files (see here). It appears that the XSL files for converting the xml version to the html version have the same problem: the same file name appears under the directory for each standard name table version despite substantial changes. I.e. versioning is not used, despite the file name containing a version element. In addition, there appears to be a problem with respect to what is the current version (see here).

I suggest that reorganising the directory structure will help us save space, get a better overview of the different versions of the supporting files, and pave the way to more efficient publishing of zipped versions of earlier table versions if we so decide. While the specifics of each of the three issues probably should be dealt with separately, I agree with Jonathan that it would be relevant to first have a more general conversation regarding all this.
-
Dear Antonio @cofinoa, Andrew @DocOtak, Lars @larsbarring, Alison @japamment, Fran @feggleton, Ellie @efisher008,

This is very clever! Thanks. Your javascript gives us a way for the browser to interpret a data file of html as a web page, without using a third-party service.

My understanding is this: Andrew @DocOtak's method uses git for version control of the standard name table files (xml, html, xls and whatever else is release-dependent), like we do for the conventions document. With this system, there would be just one directory for these files, rather than a separate directory for each release as there is now. Releases are identified by being given a tag. The tag provides a release-specific URL for each released version of each file, constituted on the fly from the repo, without all the versions being stored as static files in the way we have done up to now. This will make the repo much smaller, and thus we will no longer have difficulties with the size limit of GitHub Pages websites.

We could replace the links to the xml files on the vocabulary page with links to the xml files in the tagged versions of the repo. For the html of the standard name table, and any other file to be rendered as html, we need a small release-specific wrapper html file (not containing the content), which invokes Antonio's javascript to render the file from the tagged version of the repo.

Have I understood correctly, @cofinoa and @DocOtak? Would this address the issues you have summarised, Lars? Alison, Fran and Ellie, would you be willing to change to this new method of managing releases of the standard name table? I assume we could do the same with the other controlled vocabularies, but that is not urgent because they take up much less space.

Best wishes
Jonathan
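A hypothetical sketch of such a release-specific wrapper page, generated here with a shell heredoc. The wrapper file name is invented, the inline fetch-and-inject script merely stands in for Antonio's remoteload.js, and the raw URL is the raw.githubusercontent.com form of the tagged path from @DocOtak's example repo:

```shell
# Generate a tiny wrapper page for release 78. The page itself holds no
# table content; it fetches the tagged raw html at view time and injects it.
cat > wrapper-v78.html <<'EOF'
<!DOCTYPE html>
<html>
<body>
<div id="content">Loading standard name table v78&hellip;</div>
<script>
  // Fetch the release-specific raw file (served as text/plain by GitHub)
  // and render it inside this page, which IS served as text/html.
  fetch('https://raw.githubusercontent.com/DocOtak/cf_standard_names/78/cf_standard_names/build/cf-standard-name-table.html')
    .then(r => r.text())
    .then(t => { document.getElementById('content').innerHTML = t; });
</script>
</body>
</html>
EOF
wc -c wrapper-v78.html
```

One such kilobyte-scale wrapper per release is all the website would need to store, instead of a multi-Mbyte html copy per release.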
-
I'm just adding myself to the thread as this discussion is very interesting to me. Back in 2019, due to changes in GitHub add-on services, I had to remap from an external rawgit service to github-pages (see WCRP-CMIP/CMIP6_CVs#624 for more info). Being able to serve up multiple versions using URLs including the tag sounds like a great use of the system to me, assuming the changes I hit back in ~2019 (admittedly to external, non-GitHub services) don't recur. If I am interpreting this correctly, we could have a URL like wcrp-cmip.github.io/CMIP6_CVs/raw/SOMETAGIDENTIFER/somepage.html (or similar; I am missing the raw.githubusercontent.com etc), with SOMETAGIDENTIFER swappable between the latest tag, e.g. "6.2.58.68", and some distant past tag, e.g. "6.2.0.1", correct? @cofinoa, where did you find the info for https://github.com/cofinoa/cf-standard-names/blob/gh-pages/docs/raw.githubusercontent.com/remoteload.js?
-
Topic for discussion
In the meeting of the information management team yesterday (Monday 11th March 2024), we talked about the problem that the html and xml of the previous versions of the standard name table, currently in the website repo, are taking up most of the 1 Gbyte limit for GitHub Pages. We talked about various ways to address this. Among other things, we discussed creating a new vocabulary repo.