How to keep all versions of the standard name table available on the web #296
-
@japamment @feggleton Related to the standard name repository. As discussed at the meeting today, the standard names data is large. GitHub has a limit of 1GB for the published size of the website, distinct from the repository size, which is actually quite reasonable for the CF website. The last build of the cf conventions website was 858MB. At the current size of each standard name publication, I suspect there is space remaining for fewer than 10 standard name table publications before we hit this limit and are unable to build/update the website.

As a possible solution, we discussed using a single git repository and utilizing tags for the versioned tables. I have prepared an example of how this could work: https://github.com/DocOtak/cf_standard_names If you browse the code, you will notice that only the "current" version is browsable; the checked-out version of this will never be larger than the current size of the most recent name table (currently approx. 18MB). The different versions have all been tagged: https://github.com/DocOtak/cf_standard_names/tags though I was very sparse on the tag messages. If, for example, you wanted to see the name table in version 78, you can browse the code: https://github.com/DocOtak/cf_standard_names/tree/78 You can also link directly to the name table xml: https://github.com/DocOtak/cf_standard_names/raw/78/cf_standard_names/src/cf-standard-name-table.xml

The different versions were loaded into the repository sequentially; git itself uses delta compression for storage. The compressed version of the entire history of the standard names is only 2.2MB. A potential downside of this method is that only the current version of the name table can be hosted directly on GitHub Pages, so only the "current" version would have the html interface on the website, though all the previous ones continue to exist within the git history.
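The tag-per-release workflow described above can be sketched with plain git commands. This is a minimal, self-contained illustration (the repo layout, file name, and version contents below are made up for the demo, not the actual CF repository); the point is that any past version remains retrievable from its tag without a separate stored copy:

```shell
# Sketch: one repo, one file per table, tags marking releases.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo && cd repo
git config user.email demo@example.com
git config user.name demo

# "Release" version 77 and tag it.
echo '<version>77</version>' > cf-standard-name-table.xml
git add . && git commit -qm "standard name table v77"
git tag 77

# "Release" version 78 on top of it; git stores only the delta.
echo '<version>78</version>' > cf-standard-name-table.xml
git commit -qam "standard name table v78"
git tag 78

# Any past version is reconstructed on the fly from its tag:
git show 77:cf-standard-name-table.xml
```

On GitHub, the same `tag:path` lookup is what the `.../raw/78/...` URL above performs server-side.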
-
With our current system, each release of the standard name table includes an xml file of 4 Mbyte and html of 5 Mbyte, so that's O(10) Mbyte per release; 100 releases is 1 Gbyte. GitHub says, "We recommend repositories remain small, ideally less than 1 GB, and less than 5 GB is strongly recommended." The limit we are reaching is probably (as @ethanrd said in the meeting and @DocOtak said above) from the website. https://docs.github.com/en/pages/getting-started-with-github-pages/about-github-pages says, "GitHub Pages source repositories have a recommended limit of 1 GB. Published GitHub Pages sites may be no larger than 1 GB." That sounds like a hard limit.

It could mean that putting the standard names in their own repo would help. Any repo can be published with Pages, and I believe they count as separate websites. Moreover, https://docs.github.com/en/organizations/collaborating-with-groups-in-organizations/about-organizations says, "All organizations can own an unlimited number of public and private repositories." So if the standard names get too big for one repo, we can have more repos for the old ones. We can publish them all with Pages and still link them from our main website, I believe. If that's right, perhaps we don't need to change anything fundamentally.
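A quick back-of-envelope check of the arithmetic above, using the per-release sizes quoted in the comment:

```shell
# Sizes quoted above, in Mbyte per release.
xml_mb=4
html_mb=5
releases=100

# 100 releases at ~9 Mbyte each approaches the 1 Gbyte Pages limit.
echo "$(( (xml_mb + html_mb) * releases )) Mbyte"
```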
-
I think there are several alternative ways to deal with this or, rather, several smaller changes that will contribute towards a solution:
-
There is a third-party service to work around this using a Content Delivery Network: https://githubraw.com

I have created an example index linking to all files on the existing tags that @DocOtak created: https://cofinoa.github.io/cf-standard-names/index-cdn-githubraw.html Personally, I don't much like this version because it depends on a third-party service.

I have also created a GitHub Pages branch, where all existing tags/versions of the XML, HTML and KWIC (where they exist) files are dumped, and linked as: The raw size of the site is ~800MB, but the uploaded artifact is ~80MB. However, I can't find anywhere in my GitHub panel that shows how much storage the GitHub Pages site of my repo is using.
-
Dear Antonio @cofinoa,

Thanks for these experiments. What is the artifact? The HTML index page you've created is much smaller than 80 Mbyte, but the files it links could be 800 Mbyte. The "artifact" is something intermediate in size, it seems.

Best wishes,
Jonathan
-
Thanks, Antonio @cofinoa, but do you understand why the artifact is 10 times smaller than the files in the repo (as discussed in https://github.com/orgs/cf-convention/discussions/296#discussioncomment-8756870)?
-
Yes, I see. If I click on https://github.com/cofinoa/cf-standard-names/actions/runs/8851239187/artifacts/1451516880 it gets downloaded, and I can inspect the contents using unzip and tar.

If I understand correctly, what we'd like to devise is a way to store all versions of the standard name table in git, where they would be compressed, and still be able to view them in a browser as we can do now. That would be much more space-efficient than storing each version as independent documents. It works for xml files from @DocOtak's tagged repo, e.g. https://github.com/DocOtak/cf_standard_names/raw/78/cf_standard_names/src/cf-standard-name-table.xml, but it does not work for the corresponding html files, e.g. https://github.com/DocOtak/cf_standard_names/raw/83/cf_standard_names/build/cf-standard-name-table.html. Despite the suffix, the file is interpreted by the browser as text/plain. Do you have any idea of a way round that, @cofinoa? Thanks. J
-
In conventions issue 127, Lars comments that we might wish to assign a DOI for each version of the standard name table and each version of the other tables of controlled vocabulary, including DOIs which point at all times to the current versions. This is not the same issue as how to keep them on the web (the subject of this discussion), but I guess it may be related, so this is a good place to talk about it. Do we want to do it?
-
If we are broadening the discussion, which I think is useful, I suggest there are three more or less related issues that we should consider. In addition to the two already mentioned:
There is a third one:
While the first one is of a more general nature, the latter two are connected, in particular if we want to publish zipped files, in which case it might be more useful to bundle the xml, html and kwic versions together in one zip file.

Without going into too much technical detail, I just want to outline what 3 is about. Currently the directory for each standard name table contains a hierarchy of several subdirectories where the different formats end up in different places, together with supporting files that are the same across versions (or indeed not the same despite having the same filename). This is not helpful and can easily cause confusion, as was seen with respect to the XML schema files (see here). It appears that the XSL files for converting the xml version to the html version have the same problem: the same file name appears under the directory for each standard name table version despite substantial changes. I.e. versioning is not used, despite the file name containing a version element. In addition, there appears to be a problem with respect to what is the current version (see here).

I suggest that reorganising the directory structure will help us save space, get a better overview of the different versions of the supporting files, and pave the way to more efficient publishing of zipped versions of earlier table versions if we so decide. While the specifics of each of the three issues probably should be dealt with separately, I agree with Jonathan that it would be relevant to first have a more general conversation regarding all this.
-
Dear Antonio @cofinoa, Andrew @DocOtak, Lars @larsbarring, Alison @japamment, Fran @feggleton, Ellie @efisher008,

This is very clever! Thanks. Your javascript gives us a way for the browser to interpret a data file of html as a web page, without using a third-party service.

My understanding is this: Andrew @DocOtak's method uses git for version control of the standard name table files (xml, html, xls and whatever else is release-dependent), like we do for the conventions document. With this system, there would be just one directory for these files, rather than a separate directory for each release as there is now. Releases are identified by being given a tag. The tag provides a release-specific URL for each released version of each file, constituted on the fly from the repo, without all the versions being stored as static files in the way we have done up to now. This will make the repo much smaller, and thus we will no longer have difficulties with the size limit of GitHub Pages websites.

We could replace the links to the xml files on the vocabulary page with links to the xml files in the tagged versions of the repo. For the html of the standard name table, and any other file to be rendered as html, we need a small release-specific wrapper html file (not containing the content), which invokes Antonio's javascript to render the file from the tagged version of the repo.

Have I understood correctly, @cofinoa and @DocOtak? Would this address the issues you have summarised, Lars? Alison, Fran and Ellie, would you be willing to change to this new method of managing releases of the standard name table? I assume we could do the same with the other controlled vocabularies, but that is not urgent because they take up much less space.

Best wishes
Jonathan
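A hypothetical sketch of such a release-specific wrapper page, generated here with a shell heredoc. The wrapper file name is invented, the inline fetch-and-inject script merely stands in for Antonio's remoteload.js, and the raw URL is the raw.githubusercontent.com form of the tagged path from @DocOtak's example repo:

```shell
# Generate a tiny wrapper page for release 78. The page itself holds no
# table content; it fetches the tagged raw html at view time and injects it.
cat > wrapper-v78.html <<'EOF'
<!DOCTYPE html>
<html>
<body>
<div id="content">Loading standard name table v78&hellip;</div>
<script>
  // Fetch the release-specific raw file (served as text/plain by GitHub)
  // and render it inside this page, which IS served as text/html.
  fetch('https://raw.githubusercontent.com/DocOtak/cf_standard_names/78/cf_standard_names/build/cf-standard-name-table.html')
    .then(r => r.text())
    .then(t => { document.getElementById('content').innerHTML = t; });
</script>
</body>
</html>
EOF
wc -c wrapper-v78.html
```

One such kilobyte-scale wrapper per release is all the website would need to store, instead of a multi-Mbyte html copy per release.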
-
I'm just adding myself to the thread as this discussion is very interesting to me. Back in 2019, due to changes in GitHub add-on services, I had to remap from an external rawgit service to github-pages (see WCRP-CMIP/CMIP6_CVs#624 for more info). Being able to serve up multiple versions using URLs including the tag sounds like a great use of the system to me, assuming the changes I hit back in ~2019 (admittedly to external, non-GitHub services) don't recur. If I am interpreting this correctly, we could have a URL like wcrp-cmip.github.io/CMIP6_CVs/raw/SOMETAGIDENTIFER/somepage.html (or similar; I am missing the raw.githubusercontent.com etc), with SOMETAGIDENTIFER swappable between the latest tag, e.g. "6.2.58.68", and some distant past tag, e.g. "6.2.0.1", correct? @cofinoa, where did you find the info for https://github.com/cofinoa/cf-standard-names/blob/gh-pages/docs/raw.githubusercontent.com/remoteload.js?
-
Topic for discussion
In the meeting of the information management team yesterday (Monday 11th March 2024), we talked about the problem that the html and xml of the previous versions of the standard name table, currently in the website repo, are taking up most of the 1 Gbyte limit for GitHub Pages. We talked about various ways to address this. Among other things, we discussed creating a new vocabulary repo.