
fix: Remove GitHub asset checksum #333

Merged 1 commit into master on Dec 7, 2021

Conversation

rhcarvalho (Contributor) commented Dec 7, 2021

Upon close inspection of the code changes introduced in 57e9668, it turns out that the portion of code that compares the checksum of the local file against the uploaded file was ineffective at protecting against the bug introduced in c978bf6.

The checksum comparison operated on bytes already read from disk, so it essentially protected us from mishandling on GitHub's side, but not from reading and uploading incorrect data ourselves, specifically:

readFileSync(path, { encoding: 'binary' })

The bug was fixed in 4697dbd, but notice how from then on we started using readFileSync instead of createReadStream (a potential problem for large assets as we don't really need to hold all bytes in memory at once).
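To illustrate why `{ encoding: 'binary' }` is dangerous here: `'binary'` is an alias for latin1, so `readFileSync` returns a string rather than a `Buffer`, and when that string is later serialized as UTF-8 (the default when sending it over HTTP), every byte above 0x7F expands to two bytes. A minimal sketch (the file name and byte values are illustrative, not from Craft):

```typescript
import { readFileSync, writeFileSync } from 'fs';
import { tmpdir } from 'os';
import { join } from 'path';

// Hypothetical asset: 8 raw bytes, six of them >= 0x80.
const asset = join(tmpdir(), 'asset.bin');
writeFileSync(asset, Buffer.from([0x00, 0x7f, 0x80, 0xff, 0x80, 0xff, 0x80, 0xff]));

// 'binary' is an alias for latin1: we get back a *string*, not a Buffer.
const asString = readFileSync(asset, { encoding: 'binary' });

// Re-encoding that string as UTF-8 (what happens when it is sent over
// HTTP) turns each byte >= 0x80 into a two-byte sequence.
const sent = Buffer.from(asString, 'utf8');
console.log(sent.length); // 14, not the 8 bytes on disk
```

This inflation is exactly the kind of size mismatch shown in the reproduction below.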

This change reintroduces the use of createReadStream, and reverts the several iterations of trying to do checksums (57e9668, ea24f4d, and d14a988).

We chose to remove the checksum-related code because it is fragile, depends on undocumented APIs, and has proven problematic in recent attempts to make releases with Craft, leading to new incidents. And, as stated earlier, it doesn't really protect us from the original problem it was trying to address.

Note that we keep the size check introduced in the same commit, 57e9668, and we've shown that it, by itself, is capable of detecting the original bug, as described below.
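The surviving size check amounts to something like the following sketch (function and parameter names are ours, not Craft's actual code):

```typescript
import { statSync } from 'fs';

// Hypothetical sketch of the kept safeguard: compare the size GitHub
// reports for the uploaded asset against the size of the file on disk.
function checkAssetSize(localPath: string, name: string, uploadedSize: number): void {
  const localSize = statSync(localPath).size;
  if (uploadedSize !== localSize) {
    throw new Error(
      `Uploaded asset size (${uploadedSize} bytes) does not match ` +
        `local asset size (${localSize} bytes) for "${name}".`
    );
  }
}
```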

First we reintroduce the bug:

$ git diff
diff --git a/src/targets/github.ts b/src/targets/github.ts
index 8fb44cd..62aa2e1 100644
--- a/src/targets/github.ts
+++ b/src/targets/github.ts
@@ -1,6 +1,6 @@
 import { Octokit, RestEndpointMethodTypes } from '@octokit/rest';
 import { RequestError } from '@octokit/request-error';
-import { createReadStream, promises, statSync } from 'fs';
+import { readFileSync, promises, statSync } from 'fs';
 import { basename } from 'path';

 import { getConfiguration } from '../config';
@@ -362,7 +362,7 @@ export class GithubTarget extends BaseTarget {
     ).start();

     try {
-      const file = createReadStream(path);
+      const file = readFileSync(path, { encoding: 'binary' });
       const { url, size } = await this.handleGitHubUpload({
         ...params,
         // XXX: Octokit types this out as string, but in fact it also

Then we make a release and see that the asset size check fails:

$ cd craft
$ yarn build
$ cd release-tester
$ node --enable-source-maps ../craft/dist/craft publish 0.26.0 --target github --no-merge --no-input
...
Error: Uploaded asset size (440 bytes) does not match local asset size (413 bytes) for "gh-pages.zip".

Removing the above patch, rebuilding and running the release process again leads to a successful release (and one that doesn't depend on checksum checking).

We could re-introduce checksumming in the future, but a better approach would probably involve either using some (new) GitHub API that can provide a checksum (non-existent today) or fully downloading assets to verify them.

Additionally, it would be useful to perform asset-specific/project-specific sanity checks; for example, for binary executables we could download the asset and run `./binary --help` to ensure it succeeds, or `./binary --version` to ensure it produces the expected output.
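Such a post-release sanity check could be sketched like this (the helper name and the exact check are illustrative, not existing Craft functionality):

```typescript
import { execFileSync } from 'child_process';

// Hypothetical post-release check: run the downloaded executable and
// verify that its `--version` output mentions the version just released.
function checkBinaryVersion(binaryPath: string, expectedVersion: string): boolean {
  const output = execFileSync(binaryPath, ['--version'], { encoding: 'utf8' });
  return output.includes(expectedVersion);
}
```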


@rhcarvalho rhcarvalho merged commit 1e352b9 into master Dec 7, 2021
@rhcarvalho rhcarvalho deleted the rhcarvalho/no-checksum branch December 7, 2021 14:16
Review comment on src/targets/github.ts:

	     try {
	-      const file = readFileSync(path);
	+      const file = createReadStream(path);
BYK (Member):
Since most of the files we upload are small files, using readFileSync is actually safer and more efficient, especially compared to a read stream. This is a lesson we learned from npm folks when working on Yarn: yarnpkg/yarn#3539

rhcarvalho (Contributor, Author):
Thanks for the reference, @BYK.

The change from createReadStream to readFileSync was not documented in #290; it seemed like the motivation was just to get types to match (string) 🤔 which led to the broken asset uploads.

Later in #328 (comment) we were discussing streaming... (at that time I had not realized that we had changed from stream to read-all, also in #290, relatively recently), and the concern was memory usage.

I haven't benchmarked, but I believe for our use case with Craft the performance difference probably doesn't matter?

	readFileSync is actually safer and more efficient

I can intuitively understand the "more efficient", but, I'm curious, why do you say it is also "safer"?

rhcarvalho (Contributor, Author):
	The reason this is OK is because pretty much any files this would
	handle would fit neatly into memory (any npm packages MUST fit
	into memory by definition, because of the way npm@<5 does extracts).

	If you really want to make doubleplus sure to minimize memory usage,
	you could do an fs.stat to find the file size and then do heuristics
	to only use streams for files bigger than MB.

Quoted from yarnpkg/yarn#3539

Craft asset sizes, theoretically, don't really need to fit in memory (though we're probably not uploading any large assets in Sentry, and I don't even know what GitHub's limit is, tbh).

The second paragraph applies: we already do a stat on the asset file, so we could choose between readFileSync and createReadStream, but I'm not sure the added complexity is worth it. More code, more surface area for things to go wrong.
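For reference, the stat-based heuristic from the yarn thread would look roughly like this (the 1 MiB cutoff is an arbitrary assumption, not a value from Craft):

```typescript
import { createReadStream, readFileSync, statSync, ReadStream } from 'fs';

// Sketch of the heuristic quoted above: buffer small assets in memory,
// stream anything above an arbitrary 1 MiB threshold.
const STREAM_THRESHOLD = 1024 * 1024;

function openAsset(path: string): Buffer | ReadStream {
  return statSync(path).size > STREAM_THRESHOLD
    ? createReadStream(path)
    : readFileSync(path);
}
```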

BYK (Member):
	The change from createReadStream to readFileSync was not documented in #290, seemed like the motivation was just to get types to match (string) 🤔 which led to the broken asset uploads.

Yup, my bad for not making the reason for change explicit, sorry 😞

Re #328, I think we all decided that asset sizes not fitting into memory was a distant potential issue, hence me being content with readFileSync.

	I haven't benchmarked, but I believe for our use case with Craft the performance difference probably doesn't matter?

You may be surprised 🙂

	I can intuitively understand the "more efficient", but, I'm curious, why do you say it is also "safer"?

Because dealing with streams in Node is close to insanity. In a world now dominated by promises, streams still rely on events, and it is very easy to miss handling a case where the stream errored out, etc. You'd expect the Octokit library to handle that, but we've already seen that this API was not stable enough, given the bad typing information (yes, this sounds like FUD, but based on my past experience with streams, I'd try to avoid them at all costs).

Anyway, I just wanted to provide the context around it. This code worked with createReadStream for years so I don't think there's a strong case for going either way.
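For context on the event-handling pitfall mentioned above: a stream error arrives as an event, not a rejection, so forgetting the `'error'` handler crashes the process. Bridging a read stream into the promise world takes explicit wiring along these lines (a sketch, not Craft code):

```typescript
import { createReadStream } from 'fs';

// Turn the three relevant stream events into a single promise, so a
// missing file surfaces as a normal rejection rather than an
// unhandled 'error' event.
function readAll(path: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    createReadStream(path)
      .on('data', chunk => chunks.push(chunk as Buffer))
      .on('error', reject)
      .on('end', () => resolve(Buffer.concat(chunks)));
  });
}
```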

rhcarvalho (Contributor, Author):

Thanks @BYK, I really appreciate you taking the time to provide us so much insight 🙇 ❤️

rhcarvalho (Contributor, Author):

I cut #334 to whomever comes next maintaining Craft, no need to rush to make changes now.

Contributor:

TIL streams <> node.
