fix: Remove GitHub asset checksum #333
Conversation
Upon close inspection of the code changes introduced in 57e9668, it turns out that the portion of code that compares the checksum of the local file against the uploaded file was ineffective at protecting against the bug introduced in c978bf6.

The checksum comparison compared bytes after reading the asset file from disk, so it essentially protected us from mishandling on GitHub's side, but not from reading and uploading incorrect data in the first place, specifically:

```js
readFileSync(path, { encoding: 'binary' })
```

The bug was fixed in 4697dbd, but notice how from then on we started using `readFileSync` instead of `createReadStream` (a potential problem for large assets, since we don't really need to hold all bytes in memory at once).

This change reintroduces the use of `createReadStream` and reverts the several iterations of trying to do checksums (57e9668, ea24f4d, and d14a988).

We choose to remove the checksum-related code because it is fragile, depends on undocumented APIs, and has proven problematic in recent attempts to make releases with Craft, leading to new incidents. And, as stated earlier, because it doesn't really protect us from the original problem it was trying to address.

Note that we keep the size check introduced in the same commit 57e9668, and we've proved that it, by itself, is capable of detecting the original bug, as described below.

First we reintroduce the bug:

```
$ git diff
diff --git a/src/targets/github.ts b/src/targets/github.ts
index 8fb44cd..62aa2e1 100644
--- a/src/targets/github.ts
+++ b/src/targets/github.ts
@@ -1,6 +1,6 @@
 import { Octokit, RestEndpointMethodTypes } from '@octokit/rest';
 import { RequestError } from '@octokit/request-error';
-import { createReadStream, promises, statSync } from 'fs';
+import { readFileSync, promises, statSync } from 'fs';
 import { basename } from 'path';

 import { getConfiguration } from '../config';
@@ -362,7 +362,7 @@ export class GithubTarget extends BaseTarget {
     ).start();

     try {
-      const file = createReadStream(path);
+      const file = readFileSync(path, { encoding: 'binary' });
       const { url, size } = await this.handleGitHubUpload({
         ...params,
         // XXX: Octokit types this out as string, but in fact it also
```

```
$ cd craft
$ yarn build
```

Then we make a release and see that the asset size check fails:

```
$ cd release-tester
$ node --enable-source-maps ../craft/dist/craft publish 0.26.0 --target github --no-merge --no-input
...
Error: Uploaded asset size (440 bytes) does not match local asset size (413 bytes) for "gh-pages.zip".
```

Removing the above patch, rebuilding, and running the release process again leads to a successful release (one that doesn't depend on checksum checking).

We could re-introduce checksumming in the future, but a better approach would probably involve either using some (new) GitHub API that can provide a checksum (non-existent today) or fully downloading the uploaded assets.

Additionally, it would be useful to perform asset-specific/project-specific sanity checks, for example, for binary executables we could download the asset and run `./binary --help` to ensure it succeeds or `./binary --version` to ensure it produces the expected output.
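For reference, a minimal sketch of the kind of size check described above (hypothetical names, not Craft's actual code): compare the size GitHub reports for the uploaded asset against the size of the local file.

```ts
import { statSync } from 'fs';

// Hypothetical helper, for illustration only: throws an error mirroring the
// one shown in the failed release run above when the uploaded size differs
// from the local file size.
function assertUploadedSizeMatches(path: string, uploadedSize: number, assetName: string): void {
  const localSize = statSync(path).size;
  if (uploadedSize !== localSize) {
    throw new Error(
      `Uploaded asset size (${uploadedSize} bytes) does not match local asset size (${localSize} bytes) for "${assetName}".`
    );
  }
}
```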
```diff
     try {
-      const file = readFileSync(path);
+      const file = createReadStream(path);
```
Since most of the files we upload are small files, using `readFileSync` is actually safer and more efficient, especially compared to a read stream. This is a lesson we learned from the npm folks when working on Yarn: yarnpkg/yarn#3539
Thanks for the reference, @BYK.

The change from `createReadStream` to `readFileSync` was not documented in #290; it seemed like the motivation was just to get types to match (`string`) 🤔 which led to the broken asset uploads.

Later in #328 (comment) we were discussing streaming... (at that time I had not realized that we changed from stream to read-all also in #290, relatively recently), and the concern was memory usage.

I haven't benchmarked, but I believe for our use case with Craft the performance difference probably doesn't matter?

> readFileSync is actually safer and more efficient

I can intuitively understand the "more efficient", but, I'm curious, why do you say it is also "safer"?
> The reason this is OK is because pretty much any files this would handle would fit neatly into memory (any npm packages MUST fit into memory by definition, because of the way npm@<5 does extracts). If you really want to make doubleplus sure to minimize memory usage, you could do an fs.stat to find the file size and then do heuristics to only use streams for files bigger than MB.

Quoted from yarnpkg/yarn#3539
Craft asset sizes, theoretically, don't really need to fit in memory (though we're probably not uploading any large assets in Sentry, and I don't even know what GitHub's limit is, tbh).

The second paragraph applies -- we already do a `stat` on the asset file, so we could choose between `readFileSync` and `createReadStream`, but I'm not sure the added complexity is worth it. More code, more surface area for things to go wrong.
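A minimal sketch of the size-based heuristic discussed above, assuming a hypothetical helper that is not part of Craft; the 10 MB cutoff is an arbitrary assumption for illustration.

```ts
import { createReadStream, readFileSync, statSync } from 'fs';
import type { ReadStream } from 'fs';

// Arbitrary threshold chosen for illustration only.
const STREAM_THRESHOLD_BYTES = 10 * 1024 * 1024;

// Hypothetical helper: pick between reading the whole file and streaming it.
function openAssetForUpload(path: string): Buffer | ReadStream {
  const { size } = statSync(path);
  // Small assets: read fully into memory (simple, no stream error handling).
  // Large assets: stream, so we don't hold all bytes in memory at once.
  return size <= STREAM_THRESHOLD_BYTES ? readFileSync(path) : createReadStream(path);
}
```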
> The change from createReadStream to readFileSync was not documented in #290, seemed like the motivation was just to get types to match (string) 🤔 which led to the broken asset uploads.

Yup, my bad for not making the reason for the change explicit, sorry 😞

Re #328, I think we all decided that the asset sizes not fitting into memory was a distant potential issue, hence me being content with `readFileSync`.

> I haven't benchmarked, but I believe for our use case with Craft the performance difference probably doesn't matter?

You may be surprised 🙂

> I can intuitively understand the "more efficient", but, I'm curious, why do you say it is also "safer"?

Because dealing with streams in Node is close to insanity. In a world dominated by promises nowadays, streams still rely on events, and it is very easy to not handle the case where the stream errored out, etc. Now, you'd expect the Octokit library to handle that, but we've already seen that this API was not stable enough, with the bad typing information (yes, this sounds like FUD, but essentially, based on my past experience with streams, I'd try to avoid them at all costs).

Anyway, I just wanted to provide the context around it. This code worked with `createReadStream` for years, so I don't think there's a strong case for going either way.
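To make the "easy to mishandle" point concrete, here is an illustrative sketch (not Craft code; names are made up) of the boilerplate a raw read stream requires. Dropping the `error` handler turns a missing or unreadable file into an unhandled `'error'` event instead of a rejected promise.

```ts
import { createReadStream } from 'fs';

// Illustrative only: read a file via a stream, collecting chunks into a Buffer.
function readViaStream(path: string): Promise<Buffer> {
  return new Promise((resolve, reject) => {
    const chunks: Buffer[] = [];
    createReadStream(path)
      .on('data', (chunk) => chunks.push(Buffer.from(chunk)))
      .on('end', () => resolve(Buffer.concat(chunks)))
      .on('error', reject); // the handler that is easy to forget
  });
}
```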
Thanks @BYK, I really appreciate you taking the time to provide us so much insight 🙇 ❤️
I cut #334 for whoever maintains Craft next; no need to rush to make changes now.
TIL streams <> node.