
feat: implement 0 dependency streaming multipart/form-data parser #1851

Closed
wants to merge 36 commits

Conversation

KhafraDev
Member

KhafraDev commented Jan 9, 2023

This PR implements a multipart/form-data parser with a 1:1 match to busboy's API. It took me about a week and many, many hours to complete. I believe docs and types are still needed, but I don't feel like adding them right now.

Tests from busboy are included, and every single test works without any modifications!

  • docs
  • benchmarks
  • types
  • replace busboy in fetch
  • use c8/nyc for coverage in busboy tests (busboy uses vanilla node for testing)
  • ??

Bug(s):

  • path.basename returns different values on Windows & Linux (C:\\files\\1k_b.dat); fixed in 56052ec
  • reading headers is extremely slow; fixed in b7b9645, parsing headers should now be O(n)
  • in certain circumstances FileStream will not emit the 'end' event; fixed in dd02fe5

@codecov-commenter

codecov-commenter commented Jan 9, 2023

Codecov Report

Base: 90.43% // Head: 90.26% // Decreases project coverage by -0.17% ⚠️

Coverage data is based on head (e21de08) compared to base (06f77a9).
Patch coverage: 87.53% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1851      +/-   ##
==========================================
- Coverage   90.43%   90.26%   -0.17%     
==========================================
  Files          71       76       +5     
  Lines        6137     6463     +326     
==========================================
+ Hits         5550     5834     +284     
- Misses        587      629      +42     
Impacted Files Coverage Δ
lib/formdata/util.js 85.71% <85.71%> (ø)
lib/formdata/parser.js 86.07% <86.07%> (ø)
lib/formdata/textsearch.js 93.75% <93.75%> (ø)
index.js 99.00% <100.00%> (+0.02%) ⬆️
lib/fetch/body.js 97.15% <100.00%> (ø)
lib/fetch/dataURL.js 88.46% <100.00%> (+0.38%) ⬆️
lib/formdata/constants.js 100.00% <100.00%> (ø)
lib/formdata/filestream.js 100.00% <100.00%> (ø)
lib/fetch/file.js 89.65% <0.00%> (-1.15%) ⬇️
lib/fetch/index.js 84.93% <0.00%> (+0.18%) ⬆️


☔ View full report at Codecov.

@kibertoad
Contributor

@KhafraDev How does the perf compare?

@KhafraDev
Member Author

Needs benchmarks, but it's probably slower in some areas and the same in others. Busboy uses streamsearch (which implements the Boyer-Moore-Horspool algorithm) to search for boundaries after bodies, etc., while this implementation is much lazier/easier: rather than doing that, I join chunks together until one of them contains the boundary, and then split the chunks accordingly. Of course, without benchmarks, most of that is just speculation about what's faster.
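For illustration, a minimal sketch of the chunk-joining approach described above; the names here (makeBoundarySplitter, onPart) are hypothetical, not the PR's actual code:

```js
// A minimal sketch, not the PR's implementation: accumulate chunks
// until the boundary shows up, then split the buffered data at it.
// `boundary` would be a Buffer like Buffer.from('\r\n--' + boundaryStr).
function makeBoundarySplitter (boundary) {
  let buffered = Buffer.alloc(0)

  return function onChunk (chunk, onPart) {
    buffered = Buffer.concat([buffered, chunk])

    let index
    while ((index = buffered.indexOf(boundary)) !== -1) {
      // Everything before the boundary belongs to the current part.
      onPart(buffered.subarray(0, index))
      // Keep whatever follows the boundary for the next search.
      buffered = buffered.subarray(index + boundary.length)
    }
  }
}
```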

@jimmywarting
Contributor

Busboy uses streamsearch (which implements the Boyer-Moore-Horspool algorithm) to search for boundaries after bodies, etc.

multipart/form-data is really old and historical; it isn't the best solution for parsing and sending files.
Something better would be more structural data that tells you how large each field is, so that you could read x amount of data instead of scanning for a boundary.
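For reference, a raw multipart/form-data body looks roughly like this (boundary and field names invented); nothing declares how long each part is, so a parser has to scan for the boundary delimiter:

```
--boundary123
Content-Disposition: form-data; name="field1"

value1
--boundary123
Content-Disposition: form-data; name="file1"; filename="a.txt"
Content-Type: text/plain

(file contents)
--boundary123--
```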

I asked if we could at least add a content-length to each and every field (not just the total body size); that would make the boundary a bit more obsolete, but the discussion went idle.

You know what would be awesome? Having CBOR support baked into fetch! await new Response(data).cbor()
It would totally kick formdata, json, text and blobs out of existence by having more structural data support. It would more or less support the same things that you can do with an object and structuredClone. (Just wishful thinking.)

@KhafraDev
Member Author

I asked if we could at least add content-length to each and every field.

I was actually wondering why there wasn't a content-length header -- it's a really poor design overall. What makes websocket frame parsing so much easier is that you know the exact payload length!
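By way of contrast, a sketch of how a WebSocket frame declares its payload length up front (per RFC 6455); `frame` is assumed to be a Buffer holding the start of an unmasked frame:

```js
// RFC 6455: the low 7 bits of the second header byte encode the
// payload length, with 126/127 signalling extended length fields.
function readPayloadLength (frame) {
  const len = frame[1] & 0x7f

  if (len <= 125) return len                    // fits in 7 bits
  if (len === 126) return frame.readUInt16BE(2) // 16-bit extended length
  return frame.readBigUInt64BE(2)               // 127: 64-bit extended length
}
```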

KhafraDev changed the title feat: implement 0 dependency multipart/form-data parser → feat: implement 0 dependency streaming multipart/form-data parser Jan 10, 2023
```js
// 1. Let result be the empty string.
let result = ''

// 2. While position doesn’t point past the end of input and the
//    code point at position within input meets the condition condition:
while (position.position < input.length && condition(input[position.position])) {
  // 1. Append that code point to the end of result.
  if (inputIsString) {
    result += input[position.position]
```
Member

If inputIsString you might skip the while loop and provide a fast path for this function.

Member Author

We can't skip the loop entirely; we need to increase the index and check if the character matches the condition.

```js
  }
}

return false
```
Member

Would chunk.find be faster? Inlined functions are faster most of the time due to V8 optimizations.

Member Author

I doubt it; you can't specify a position to start at, so I'd have to split the buffer first.

@KhafraDev
Member Author

I'm much more interested in replacing the asynchronous parsing with a synchronous parser. Not only is it actually spec compliant, but it will finally clean up the body consumers. It doesn't make sense to parse a multipart/form-data body asynchronously if it's already in memory. It's also many times slower than Deno/bun (https://twitter.com/jarredsumner/status/1625067963656335361).

cc @jimmywarting I know you disagree with this, do you have any counterarguments before I replace busboy with the proposed alternative?

@mcollina
Member

I don't think a synchronous multipart parser would work for us long term.
Node.js just gained support for disk-backed Blobs. This means we can stream files to disk and then create Blobs that way and avoid memory leaks. However, we can do that only if we have a streaming multipart parser.
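For context, a minimal sketch of that flow, assuming Node's fs.openAsBlob (added in v19.8.0) and a hypothetical partStream of file bytes:

```js
const fs = require('node:fs')
const { pipeline } = require('node:stream/promises')

// Stream an incoming file part to disk, then expose it as a Blob
// whose bytes live on disk instead of in memory.
async function partToBlob (partStream, tmpPath) {
  await pipeline(partStream, fs.createWriteStream(tmpPath))
  return fs.openAsBlob(tmpPath) // disk-backed Blob
}
```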

@KhafraDev
Member Author

This doesn't seem to be an issue in Deno/bun (or any of the browsers, although their use case is different). Since request.body is a ReadableStream, we could document formData's downsides and show an example of asynchronously parsing the body with busboy (or this async parser). The async approach makes this section of the code much harder to maintain, and many of the WPTs regarding the formdata body mixin are disabled :(.

There is also a proposal to add a limit to the max body size a FormData can have (whatwg/fetch#1592), and it's already an option on undici's Agent. Undici also has alternative APIs that are much better suited for large bodies. IIRC the main users of .formData are library authors who want cross-env compatibility; I wonder how many users are actively using it?

One more note: it's incredibly slow. https://twitter.com/jarredsumner/status/1625067963656335361

@KhafraDev
Member Author

There's also the issue of how a multipart/form-data body should be parsed: there's no actual spec for it. The WPT coverage sucks as well, while it's really good for every other mixin.

Personally I'd like to follow the spec and the other platforms, which have all implemented the parsing synchronously. See #1694 for proof.

@mcollina
Member

I don't think we could ever be in agreement on this one. Crashing with out-of-memory errors when receiving files is not safe.
It's better not to ship a feature than to ship one that will crash everything: the number of vulnerabilities we would receive for this one would be massive.

Last time I checked, it seemed that Chrome used disk-based Blobs for at least some of those cases.

The difference with .json() and the others exists because we need all the data to produce a JS object, which is not the case for this one.

Node.js has just landed support for disk-backed Blobs, so we will soon be able to use this ;).

@jimmywarting
Contributor

jimmywarting commented Feb 25, 2023

I don't think a synchronous multipart parser would work for us long term.
Node.js just gained support for disk-backed Blobs. This means we can stream files to disk and then create Blobs that way and avoid memory leaks. However, we can do that only if we have a streaming multipart parser.

I agree that disk-based blobs/files would be the best option and that async streaming is best for not hitting the RAM limit. I would rather have something that works and runs a bit slower than (as @mcollina puts it) "not ship a feature at all" because of the risk of out-of-memory issues.

It would also be technically possible to write the whole formdata payload to one single file first and then slice it (virtually, by just changing the offset + size of where it should start/stop reading from):

```js
const fs = require('node:fs')

// fs.openAsBlob returns a Promise<Blob>, hence the await
const blob = await fs.openAsBlob(path) // could include all formdata entries
const file1 = new File([blob.slice(100, 1024)], entry.name, { type: entry.mimeType })
```

So if you want to write all the data to disk first and then synchronously search the file for each boundary to look up where each file entry begins/ends, that could also be an option.


Currently undici's URL-encoded formdata decoder is synchronous, because it doesn't include files, probably isn't as large as multipart formdata payloads are, and because it uses URLSearchParams to decode everything.

So what I'm wondering is: is this (https://twitter.com/jarredsumner/status/1625067963656335361) measuring a URL-encoded payload or a multipart formdata-encoded payload? Because those are two completely different paths to solving the issue at hand.

we could document formData's downsides

If we are going to document the downsides of formdata, then we shouldn't provide an alternative solution for how people should do it themselves using busboy and streams; it should just work as it's intended to work.
If we want to say that multipart/formdata is slower, then that's fine: recommend that people use URL-encoded payloads for faster parsing, or use JSON, instead. multipart/formdata is intended for file uploads, so that is what it should be used for.

So if we should document anything at all, then it should be:

If you intend to send files then by all means use FormData() to send the payload, but the multipart/formdata parsing will be slower than URL-encoded payloads. If you intend to send just simple forms with text, use new URLSearchParams() instead of new FormData() when you want to decode the payload with await body.formData(). If you intend to send really big files, upload just raw bytes using { body: blob }, as that has better streaming compatibility. Choose the right encoding for the job at hand.

@KhafraDev
Member Author

KhafraDev commented Feb 25, 2023

I agree that we probably won't find middle ground for this 😄.

it should just work as it's intended to work.

That's partially the issue: it doesn't work as intended, if by "intended" we're referring to the spec and/or user expectations. There are other, slightly less noticeable issues, and we're already doing pretty crappy workarounds to match the spec:

undici/lib/fetch/body.js, lines 461 to 462 in 06f77a9:

```js
// Wait a tick before checking if the request has been aborted.
// Otherwise, a TypeError can be thrown when an AbortError should.
```


Anyways, this branch is mostly done: it passes all of busboy's tests and the same set of WPTs that are enabled. I'll finish up the docs eventually and then I think we're good to go? Although I've been holding off because I know there will inevitably be issues that I won't have the motivation to fix...

Sticking with busboy is also problematic because it doesn't seem well maintained; there are a number of issues and pull requests open. Plus I've already spent many hours working on this lol

@ronag
Member

ronag commented Apr 9, 2023

@KhafraDev What's the status on this? Are we using file backed Blob?

@KhafraDev
Member Author

It's mostly done, I just need to spend a couple of hours adding docs and cleaning stuff up. Is there a way to opt in to using the file-backed Blobs, or is it automatic?

@jimmywarting
Contributor

jimmywarting commented Apr 10, 2023

Is there a way to opt in to using the file-backed Blobs, or is it automatic?

Perhaps if content-length is lower than x bytes, it would construct blobs in memory?
Also, what if you were to use response.blob()? Would it be backed by the fs then?

EDIT: I tested response.blob in Chrome and looked at ~/Library/Application Support/Google/Chrome/Default/blob_storage, and found that if the response was small it would just give you an in-memory blob.

But if the response was over a certain size, it would be offloaded to disk.
So in the browser this pretty much happens automatically, although you would also need to clean up the temporarily stored file when there is no reference to it anymore...

I suppose the same logic could be applied to FormData: if it encounters a file, start reading n bytes; if it's larger than that, dump it to the file system and pipe the rest of the data to that file.
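A rough sketch of that threshold idea, with hypothetical names (THRESHOLD, collectPart, partStream) and again assuming Node's fs.openAsBlob:

```js
const fs = require('node:fs')
const { once } = require('node:events')
const { finished } = require('node:stream/promises')

const THRESHOLD = 1024 * 1024 // hypothetical 1 MiB cutoff

// Keep a part in memory up to THRESHOLD bytes; past that, spill
// everything to a temp file and return a disk-backed Blob instead.
async function collectPart (partStream, tmpPath) {
  const chunks = []
  let size = 0
  let out = null // write stream, created lazily once we spill

  for await (const chunk of partStream) {
    if (out !== null) {
      // Already spilled: keep appending to the file, honouring backpressure.
      if (!out.write(chunk)) await once(out, 'drain')
      continue
    }

    chunks.push(chunk)
    size += chunk.length

    if (size > THRESHOLD) {
      // Crossed the cutoff: dump the buffered chunks to disk.
      out = fs.createWriteStream(tmpPath)
      out.write(Buffer.concat(chunks))
      chunks.length = 0
    }
  }

  if (out !== null) {
    out.end()
    await finished(out)
    return fs.openAsBlob(tmpPath) // disk-backed Blob (Node >= 19.8)
  }

  return new Blob(chunks) // small part stays in memory
}
```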

@ronag
Member

ronag commented Feb 25, 2024

@KhafraDev did you give up on this?

@KhafraDev
Member Author

yes, the parsing should be done synchronously to match the spec.

KhafraDev closed this Feb 25, 2024