
Conversation

tedzhouhk

No description provided.


itay commented Jul 15, 2025

@tedzhouhk I'm curious, have we considered doing this over HTTP rather than externalizing the request data into external storage?

For reference, we considered this at Octo, but ended up going the HTTP route. Specifically, we used a queue (in our case, Redis) to write the fact that there was a request and metadata about who held the request (which "frontend"), and then the worker who picked it up would call the original holder to get the actual data. This was to handle the exact use case here of large bodies, such as input images/videos/etc.

The issue with S3 is that it can be both slow and have the costs add up, especially at high request rates where you are paying more for the operations than for the storage. The HTTP route also avoids needing another external dependency (S3/MinIO/whatnot) and keeps everything cluster-internal.
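For readers skimming the thread, here is a minimal sketch of the queue-plus-callback pattern described above, assuming redis-py and requests; the queue name, endpoint path, and wiring are illustrative guesses, not the actual Octo implementation:

```python
# Sketch: frontend keeps the large body local and publishes only metadata;
# the worker pops the metadata and pulls the body over HTTP from the holder.
import json
import redis
import requests

r = redis.Redis(host="localhost", port=6379)
REQUEST_QUEUE = "pending_requests"  # illustrative queue name

def frontend_enqueue(request_id: str, frontend_addr: str) -> None:
    """Frontend: write the fact that a request exists plus who holds it."""
    r.lpush(REQUEST_QUEUE, json.dumps({
        "request_id": request_id,
        "holder": frontend_addr,  # which "frontend" holds the actual body
    }))

def worker_dequeue_and_fetch() -> bytes:
    """Worker: pop metadata, then call the original holder for the data."""
    _, raw = r.brpop(REQUEST_QUEUE)
    meta = json.loads(raw)
    # The holder exposes the body at a per-request URL (assumed endpoint).
    resp = requests.get(f"http://{meta['holder']}/requests/{meta['request_id']}")
    resp.raise_for_status()
    return resp.content
```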

tedzhouhk (Author)

@itay good question, I had similar doubts but was convinced by @ryanolson that S3 is fast enough if we decide to store the payloads on disk (the time to process those long requests should be far longer than the time to pull them from S3). If we decide to store them in CPU RAM, then I think we should seek other solutions.


itay commented Jul 15, 2025

I'd mostly question whether it's necessary. The request will likely end up in RAM anyway as we transmit it, and I'm not sure we generally have a huge queue of requests backing up.

ryanolson

We can absolutely store them in CPU buffers, in which case we could use NIXL or HTTP to fetch the data.

HTTP would be pretty simple. NIXL is also doable but probably overkill.

I do think S3 will be the easiest (essentially two lines of code with the S3 SDK), and we will want object storage for multimodal in the future.

It would be interesting to benchmark, but I'm betting the perf delta at 1 MiB+ is relatively small.
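To make the "two lines" claim concrete, a sketch of the S3 path with boto3 (the bucket and key names here are placeholders):

```python
# Sketch: offload the large request body to object storage and pull it back.
import boto3

s3 = boto3.client("s3")
request_id = "req-123"                             # placeholder id
request_body = b"...large multimodal payload..."   # placeholder body

# Frontend: store the body under a per-request key.
s3.put_object(Bucket="dynamo-requests", Key=request_id, Body=request_body)

# Worker: retrieve it when the request is picked up.
body = s3.get_object(Bucket="dynamo-requests", Key=request_id)["Body"].read()
```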

nnshah1 (Contributor) commented Jul 16, 2025

A couple of items here: we should ideally reuse or extend the scheme @whoisj added for transmitting payloads over NIXL.

Would that actually work here? Could we just use NIXL for transport? We have the ability to add that to the payload and recover it today for the E/P/D work.

Another thought: why not split the request into multiple NATS messages?
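A rough sketch of that chunking idea with nats-py (NATS caps payloads at 1 MiB by default; the subject and header scheme are illustrative assumptions):

```python
# Sketch: split one oversized request body across several NATS messages.
import nats

CHUNK_SIZE = 512 * 1024  # stay well under NATS's 1 MiB default max_payload

async def publish_chunked(subject: str, request_id: str, body: bytes) -> None:
    nc = await nats.connect("nats://localhost:4222")
    chunks = [body[i:i + CHUNK_SIZE] for i in range(0, len(body), CHUNK_SIZE)]
    for idx, chunk in enumerate(chunks):
        # Receiver groups by request-id and reassembles chunks in order.
        await nc.publish(subject, chunk, headers={
            "request-id": request_id,
            "chunk-index": str(idx),
            "chunk-total": str(len(chunks)),
        })
    await nc.drain()
```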

vvenkates27

@ryanolson NIXL also already supports S3 through an object-storage plugin backed by the AWS S3 SDK; we can use this today to store and retrieve through an S3 object store.
