Support for parallel upload of large artifacts #573

@syed

Issue

The OCI spec restricts artifact uploads to being serial in nature. The upload has to be processed in order because the checksum of the incoming data must be calculated, and the SHA-256 algorithm is not "distributable": it has to be run serially over the data. This creates problems when uploading large artifacts such as ML models, and it makes OCI unattractive for large artifacts compared with object storage such as S3, which supports multipart uploads.
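To illustrate the ordering constraint, here is a minimal sketch using Python's standard `hashlib`. The helper name `digest_in_order` and the sample data are made up for illustration; the point is only that a single SHA-256 state gives the right answer when chunks are fed in their original order and a different one otherwise.

```python
import hashlib


def digest_in_order(chunks):
    """Feed chunks into one SHA-256 state, in the order given."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


# Illustrative data: three 1 KiB chunks with different contents.
data = b"a" * 1024 + b"b" * 1024 + b"c" * 1024
chunks = [data[i:i + 1024] for i in range(0, len(data), 1024)]

# Streaming the chunks in order matches the one-shot digest of the blob.
in_order = digest_in_order(chunks)

# Feeding the same chunks in a different order yields a different digest,
# so a single running hash cannot consume chunks as they arrive in parallel.
out_of_order = digest_in_order(reversed(chunks))
```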

There have been prior attempts at addressing this, where the idea was to support out-of-order chunked uploads. However, this leaves assembly and checksum validation to the registry, which might take a long time to do it or might not have the resources to pull a large blob into memory or onto disk to calculate the final checksum.

Use cases

Large artifacts are becoming prevalent in the OCI space. A few examples:

  • AI models
  • VM images
  • DB backups
  • Binaries for libs

Proposal

The proposal is to introduce a new layer mediaType that is an indirection to an index of chunks.

The chunk index is another blob that holds a list of chunks with their sizes and offsets. When uploading, the client splits the file into chunks and can upload each chunk in parallel. Once all the chunks are uploaded, the client creates the chunk index and pushes it, then finally creates the manifest, which references the chunk index blob.
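The client-side flow described above could be sketched roughly as follows. This is only an illustration under assumptions: the media type string, the JSON field names (`chunks`, `digest`, `size`, `offset`), and the helper functions are all hypothetical and not part of any spec, and the actual upload of each chunk is left as a comment.

```python
import hashlib
import json
from concurrent.futures import ThreadPoolExecutor

# Hypothetical media type; the proposal does not name one.
CHUNK_INDEX_MEDIA_TYPE = "application/vnd.example.chunk-index.v1+json"


def split_chunks(data: bytes, chunk_size: int):
    """Return (offset, chunk_bytes) pairs covering the whole blob."""
    return [(off, data[off:off + chunk_size])
            for off in range(0, len(data), chunk_size)]


def describe_chunk(item):
    """Hash one chunk independently of all the others."""
    offset, chunk = item
    # A real client would also upload the chunk as an ordinary blob here.
    return {
        "digest": "sha256:" + hashlib.sha256(chunk).hexdigest(),
        "size": len(chunk),
        "offset": offset,
    }


def build_chunk_index(data: bytes, chunk_size: int, workers: int = 4):
    """Process chunks in parallel, then assemble the (hypothetical) index."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so offsets stay sorted.
        entries = list(pool.map(describe_chunk, split_chunks(data, chunk_size)))
    return {"mediaType": CHUNK_INDEX_MEDIA_TYPE, "chunks": entries}


index = build_chunk_index(b"x" * 2500, chunk_size=1000)
print(json.dumps(index, indent=2))
```

Each chunk is hashed on its own, so no step needs to see the whole artifact in order. The index is pushed only after every chunk has been described (and, in a real client, uploaded).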

The advantage here is that there is no extra processing required on the registry side. There is no reassembly required on the server, which means the blob can be "committed" on the registry as soon as the final chunk is uploaded.

Considerations/Issues

  1. Older clients and registries will not be able to support this and we have to find some way of being backwards compatible.
  2. Since there is no full reassembly on the registry, the full artifact is only available once assembly is done on the client side (as opposed to S3, where reassembly is done by S3).
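To make the second consideration concrete, here is a rough sketch of what client-side assembly might look like. It assumes a hypothetical chunk index with `digest`, `size`, and `offset` fields and a hypothetical `fetch_chunk(digest)` callback standing in for a blob pull; neither is defined by the proposal. An in-memory dict plays the role of the registry.

```python
import hashlib


def reassemble(index, fetch_chunk):
    """Rebuild the artifact from a chunk index, verifying every chunk.

    fetch_chunk(digest) -> bytes is a hypothetical stand-in for a blob pull.
    """
    total = sum(entry["size"] for entry in index["chunks"])
    out = bytearray(total)
    for entry in index["chunks"]:
        chunk = fetch_chunk(entry["digest"])
        actual = "sha256:" + hashlib.sha256(chunk).hexdigest()
        if actual != entry["digest"] or len(chunk) != entry["size"]:
            raise ValueError(
                f"chunk at offset {entry['offset']} failed verification")
        # Place the verified chunk at its recorded offset.
        start = entry["offset"]
        out[start:start + entry["size"]] = chunk
    return bytes(out)


# In-memory dict standing in for the registry's blob store.
original = b"hello " * 500
store = {}
entries = []
for off in range(0, len(original), 1024):
    chunk = original[off:off + 1024]
    digest = "sha256:" + hashlib.sha256(chunk).hexdigest()
    store[digest] = chunk
    entries.append({"digest": digest, "size": len(chunk), "offset": off})

index = {"chunks": entries}
restored = reassemble(index, store.__getitem__)
```

Because each chunk carries its own digest and offset, the loop could fetch chunks in any order or in parallel. Only the client ever holds the fully assembled artifact.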
