[Docs] Provide documentation on where blob storage is accessed and which keys are being used #4881

Open
2 tasks done
Tracked by #5783
ddl-ebrown opened this issue Feb 12, 2024 · 4 comments · May be fixed by #5931
Labels
backlogged For internal use. Reserved for contributor team workflow. documentation Improvements or additions to documentation hacktoberfest

Comments

@ddl-ebrown
Contributor

Description

Several Flyte services and tools require blob storage access, and for folks new to the project, it would be nice to have a few notes on the internals of how the API calls happen. The initial question I raised in Slack at https://flyte-org.slack.com/archives/CP2HDHKE1/p1707508706335559 was:

Is there a good graph of the sequence of API calls between `pyflyte run --remote` / flyteadmin and blob storage endpoints? I'm mostly trying to have a good picture around when `CreateUploadLocation` gets called /used, etc

Some of this is documented in https://docs.flyte.org/en/latest/concepts/data_management.html, but it would be nice to extend that to include:

  • how `pyflyte run --remote` calls FlyteAdmin, which calls blob storage to get a signed URL and returns it to the client so that the client can upload
  • when inputs / outputs are downloaded / uploaded, and with which credentials, for native Flyte tasks
  • when inputs / outputs are downloaded / uploaded for raw container tasks (i.e. by copilot)
  • the default config vs. other config mechanisms
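
To make the first bullet concrete, here is a minimal sketch of the signed-URL handoff, assuming a toy HMAC signature in place of a real cloud provider's presigning. Only `CreateUploadLocation` is a name from this thread; `ToyAdmin`, `ToyBlobStore`, `sign`, and the `blob.example.invalid` host are hypothetical and are not Flyte code. The point it illustrates is that only the admin side holds the storage secret, while the client uploads with nothing but the URL.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical secret: in this sketch only the admin service and the store know it.
STORAGE_SECRET = b"held-by-admin-only"


def sign(path: str, expires: int, secret: bytes) -> str:
    """Toy signature over method, path, and expiry (not a real cloud scheme)."""
    message = f"PUT\n{path}\n{expires}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


class ToyBlobStore:
    """Accepts an upload only if the URL carries a valid, unexpired signature."""

    def __init__(self, secret: bytes) -> None:
        self._secret = secret
        self.objects = {}

    def put(self, url: str, body: bytes, now=None) -> bool:
        now = time.time() if now is None else now
        parsed = urlparse(url)
        query = parse_qs(parsed.query)
        expires = int(query["expires"][0])
        expected = sign(parsed.path, expires, self._secret)
        if now > expires or not hmac.compare_digest(query["sig"][0], expected):
            return False
        self.objects[parsed.path] = body
        return True


class ToyAdmin:
    """Stands in for the service that answers a CreateUploadLocation-style call."""

    def __init__(self, secret: bytes, ttl_seconds: int = 300) -> None:
        self._secret = secret
        self._ttl = ttl_seconds

    def create_upload_location(self, project, domain, filename, now=None) -> str:
        now = time.time() if now is None else now
        path = f"/{project}/{domain}/{filename}"
        expires = int(now) + self._ttl
        query = urlencode({"expires": expires, "sig": sign(path, expires, self._secret)})
        return f"https://blob.example.invalid{path}?{query}"


if __name__ == "__main__":
    admin = ToyAdmin(STORAGE_SECRET)
    store = ToyBlobStore(STORAGE_SECRET)
    # The client never sees STORAGE_SECRET, only the signed URL.
    url = admin.create_upload_location("proj", "dev", "code.tar.gz")
    print("upload accepted:", store.put(url, b"packaged code"))
```

In this sketch the client's own credentials never touch blob storage; which identity the real services use for signing is exactly the detail this issue asks the docs to spell out.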

While Flyte has a good default approach of using IRSA in AWS, that might not always be the best fit -- so it's helpful to know some of the specifics around blob storage access patterns so that the right solution can be employed.

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@ddl-ebrown ddl-ebrown added documentation Improvements or additions to documentation untriaged This issues has not yet been looked at by the Maintainers labels Feb 12, 2024

welcome bot commented Feb 12, 2024

Thank you for opening your first issue here! 🛠


dosubot bot commented Feb 12, 2024

🤖

Hey @ddl-ebrown! Great to see you diving deeper into the workings of Flyte. Your curiosity really adds value to the community!

Thank you for your detailed request. I'll try to provide some insights based on the information available in the Flyte codebase and documentation.

  1. pyflyte run --remote calls FlyteAdmin, calls blob storage to get a signed url and returns it to the client so that it can upload: The signed URL is generated by the Get function of the AWSRemoteURL struct. You can find more details in the flyteadmin/pkg/data/implementations/aws_remote_url.go file.

  2. When inputs / outputs are downloaded / uploaded and using what creds for native Flyte tasks: In Flyte, the process of downloading and uploading inputs and outputs for native tasks is managed by the CoreIOStrategy class. This class has two main attributes: download_mode and upload_mode. These attributes are instances of the IOStrategyDownloadMode and IOStrategyUploadMode classes respectively. The credentials used in this process would typically be managed by the underlying data storage or transfer service being used, and would need to be configured separately. You can find more details in the flyteidl/gen/pb_python/flyteidl/service/flyteadmin/flyteadmin/models/core_io_strategy.py file.

  3. When inputs / outputs are downloaded / uploaded for raw container tasks (i.e. by copilot): The context does not provide information on this. It would be helpful if you could provide more details on what you mean by "raw container tasks" and "copilot".

  4. Default config vs. other config mechanisms: Flyte services and tools read their blob storage settings from the Config struct in the flytestdlib/storage/config.go file. In that struct, the Type field sets the type of storage to configure, the Connection field defines the connection configuration, and the Stow field defines the configuration for the Stow backend.
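
As a rough illustration of the three fields named in item 4, here is a Python sketch that mirrors only their shape. The field names Type, Connection, and Stow come from the comment above; every sub-field (`endpoint`, `region`, `kind`, `config`), every default value, and the `backend` helper are assumptions made for illustration and should be checked against flytestdlib/storage/config.go before being relied on.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ConnectionSketch:
    # Assumed sub-fields; the real Connection struct may differ.
    endpoint: str = ""
    region: str = ""


@dataclass
class StowSketch:
    # Assumed sub-fields; the real Stow struct may differ.
    kind: str = ""
    config: Dict[str, str] = field(default_factory=dict)


@dataclass
class StorageConfigSketch:
    type: str = "s3"  # assumed default, for illustration only
    connection: ConnectionSketch = field(default_factory=ConnectionSketch)
    stow: Optional[StowSketch] = None

    def backend(self) -> str:
        """Hypothetical helper: report which backend this config selects."""
        if self.type == "stow":
            if self.stow is None or not self.stow.kind:
                raise ValueError("type 'stow' needs a stow section with a kind")
            return self.stow.kind
        return self.type


if __name__ == "__main__":
    default_cfg = StorageConfigSketch()
    stow_cfg = StorageConfigSketch(type="stow", stow=StowSketch(kind="google"))
    print(default_cfg.backend(), stow_cfg.backend())
```

The sketch only shows how a top-level type can either name a backend directly or defer to a separate Stow section; it does not describe how credentials are resolved.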

I hope this information is helpful. If you have any further questions or need more details, please let me know.


About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@eapolinario eapolinario added backlogged For internal use. Reserved for contributor team workflow. and removed untriaged This issues has not yet been looked at by the Maintainers labels Feb 15, 2024
@Ritesh-Udgata

Please assign me to this issue.

@davidmirror-ops
Contributor

@Ritesh-Udgata let us know any questions you may have soon, as Hacktoberfest is closing on Thursday. Thank you for volunteering!

@Ritesh-Udgata Ritesh-Udgata linked a pull request Oct 28, 2024 that will close this issue