Adjustable limit on the number of locations per storage config #1068

Open · eric-maynard wants to merge 26 commits into main from adjust-location-limit

Conversation

@eric-maynard (Contributor) commented Feb 25, 2025:

Currently, the number of locations allowed in a storage configuration is hard-coded per cloud provider. In practice, Polaris avoids building a policy that uses every allowed location at once, so there is little need to cap this number: only if a single table spanned N locations would a request to (e.g.) STS actually include a policy with N locations.

This PR introduces a config to adjust that limit, and also raises the default.
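To illustrate the point about policy size, here is a minimal, hypothetical sketch (not Polaris's actual subscoping code) of why a vended-credential policy only grows with the locations of the table being accessed, never with the catalog's full allowed-locations list. All class, method, and path names below are made up.

import java.util.List;
import java.util.stream.Collectors;

public class SubscopedPolicySketch {

  // Builds an IAM-style session policy that covers only the given table locations.
  static String buildSessionPolicy(List<String> tableLocations) {
    String resources =
        tableLocations.stream()
            .map(loc -> "\"arn:aws:s3:::" + loc.replaceFirst("^s3://", "") + "/*\"")
            .collect(Collectors.joining(","));
    return "{\"Version\":\"2012-10-17\",\"Statement\":[{\"Effect\":\"Allow\","
        + "\"Action\":[\"s3:GetObject\",\"s3:PutObject\"],"
        + "\"Resource\":[" + resources + "]}]}";
  }

  public static void main(String[] args) {
    // The catalog may allow many locations, but a single table typically spans only a few,
    // so the generated policy stays small regardless of the catalog-level limit.
    List<String> tableLocations =
        List.of("s3://bucket/warehouse/db/tbl", "s3://bucket/warehouse/db/tbl/data");
    System.out.println(buildSessionPolicy(tableLocations));
  }
}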

Comment on lines 33 to 34
public class AzureStorageConfigurationInfo extends PolarisStorageConfigurationInfo {
// technically there is no limitation since expectation for Azure locations are for the same
// storage account and same container
@JsonIgnore private static final int MAX_ALLOWED_LOCATIONS = 20;

Contributor:
Since the 3 cloud providers had different values, how about different configuration params?
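For illustration, a hedged sketch of what per-provider parameters could look like, mirroring the builder used by the STORAGE_CONFIGURATION_MAX_LOCATIONS constant elsewhere in this PR. The key names, the default values, and the build() terminator are assumptions for the sake of the example, not existing Polaris configuration.

// Hypothetical per-provider limits; key names, defaults, and build() are assumed.
public static final PolarisConfiguration<Integer> AWS_STORAGE_CONFIGURATION_MAX_LOCATIONS =
    PolarisConfiguration.<Integer>builder()
        .key("AWS_STORAGE_CONFIGURATION_MAX_LOCATIONS")
        .description("How many locations an AWS storage configuration may have")
        .defaultValue(5) // placeholder default
        .build();

public static final PolarisConfiguration<Integer> AZURE_STORAGE_CONFIGURATION_MAX_LOCATIONS =
    PolarisConfiguration.<Integer>builder()
        .key("AZURE_STORAGE_CONFIGURATION_MAX_LOCATIONS")
        .description("How many locations an Azure storage configuration may have")
        .defaultValue(20) // matches the hard-coded Azure value shown above
        .build();

public static final PolarisConfiguration<Integer> GCP_STORAGE_CONFIGURATION_MAX_LOCATIONS =
    PolarisConfiguration.<Integer>builder()
        .key("GCP_STORAGE_CONFIGURATION_MAX_LOCATIONS")
        .description("How many locations a GCP storage configuration may have")
        .defaultValue(20) // placeholder default
        .build();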

@eric-maynard (author):
My understanding is that currently it's impossible (or very difficult) to hit such a limit on any of the cloud providers, because requests to the subscoping service only pertain to whatever table is being accessed.

In the future, if tables really could have so many locations that these limits become possible to hit, we may need different limits for different clouds. But I suspect we would also need separate limits for the number of locations a catalog can have vs. the number a table can have, so any cloud-specific values we add at that time wouldn't conflict with the catalog-level config being added here.

Contributor:
See my comment here: #1068 (comment)

PolarisConfiguration.<Integer>builder()
.key("STORAGE_CONFIGURATION_MAX_LOCATIONS")
.description("How many locations can be associated with a storage configuration")
.defaultValue(20)
Contributor:
Do we need a limit at the catalog level if a table can only have three locations, specified by the properties location, write.data.path, and write.metadata.path?
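For context, the three properties mentioned above look roughly like this on an Iceberg table. This is a sketch using the standard Iceberg Java API; the paths and the wrapper class are made up.

import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

// Illustrative only: the location-related settings a typical Iceberg table carries.
class TableLocationExample {
  static void setWritePaths(Table table) {
    table
        .updateProperties()
        // write.data.path
        .set(TableProperties.WRITE_DATA_LOCATION, "s3://bucket/warehouse/db/tbl/data")
        // write.metadata.path
        .set(TableProperties.WRITE_METADATA_LOCATION, "s3://bucket/warehouse/db/tbl/metadata")
        .commit();
    // The third location, "location", is the table's base location: table.location(),
    // fixed when the table is created.
  }
}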

@eric-maynard (author):
Currently we don't use this value for those locations; this is a limit on the allowed-locations a catalog can have

@flyrain (Contributor) commented Feb 26, 2025:
Sorry, I didn't express it clearly originally. My question is why we need a limit on the allowed-locations of a catalog at all. Per our offline discussion, IIUC, the limit exists so that the table-level policy used by credential vending doesn't become so long that it exceeds a certain size limit. Given that a table's locations are limited even when its catalog has a large number of allowed-locations, that doesn't seem to be an issue, and there's no reason to impose a limit. Am I understanding correctly?

@eric-maynard (author):
That's right. I would be okay with removing the limit entirely, but on the other hand @collado-mike is raising the possibility of having 3 limits. I think preserving at least one limit, so we keep the concept of a limit, may be helpful in the future, especially if we do push storage configs down to the table level.

Contributor:
Even with storage configs pushed down to the table level, the chance of an unbounded number of table locations is quite small.

  1. For the majority of Iceberg use cases within Polaris, writers can only use the three locations specified by the properties location, write.data.path, and write.metadata.path.
  2. For the migration use case, admittedly there could be more than the 3 locations mentioned above. However, users should be aware of the number of locations during migration and add them to the table-level storage configs. At that point we can enforce it by saying "that's too many locations, credential vending won't work." The limit seems better placed at the table level, as locations from different tables may not overlap.

In short, a limit at the catalog level doesn't seem necessary now and may not be effective in the future. I'd consider removing it, but I'm open to being convinced by other use cases.
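A hedged sketch of what such table-level enforcement could look like; the class, method, and limit below are hypothetical and not part of this PR.

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical table-level check: count the table's distinct locations and refuse
// credential vending with a clear message once a cap is exceeded.
class TableLocationLimitSketch {
  static void checkTableLocations(List<String> tableLocations, int maxLocations) {
    Set<String> distinct = new LinkedHashSet<>(tableLocations);
    if (distinct.size() > maxLocations) {
      throw new IllegalArgumentException(
          "Table has " + distinct.size() + " locations; credential vending supports at most "
              + maxLocations + ".");
    }
  }
}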

Contributor:
I would go with no limit. Current users should be fine, since we'd only be relaxing the limit; nothing should break. If users start asking for a limit for whatever reason, we can think about adding a limit config at that time. WDYT?

@eric-maynard (author):
Updated the PR to simply have no config for now 👍

Contributor:
As a matter of best practice, it is very nearly always a good idea to make behavior changes behind a config flag. It's very easy to add a config that allows replicating the exact behavior we see today and then to remove that config if users are happy with the proposed change. What's not easy to do is to bring that behavior back to existing deployments once the code is ripped out. Making small, incremental changes to ensure we understand the unintended side effects seems pretty uncontroversial to me. I think Yufei's suggestion to keep the config but mark it deprecated is reasonable and allows us the opportunity to make changes carefully. Why is that controversial?

@eric-maynard (author):
Okay, so is everyone okay with 1 config now? If so, I will go ahead with restoring the PR to its state before this commit.

I'm not sure what exactly it means to mark a config as "deprecated", but we cannot easily remove a config once we've added one.

Contributor:
I'm OK to 'disagree and commit' to supporting one config, but I think we ought to at least have one config that defaults to the current behavior and allows users to increase it as they see fit.

@flyrain (Contributor) left a comment:
+1. As a followup, we can provide an error message when the policy/rule exceeds the length limit; we may need to change the method getSubscopedCreds() a bit.
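A hedged sketch of what that followup could look like. The class and method are stand-ins rather than Polaris's real getSubscopedCreds() code, and the 2,048-character figure is AWS's documented cap for inline session policies, used here only as an example threshold.

// Hypothetical check run before requesting subscoped credentials, so users get a clear
// error instead of an opaque failure from the cloud provider.
class PolicyLengthCheckSketch {
  // Example threshold: AWS limits inline session policies passed to AssumeRole to ~2,048 characters.
  private static final int MAX_SESSION_POLICY_CHARS = 2048;

  static void checkPolicyLength(String sessionPolicyJson) {
    if (sessionPolicyJson.length() > MAX_SESSION_POLICY_CHARS) {
      throw new IllegalArgumentException(
          "Credential-vending policy is " + sessionPolicyJson.length()
              + " characters, which exceeds the provider limit of " + MAX_SESSION_POLICY_CHARS
              + "; reduce the number of locations in the storage configuration.");
    }
  }
}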

@flyrain (Contributor) commented Mar 4, 2025:
Looks like the unit test testExceedMaxAllowedLocations failed. We need to change that.

eric-maynard changed the title from "Adjustable limit on the number of locations per storage config" to "Remove the limit on the number of locations per storage config" on Mar 4, 2025
@pavibhai commented Mar 5, 2025:
@eric-maynard Thanks for the changes.

I have a question about skipping the configuration of allowed locations and prefix validation. Is this in scope for this PR, and if so, does it not require any change in the prefix validation logic?

If I am not mistaken, we need handling in InMemoryStorageIntegration:validateSubpathsOfAllowedLocations.
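For reference, a hedged sketch of the kind of handling being asked about. This is not the actual InMemoryStorageIntegration code, and treating an empty allowed-locations list as "no restriction" is only one possible interpretation that the PR would need to pin down.

import java.util.List;

// Hypothetical prefix validation: every requested path must sit under some allowed location.
class PrefixValidationSketch {
  static boolean subpathsAllowed(List<String> requestedPaths, List<String> allowedLocations) {
    // Assumption: an empty allowed list means "no restriction".
    if (allowedLocations.isEmpty()) {
      return true;
    }
    // A real implementation would normalize trailing slashes to avoid prefix collisions
    // such as "s3://bucket/foo" matching "s3://bucket/foobar".
    return requestedPaths.stream()
        .allMatch(path -> allowedLocations.stream().anyMatch(path::startsWith));
  }
}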

eric-maynard force-pushed the adjust-location-limit branch from 85b2f86 to f2c5d04 on March 5, 2025, 03:10
eric-maynard changed the title from "Remove the limit on the number of locations per storage config" back to "Adjustable limit on the number of locations per storage config" on Mar 5, 2025
@eric-maynard (author):
Adjusted the PR back to having a config; fixed a test

eric-maynard enabled auto-merge (squash) on March 5, 2025, 18:30
eric-maynard disabled auto-merge on March 5, 2025, 18:57
@eric-maynard (author):
Holding this for a moment to potentially rebase onto #1124
