Adjustable limit on the number of locations per storage config #1068
public class AzureStorageConfigurationInfo extends PolarisStorageConfigurationInfo {
  // technically there is no limitation since expectation for Azure locations are for the same
  // storage account and same container
  @JsonIgnore private static final int MAX_ALLOWED_LOCATIONS = 20;
Since the 3 cloud providers had different values, how about different configuration params?
My understanding is that currently it's impossible (or very difficult) to hit a limitation on any of the cloud providers due to the fact that requests to the subscoping service only pertain to whatever table is being accessed.
In the future, if tables really could have so many locations that these limits become possible to hit, we may need different limits for different clouds. But I suspect that we would also need different limits for the number of locations that a catalog can have vs. the number of locations a table can have, so those cloud-specific values we might add at that time wouldn't conflict with the catalog-level config being added here.
See my comment here: #1068 (comment)
PolarisConfiguration.<Integer>builder()
    .key("STORAGE_CONFIGURATION_MAX_LOCATIONS")
    .description("How many locations can be associated with a storage configuration")
    .defaultValue(20)
Do we need a limit at the catalog level if a table can only use the three locations specified by the properties location, write.data.path, and write.metadata.path?
Currently we don't use this value for those locations; this is a limit on the allowed-locations a catalog can have
Sorry I didn't express it clearly originally. My question is why we need a limit on the allowed-locations of a catalog. Per our offline discussion, IIUC, it's needed for the table-level policy used by credential vending, so that the policy text doesn't grow long enough to exceed a certain limit. Given that table locations are limited even when the catalog has a large number of allowed-locations, this doesn't seem to be an issue, and there is no reason to impose a limit. Am I understanding correctly?
That's right. I would be okay with removing a limit totally, but on the other hand @collado-mike is mentioning the possibility of having 3 limits. I think preserving at least one limit, so we have the concept of a limit, may be helpful in the future especially if we do push storage configs down to the table level.
Even with storage configs pushed down to the table level, the chance of an unbounded number of table locations is quite small.
- For the majority of Iceberg use cases within Polaris, writers can only use the three locations specified by the properties location, write.data.path, and write.metadata.path.
- For the migration use case, admittedly there may be more than the 3 locations mentioned above. However, users should be aware of the number of locations during migration, and add them to the table-level storage configs. At that time, we can enforce it by saying "that's too many locations, credential vending won't work." The limit seems better at the table level, as locations from different tables may not overlap.
In short, a limit at the catalog level doesn't seem necessary now, and may not be effective in the future. I'd consider removing it. But I'm open to being convinced by other use cases.
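To make the "at most three locations" point concrete, here is a minimal sketch of how the locations a writer can touch might be gathered for one table. The class and method names (TableLocations, collect) are hypothetical and are not Polaris code; only the two write.* property keys come from the discussion, and treating the base location as a separate argument is an assumption of this sketch.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Polaris.
final class TableLocations {
  // Property keys named in the discussion above.
  static final String WRITE_DATA_PATH = "write.data.path";
  static final String WRITE_METADATA_PATH = "write.metadata.path";

  // Returns the distinct locations for one table: the base location plus
  // any data/metadata path overrides found in the table properties.
  static Set<String> collect(String baseLocation, Map<String, String> properties) {
    Set<String> locations = new LinkedHashSet<>();
    locations.add(baseLocation);
    String dataPath = properties.get(WRITE_DATA_PATH);
    if (dataPath != null) {
      locations.add(dataPath);
    }
    String metadataPath = properties.get(WRITE_METADATA_PATH);
    if (metadataPath != null) {
      locations.add(metadataPath);
    }
    return locations;
  }
}
```

Under these assumptions the result never holds more than three entries, regardless of how many allowed-locations the catalog has.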
I will go with no limit. I think the current users should be fine as we relax the limit by doing so. Nothing should be broken. If users start to ask for a limit due to whatever reason, then we can think of adding a limit config at that time. WDYT?
Updated the PR to simply have no config for now 👍
As a matter of best practice, it is very nearly always a good idea to make behavior changes behind a config flag. It's very easy to add a config that allows replicating the exact behavior we see today and then to remove that config if users are happy with the proposed change. What's not easy to do is to bring that behavior back to existing deployments once the code is ripped out. Making small, incremental changes to ensure we understand the unintended side effects seems pretty uncontroversial to me. I think Yufei's suggestion to keep the config but mark it deprecated is reasonable and allows us the opportunity to make changes carefully. Why is that controversial?
Okay, so is everyone okay with 1 config now? If so, I will go ahead with restoring the PR to its state before this commit.
I'm not sure what exactly it means to mark a config as "deprecated", but we cannot easily remove a config once we've added one.
I'm ok to 'disagree and commit' to support one config, but I think we ought to at least have one config that defaults to the current behavior and allows users to increase it as they see fit.
+1. As a followup, we can provide an error message when the policy/rule exceeds the length limit; we may need to change the method getSubscopedCreds() a bit.
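A rough sketch of what such a followup check could look like is below. It is not the actual getSubscopedCreds() code: the class name PolicyLengthCheck, the method requireWithinLimit, and passing the maximum length as a parameter are all assumptions made for illustration, since the real limit depends on the cloud provider.

```java
// Hypothetical guard for illustration only; not part of Polaris.
final class PolicyLengthCheck {
  // Throws with a descriptive message if the generated policy text is longer
  // than maxLength characters; otherwise returns normally.
  static void requireWithinLimit(String policyText, int maxLength) {
    if (policyText.length() > maxLength) {
      throw new IllegalArgumentException(
          String.format(
              "Generated policy is %d characters, which exceeds the limit of %d; "
                  + "reduce the number of locations for this table",
              policyText.length(), maxLength));
    }
  }
}
```

Failing early with a message like this would tell the user why credential vending cannot proceed, instead of surfacing an opaque error from the cloud provider.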
Looks like this unit test failed,
@eric-maynard Thanks for the changes. I have a question with respect to skipping the configuration of allowed locations and prefix validation. Is this in scope for this PR, and if so, does this not require any change in the prefix validation logic? If I am not mistaken we need handling in InMemoryStorageIntegration:validateSubpathsOfAllowedLocations
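For readers unfamiliar with the prefix validation being referred to, the general idea can be sketched as follows. This is not the logic of InMemoryStorageIntegration; the class AllowedLocationCheck and its method are hypothetical, and the trailing-slash normalization is an assumption of this sketch.

```java
import java.util.List;

// Hypothetical prefix check for illustration only; not part of Polaris.
final class AllowedLocationCheck {
  // Appends "/" when missing so that "s3://bucket/a" does not match "s3://bucket/ab".
  private static String withTrailingSlash(String location) {
    return location.endsWith("/") ? location : location + "/";
  }

  // Returns true if the location equals, or is nested under, any allowed location.
  static boolean isUnderAllowedLocation(String location, List<String> allowedLocations) {
    String candidate = withTrailingSlash(location);
    for (String allowed : allowedLocations) {
      if (candidate.startsWith(withTrailingSlash(allowed))) {
        return true;
      }
    }
    return false;
  }
}
```

A check of this shape scales with the number of allowed locations, which is one reason the question of whether the list is bounded matters for validation as well as for policy size.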
85b2f86 to f2c5d04
Adjusted the PR back to having a config; fixed a test
…olaris into adjust-location-limit
Holding this for a moment to potentially rebase onto #1124
Currently, the number of locations allowed in a storage configuration is hard-coded per cloud provider. In fact, Polaris avoids building a policy that uses every location at once and so there isn't a need to limit this number. Only if a table spanned N locations would a request to (e.g.) STS actually include a policy that has N locations.
This PR introduces a config to adjust that limit, and also raises the default.
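As a minimal sketch of what a config-driven limit could look like once the per-provider constants are replaced: the class LocationLimit and its validate method are hypothetical, and the convention that a negative value disables the check is an assumption of this sketch, not something stated in the PR.

```java
import java.util.List;

// Hypothetical validation for illustration only; not part of Polaris.
final class LocationLimit {
  // Rejects the list if it has more entries than maxLocations.
  // Assumption: a negative maxLocations means "no limit".
  static void validate(List<String> allowedLocations, int maxLocations) {
    if (maxLocations >= 0 && allowedLocations.size() > maxLocations) {
      throw new IllegalArgumentException(
          String.format(
              "Number of allowed locations (%d) exceeds the configured limit (%d)",
              allowedLocations.size(), maxLocations));
    }
  }
}
```

In this shape the same code path serves every cloud provider, with the value read from a single setting such as STORAGE_CONFIGURATION_MAX_LOCATIONS.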