[FEATURE REQUEST #32] On-Premises S3 / S3 Compatible... #389

Open · wants to merge 7 commits into main

Conversation

@lefebsy commented Oct 21, 2024

Description (edited):

This is a proposal for an S3-compatible implementation of Polaris core storage: a copy of the AWS implementation plus new parameters (endpoint, path-style access, ...).

It is tested OK with MinIO (see regtests/run_spark_sql_s3compatible.sh) and should work with many S3-compatible solutions like Dell ECS, NetApp StorageGRID, etc.

  • By default it tries to respect the same credential behavior as AWS (IAM/STS). The same dynamic policy is applied, limiting the scope to the data queried.

  • Otherwise, if STS is not available, setting 'skipCredentialSubscopingIndirection' = true disables the Polaris "sub-scoping" of the credentials.

Let me know your opinion about this design proposal.
Thank you

# note: '-d @-' reads the JSON body from stdin, so the heredoc expands ${S3_LOCATION}
curl -X POST -H "Authorization: Bearer ${SPARK_BEARER_TOKEN}" \
     -H 'Accept: application/json' -H 'Content-Type: application/json' \
     http://${POLARIS_HOST}:8181/api/management/v1/catalogs -d @- <<EOF
      {
          "name": "my-s3compatible-catalog",
          "id": 100,
          "type": "INTERNAL",
          "readOnly": false,
          "properties": {
            "default-base-location": "${S3_LOCATION}"
          },
          "storageConfigInfo": {
            "storageType": "S3_COMPATIBLE",
            "allowedLocations": ["${S3_LOCATION}/"],
            "s3.endpoint": "https://localhost:9000"
          }
        }
EOF
            # Optional keys that go inside "storageConfigInfo":
            "s3.pathStyleAccess": true            # defaults to false
            "s3.region": "rack-1"                 # or e.g. "us-east-1"
            "s3.roleArn": "arn:xxx:xxx:xxx:xxxx"
            "s3.credentials.catalog.accessKeyId": "CATALOG_1_ACCESS_KEY_ENV_VARIABLE_NAME"
            "s3.credentials.catalog.secretAccessKey": "CATALOG_1_SECRET_KEY_ENV_VARIABLE_NAME"
            # Optional, in case STS/IAM is not available:
            "skipCredentialSubscopingIndirection": true
                # Optional with skipCredentialSubscopingIndirection; served by the catalog to clients such as Spark or Trino:
                "s3.credentials.client.accessKeyId": "CLIENT_OF_CATALOG_1_ACCESS_KEY_ENV_VARIABLE_NAME"
                "s3.credentials.client.secretAccessKey": "CLIENT_OF_CATALOG_1_SECRET_KEY_ENV_VARIABLE_NAME"

Included Changes:

  • New type of storage "S3_COMPATIBLE".
  • Tested against MinIO with self-signed certificate
  • regtests/run_spark_sql_s3compatible.sh

Type of change:

  • Bug fix (non-breaking change which fixes an issue)
  • Documentation update
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Checklist:

Please delete options that are not relevant.

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • If adding new functionality, I have discussed my implementation with the community using the linked GitHub issue

@lefebsy lefebsy changed the title add s3 compatible storage - first commit [FEATURE REQUEST] On-Premise S3... #32 Oct 21, 2024
@lefebsy lefebsy changed the title [FEATURE REQUEST] On-Premise S3... #32 [FEATURE REQUEST #32] On-Premise S3... Oct 21, 2024
@lefebsy lefebsy changed the title [FEATURE REQUEST #32] On-Premise S3... [FEATURE REQUEST #32] On-Premises S3... Oct 21, 2024

@@ -23,6 +23,8 @@ public enum PolarisCredentialProperty {
AWS_KEY_ID(String.class, "s3.access-key-id", "the aws access key id"),
AWS_SECRET_KEY(String.class, "s3.secret-access-key", "the aws access key secret"),
AWS_TOKEN(String.class, "s3.session-token", "the aws scoped access token"),
AWS_ENDPOINT(String.class, "s3.endpoint", "the aws s3 endpoint"),
AWS_PATH_STYLE_ACCESS(Boolean.class, "s3.path-style-access", "the aws s3 path style access"),
Contributor
whether or not to use path-style access

Author

Many S3-compatible solutions are deployed without network devices or configuration in front of them that would support dynamic hostnames including bucket names. TLS certificates from a private CA can also be a challenge with dynamic hostnames, and wildcard "*.domain" certificates may be forbidden by some enterprise security policies.

Path style is useful in many cases. In an ideal world, I agree, it should stay deprecated...

Contributor

I think @eric-maynard was asking if we can change the description to something like this:

Suggested change
AWS_PATH_STYLE_ACCESS(Boolean.class, "s3.path-style-access", "the aws s3 path style access"),
AWS_PATH_STYLE_ACCESS(Boolean.class, "s3.path-style-access", "whether or not to use path-style access"),

I also agree that we should make sure it is false by default.


For IBM's watsonx.data product it is set to true by default for MinIO and Ceph bucket types, the reason being that it's more likely to work. Path style will work regardless of whether the customer has set up wildcard DNS, a TLS certificate with a subject alternative name (the wildcard), and the hostname in the zonegroup (for Ceph). Virtual-host style will only work if all of those things are done.

It's not a hill I would die on, but it's worthy of consideration.
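
For context, this is the practical difference with the AWS SDK v2; a minimal sketch against a generic S3-compatible endpoint (placeholder host and keys, not code from this PR):

    import java.net.URI;

    import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
    import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
    import software.amazon.awssdk.regions.Region;
    import software.amazon.awssdk.services.s3.S3Client;

    public class PathStyleSketch {
      public static void main(String[] args) {
        // Path style:    https://s3.example.com:9000/my-bucket/my-key
        // Virtual host:  https://my-bucket.s3.example.com:9000/my-key
        //                (needs wildcard DNS and a TLS cert with a wildcard SAN)
        try (S3Client s3 = S3Client.builder()
            .endpointOverride(URI.create("https://s3.example.com:9000"))
            .region(Region.US_EAST_1) // many S3-compatible stores accept any region
            .credentialsProvider(StaticCredentialsProvider.create(
                AwsBasicCredentials.create("accessKey", "secretKey")))
            .forcePathStyle(true) // keep the bucket in the path, not the hostname
            .build()) {
          // use the client...
        }
      }
    }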

Contributor
To clarify, I was referring to the default at the Polaris level. Users always have the option to set it to true when creating a new catalog.

@collado-mike (Contributor) left a comment

Most of the code here seems like it should be delegated to either catalog-level properties (to init the FileIO with the right endpoint) or customizations to the way we construct the STS builder. I don't think a separate S3_COMPATIBLE StorageIntegration really needs to exist, with the only distinct feature here being the vending of raw credentials.

Personally, I don't think we ought to be vending raw credentials at all. I think that either the service should be able to send sub-scoped, time-bound credentials or, if that's not possible, the query engine ought to have direct access to the long-lived credentials through some other means.

But regardless of whether or not we vend raw credentials, allowing the user to simply declare whatever arbitrary environment variables they want to read from the server and to then get them is 100% not a safe or secure way of implementing this.
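
For reference, the "sub-scoped, time-bound credentials" mentioned above are what STS AssumeRole with a session policy provides. A minimal AWS SDK v2 sketch (placeholder role ARN and session name, not code from this PR):

    import software.amazon.awssdk.services.sts.StsClient;
    import software.amazon.awssdk.services.sts.model.AssumeRoleRequest;
    import software.amazon.awssdk.services.sts.model.Credentials;

    public class SubscopedCredentialsSketch {
      // The session policy intersects with the role's permissions, so the vended
      // credentials can only reach the allowed locations, and they expire after
      // the requested duration.
      static Credentials vend(StsClient sts, String roleArn, String sessionPolicyJson) {
        return sts.assumeRole(AssumeRoleRequest.builder()
                .roleArn(roleArn)                 // e.g. the catalog's s3.roleArn
                .roleSessionName("polaris-subscoped")
                .policy(sessionPolicyJson)        // JSON limiting access to table locations
                .durationSeconds(3600)
                .build())
            .credentials();
      }
    }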

Comment on lines 26 to 27
AWS_ENDPOINT(String.class, "s3.endpoint", "the aws s3 endpoint"),
AWS_PATH_STYLE_ACCESS(Boolean.class, "s3.path-style-access", "the aws s3 path style access"),
Contributor

Typically, these are catalog-level properties, not credentials. I think it's best to avoid overloading the credential providers with the need to populate these configuration properties.

Comment on lines +84 to +71
propertiesMap.put(PolarisCredentialProperty.AWS_ENDPOINT, storageConfig.getS3Endpoint());
propertiesMap.put(
PolarisCredentialProperty.AWS_PATH_STYLE_ACCESS,
storageConfig.getS3PathStyleAccess().toString());
Contributor

I don't understand why you need them in this class at all. The catalog properties are used to initialize the FileIO in BasePolarisCatalog. The properties from this class are appended to those properties.

Comment on lines 82 to 85
if (storageConfig.getSkipCredentialSubscopingIndirection() == true) {
LOGGER.debug("S3Compatible - skipCredentialSubscopingIndirection !");
clI = System.getenv(storageConfig.getS3CredentialsClientAccessKeyId());
clS = System.getenv(storageConfig.getS3CredentialsClientSecretAccessKey());
Contributor

This is not safe. Now, if I have the privilege to create a catalog, I can simply set this skipCredentialSubscopingIndirection flag and immediately get access to the raw credentials used to assume any role for any other catalog in the service?

@lefebsy (Author), Feb 12, 2025

No, it is not related.

Comment on lines 122 to 123
StsClientBuilder stsBuilder = software.amazon.awssdk.services.sts.StsClient.builder();
stsBuilder.endpointOverride(URI.create(storageConfig.getS3Endpoint()));
Contributor

If I read this right, this is the only code that really needs to deviate from the default S3 credential vending process, right? We could realistically refactor the default STS client to be provided by a factory that takes in the storage config. Then we don't need all this duplication

@lefebsy (Author), Feb 12, 2025

Almost, yes.
If a catalog creator does not want to rely on the default, global Polaris service keys and wants dedicated keys for each catalog, adding an alternative credential system to this factory could be useful, no?

  • A security pattern: not putting all the eggs in one basket.
  • On-prem S3 buckets can also be managed with a security pattern where administrators generate different keys for each S3 user, associated with some buckets plus an assumeRole policy (I have this case in my company; it is considered a security-segregation pattern). In that case, a Polaris with central, unique keys would need to be over-instantiated :-( Managed at the catalog level, it allows keeping one Polaris service with many catalogs.

        // not using provider-built credentials from standard AWS env vars
        stsBuilder.credentialsProvider(
            StaticCredentialsProvider.create(AwsBasicCredentials.create(caI, caS)));
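
A minimal sketch of the factory idea discussed above (class and method names are hypothetical, not the Polaris API): the default AWS path keeps StsClient.builder() untouched, an S3-compatible config only overrides the endpoint, and dedicated per-catalog keys stay optional:

    import java.net.URI;

    import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
    import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
    import software.amazon.awssdk.services.sts.StsClient;
    import software.amazon.awssdk.services.sts.StsClientBuilder;

    public final class StsClientFactorySketch {
      public static StsClient create(String endpoint, String accessKeyId, String secretKey) {
        StsClientBuilder builder = StsClient.builder();
        if (endpoint != null) {
          builder.endpointOverride(URI.create(endpoint)); // S3-compatible STS endpoint
        }
        if (accessKeyId != null && secretKey != null) {
          // dedicated keys for this catalog instead of the global service credentials
          builder.credentialsProvider(StaticCredentialsProvider.create(
              AwsBasicCredentials.create(accessKeyId, secretKey)));
        }
        return builder.build(); // with no overrides this is the plain AWS default
      }
    }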

Comment on lines 39 to 44
private @Nullable String s3CredentialsCatalogAccessKeyId;
private @Nullable String s3CredentialsCatalogSecretAccessKey;
private @Nullable Boolean s3PathStyleAccess;
private @NotNull Boolean skipCredentialSubscopingIndirection;
private @Nullable String s3CredentialsClientAccessKeyId;
private @Nullable String s3CredentialsClientSecretAccessKey;
Contributor

What is the distinction between s3CredentialsCatalogAccessKeyId and s3CredentialsClientAccessKeyId? Rather than having users set the names of environment variables, can't we use AWS profiles? That's a heck of a lot safer than allowing users to put the name of any environment variable and read it directly from the server.

Author

Explained in the updated PR summary.
The idea was to use:

  • the 'catalog' keys only for Polaris S3 access and STS generation
  • the 'client' keys, served to Spark or Trino, when STS is not available.

Replacing the variable name with a profile name... I like it!
Yes, it could probably work, though the file is harder to integrate in a Kubernetes deployment: secrets are not allowed in a ConfigMap, so the whole file content would have to be stored as a secret and mounted as a volume, unless the profile file can interpolate variables for the secret keys...
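
For illustration, a named profile in the AWS shared credentials file maps to a provider roughly like this (a sketch; the profile name is an example):

    import software.amazon.awssdk.auth.credentials.AwsCredentials;
    import software.amazon.awssdk.auth.credentials.ProfileCredentialsProvider;

    public class ProfileSketch {
      public static void main(String[] args) {
        // ~/.aws/credentials (in Kubernetes, mounted as a secret volume, since
        // secret values are not allowed in a ConfigMap):
        //   [minio-catalog-1]
        //   aws_access_key_id = ...
        //   aws_secret_access_key = ...
        try (ProfileCredentialsProvider provider =
            ProfileCredentialsProvider.create("minio-catalog-1")) {
          AwsCredentials creds = provider.resolveCredentials();
          System.out.println(creds.accessKeyId());
        }
      }
    }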

@mmgaggle commented Feb 11, 2025

> Most of the code here seems like it should be delegated to either catalog-level properties (to init the FileIO with the right endpoint) or customizations to the way we construct the STS builder. I don't think a separate S3_COMPATIBLE StorageIntegration really needs to exist, with the only distinct feature here being the vending of raw credentials.

Is there a way to define a storage endpoint that is used for S3 and STS per catalog without this?

I also tend to agree about vending raw credentials, if a storage backend doesn't support IAM/STS and session policies, tough cookies. Vending scoped credentials is one of the marquee features for Polaris, and vending a static credential might give users a false sense of security.

@flyrain (Contributor) commented Feb 12, 2025

> Vending scoped credentials is one of the marquee features for Polaris, and vending a static credential might give users a false sense of security.

We definitely want to avoid that, because handing out raw credentials is pretty much the same as making them public—any Polaris user could easily access them.

@collado-mike (Contributor) commented

> Is there a way to define a storage endpoint that is used for S3 and STS per catalog without this?

We do this in one of the integ tests that spins up a mock S3 container - see https://github.com/apache/polaris/blob/main/integration-tests/src/main/java/org/apache/polaris/service/it/test/PolarisSparkIntegrationTest.java#L116-L134

@lefebsy (Author) commented Feb 13, 2025

> Thanks a lot, @lefebsy, for adding this! I believe this is a really useful feature—appreciate your effort. Apologies for the delay.
>
> I agree with @collado-mike that we should avoid vending raw credentials.

You're all welcome.
I can understand the motivation behind not vending raw credentials. (Even though I will personally miss it strongly in my business.)
I've prepared the Quarkus rebase, so I will add the easy suggestions from this review and of course keep only STS. I squashed some intermediate commits to be able to have a clean rebase since Quarkus.

The longer modifications suggested (profiles, refactoring the policy methods from AWS) will be a next step. Advice welcome :-)

lefebsy and others added 5 commits March 4, 2025 22:36
Better descriptions, typos & comments
Refactoring with skipCredentialSubscopingIndirection -> finally removed
Rebase with AWS updates from main branch adding roleArn, camelCase refactoring, typos, cleaning
Add default AWS credentials provider for STS
Error Co-authored-by: Gerrit-K <[email protected]>
Rebase from quarkus and keep only sts with some suggestions from code review
helm unit test
@mmgaggle commented Mar 4, 2025

Thank you for working on this!

@lefebsy (Author) commented Mar 4, 2025

Hello,

The last refactoring vends only STS.

  • Support for profiles to manage the credentials used by the catalog to communicate with S3.
  • Refactored the functions duplicated between 'Aws' and 'S3Compatible': they have been moved to StorageUtil, where they can be imported and used by 'S3Compatible', and also by 'Aws' if this location is adopted.
curl -X POST -H "Authorization: Bearer ${SPARK_BEARER_TOKEN}" \
     -H 'Accept: application/json' -H 'Content-Type: application/json' \
     http://${POLARIS_HOST}:8181/api/management/v1/catalogs -d @- <<EOF
      {
          "name": "my-s3compatible-catalog-1",
          "id": 100,
          "type": "INTERNAL",
          "readOnly": false,
          "properties": {
            "default-base-location": "${S3_LOCATION}"
          },
          "storageConfigInfo": {
            "storageType": "S3_COMPATIBLE",
            "allowedLocations": ["${S3_LOCATION}/"],
            "s3.endpoint": "https://localhost:9000"
          }
        }
EOF

As is, the AWS SDK will use all default values and settings available in the Polaris service to build the catalog's communication with the S3 endpoint.

Otherwise, indications can be given:

          # optional: indicate an AWS profile name
          "s3.profileName": "minio-catalog-1",

or

          # optional: indicate env variable names
          "s3.credentials.catalog.accessKeyEnvVar": "CATALOG_S3_KEY_ID_FOR_CATALOG_1",
          "s3.credentials.catalog.secretAccessKeyEnvVar": "CATALOG_S3_KEY_SECRET_FOR_CATALOG_1",

and if helpful

          # optional
          "s3.region": "region-1",
          "s3.pathStyleAccess": true,
          "s3.roleArn": "arn:xxx:xxx:xxx:xxx:xxx"

Test script (with a MinIO container) adapted to use a 'profile' instead of 'env vars':

    regtests/run_spark_sql_s3compatible.sh

@collado-mike (Contributor) left a comment

This is a much better impl, IMO. Thanks for the work - it's a feature a lot of folks want, so I'm happy to see it implemented.

Comment on lines +37 to +39
// 5 is the approximate max allowed locations for the size of AccessPolicy when LIST is required
// for allowed read and write locations for sub-scoping credentials.
@JsonIgnore private static final int MAX_ALLOWED_LOCATIONS = 5;
@flyrain (Contributor), Mar 5, 2025

I'd suggest removing it per the discussion here, #1068 (comment), so that we don't have to introduce extra config later.

Author

OK, I will rebase after #1068 is merged.

@flyrain (Contributor) left a comment

Thanks a lot for working on it. Getting close. Left some comments. Besides the server-side env variable settings, can we also have unit tests for the two new classes added?

// storing properties
super(storageType, allowedLocations);
validateMaxAllowedLocations(MAX_ALLOWED_LOCATIONS);
this.s3PathStyleAccess = s3PathStyleAccess;
Contributor

I refer to the field (s3PathStyleAccess) in this class, which can be set to null. In that case, storageConfig.getS3PathStyleAccess().toString() will throw an NPE.
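
One possible null-safe fix, reusing the names from the diff above (a sketch, not the PR's final code):

    // Treat a null s3PathStyleAccess as false instead of calling toString()
    // on a possibly-null Boolean.
    propertiesMap.put(
        PolarisCredentialProperty.AWS_PATH_STYLE_ACCESS,
        String.valueOf(Boolean.TRUE.equals(storageConfig.getS3PathStyleAccess())));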

lefebsy and others added 2 commits March 5, 2025 18:12
…ompatible/S3CompatibleCredentialsStorageIntegration.java

Co-authored-by: Yufei Gu <[email protected]>
…ompatible/S3CompatibleCredentialsStorageIntegration.java

Co-authored-by: Yufei Gu <[email protected]>
@lefebsy (Author) commented Mar 7, 2025

> I refer to the field (s3PathStyleAccess) in this class, which can be set to null. In that case, storageConfig.getS3PathStyleAccess().toString() will throw an NPE.

There is a default of "false" in the REST spec. But OK 👍

@flyrain (Contributor) commented Mar 7, 2025

> I refer to the field (s3PathStyleAccess) in this class, which can be set to null. In that case, storageConfig.getS3PathStyleAccess().toString() will throw an NPE.
>
> There is a default of "false" in the REST spec. But OK 👍

I agree that it's fine if the object is always created from REST. But the class itself could be constructed internally in the future; I don't want an NPE surprise in that case.

@lefebsy (Author) commented Mar 8, 2025

> Thanks a lot for working on it. Getting close. Left some comments. Besides the server-side env variable settings, can we also have unit tests for the two new classes added?

Is it OK for you if I do unit tests only for the config class and move to dockerized tests for the credentials class? I am really not good in the mock-code area.

I can easily add MinIO to the regtests docker-compose and enrich "regtests/t_spark_sql/".

My first try taught me not to name a script with "S3" in it, otherwise it's trapped by the AWS stop pattern and never runs.

@flyrain (Contributor) commented Mar 9, 2025

> Is it OK for you if I do unit tests only for the config class and move to dockerized tests for the credentials class? I am really not good in the mock-code area.
>
> I can easily add MinIO to the regtests docker-compose and enrich "regtests/t_spark_sql/".
>
> My first try taught me not to name a script with "S3" in it, otherwise it's trapped by the AWS stop pattern and never runs.

I'm OK with a follow-up PR to resolve the mock-related unit tests. We can get some help from the community. Can you file an issue to track that?
