Support for delta sharing profile via options #573

stevenayers · 2024-09-14T07:49:21Z

Add support for delta sharing profile via options (resolves #483).

Task List:

Add options to DeltaSharingOptions
Unit tests for DeltaSharingOptions
DeltaSharingProfile from DeltaSharingOptions
Unit tests for DeltaSharingProfileProvider
Conditional logic to create from options
Unit tests for conditional logic
Integration tests
Include Examples and Documentation

Examples

// Bearer Token
val df = spark.read.format("deltaSharing")
  .option("shareCredentialsVersion", "1")
  .option("endpoint", "https://example.com/delta-sharing")
  .option("bearerToken", "foobar")
  .option("expirationTime", "2022-01-01T00:00:00Z")
  .load("<share-name>.<schema-name>.<table-name>")
  .limit(10)

// OAuth
val df = spark.read.format("deltaSharing")
  .option("shareCredentialsVersion", "2")
  .option("shareCredentialsType", "oauth_client_credentials")
  .option("endpoint", "https://example.com/delta-sharing")
  .option("tokenEndpoint", "https://example.com/delta-sharing/oauth/token")
  .option("clientId", "abc")
  .option("clientSecret", "xyz")
  .load("<share-name>.<schema-name>.<table-name>")
  .limit(10)

# Bearer Token
df = (
    spark.read.format("deltaSharing")
    .option("shareCredentialsVersion", "1")
    .option("endpoint", "https://example.com/delta-sharing")
    .option("bearerToken", "foobar")
    .option("expirationTime", "2022-01-01T00:00:00Z")
    .load("<share-name>.<schema-name>.<table-name>")
    .limit(10)
)

# OAuth
df = (
    spark.read.format("deltaSharing")
    .option("shareCredentialsVersion", "2")
    .option("shareCredentialsType", "oauth_client_credentials")
    .option("endpoint", "https://example.com/delta-sharing")
    .option("tokenEndpoint", "https://example.com/delta-sharing/oauth/token")
    .option("clientId", "abc")
    .option("clientSecret", "xyz")
    .load("<share-name>.<schema-name>.<table-name>")
    .limit(10)
)

# load_as_spark or load_table_changes_as_spark
import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

profile = DeltaSharingProfile( # also supports OAuth
    share_credentials_version=1,
    endpoint="https://example.com/delta-sharing",
    bearer_token="foobar",
    expiration_time="2022-01-01T00:00:00Z",
)

df = delta_sharing.load_as_spark(
    f"<share-name>.<schema-name>.<table-name>",
    version=version,
    timestamp=timestamp,
    delta_sharing_profile=profile
)

stevenayers · 2024-09-14T11:51:08Z

@linzhou-db

linzhou-db · 2024-09-18T21:21:51Z

Hi @stevenayers it seems that you are aware of the previous effort #103, and it seems that we didn't want to proceed with it because it introduces secretes in the code.
I'll discuss this with our team again and get back to you.

stevenayers · 2024-09-19T05:37:18Z

Hi @stevenayers it seems that you are aware of the previous effort #103, and it seems that we didn't want to proceed with it because it introduces secretes in the code. I'll discuss this with our team again and get back to you.

Hi @linzhou-db, I understand. However, I think some of the feedback provided by the community in #103, which counters that decision, is still very valid. Whether you are reading the secrets from a file or passing them in via options, your library still has "secrets in the code." However, the current method forces us to hard-code our credentials in a plain-text file on our filesystem (which is just as dangerous as storing them in a notebook).

Your colleague's comment here about tempfiles also will not work in Python. You cannot create a tempfile in python which is then used by the underlying Scala code. A true temp file is stored in memory and never flushed to the filesystem, which is required when working across two different processes such as PySpark which communicates with the underlying JVM.

If we look at Databricks' Security Best Practices:

Store and use secrets securely
Integrating with heterogeneous systems requires managing a potentially large set of credentials and safely distributing
them across an organization. Instead of directly entering your credentials into a notebook, use Databricks secrets to store
your credentials and reference them in notebooks and jobs. Databricks secret management allows users to use and share
credentials within Databricks securely. You can also choose to use a third-party secret management service, such as AWS
Secrets Manager or a third-party secret manager.

A dedicated secrets manager should manage Secrets Management. These secret managers exist so that as developers, we do not need to store secrets in plain-text files or hardcode them as variables; instead, they can be stored in memory and passed into the logic that requires them. 😄

Doing this requires an interface such as the one proposed in this PR.

Please reach out to some of the Infrastructure & Security professionals within your organization for their opinions; I think they would share a similar sentiment to myself, @ssimeonov and @shcent.

stevenayers · 2024-09-19T05:48:38Z

Guidelines from Microsoft's Security Fundamentals: https://learn.microsoft.com/en-us/azure/security/fundamentals/secrets-best-practices#avoid-hardcoding-secrets

Embedding secrets directly into your code or configuration files is a significant security risk. If your codebase is compromised, so are your secrets. Instead, use environment variables or configuration management tools that keep secrets out of your source code. This practice minimizes the risk of accidental exposure and simplifies the process of updating secrets.

OWASP: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html#51-injection-of-secrets-file-in-memory

Fetch from the secret store (in-memory): A sidecar app/container fetches the secrets it needs directly from a secret manager service without dealing with docker config. This solution allows you to use dynamically constructed secrets without worrying about the secrets being viewable from the file system

shcent · 2024-09-19T07:02:44Z

Completely agree with @stevenayers. Having to store the authentication token in a file at all is dangerous. AWS, Azure and Databricks (and I'm sure any other cloud provider not mentioned) all come with secrets managers so that this should not be needed.

Using a temp file in Python is still awkward, especially on databricks where the path has a tendency to change from a Notebook into pure pyspark, and there is always the risk that this is not tidied correctly. Also as @stevenayers mentions in memory versions do not work.

linzhou-db · 2024-09-19T16:50:59Z

Guidelines from Microsoft's Security Fundamentals: https://learn.microsoft.com/en-us/azure/security/fundamentals/secrets-best-practices#avoid-hardcoding-secrets

Embedding secrets directly into your code or configuration files is a significant security risk. If your codebase is compromised, so are your secrets. Instead, use environment variables or configuration management tools that keep secrets out of your source code. This practice minimizes the risk of accidental exposure and simplifies the process of updating secrets.

OWASP: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html#51-injection-of-secrets-file-in-memory

Fetch from the secret store (in-memory): A sidecar app/container fetches the secrets it needs directly from a secret manager service without dealing with docker config. This solution allows you to use dynamically constructed secrets without worrying about the secrets being viewable from the file system

@stevenayers and @shcent thanks for sharing your thoughts, super helpful for me to get more context on this.

I just want to reply acking that we've read your comments and are actively discussing this. Will get back to you soon.

stevenayers · 2024-09-19T16:55:15Z

Guidelines from Microsoft's Security Fundamentals: https://learn.microsoft.com/en-us/azure/security/fundamentals/secrets-best-practices#avoid-hardcoding-secrets

Embedding secrets directly into your code or configuration files is a significant security risk. If your codebase is compromised, so are your secrets. Instead, use environment variables or configuration management tools that keep secrets out of your source code. This practice minimizes the risk of accidental exposure and simplifies the process of updating secrets.

OWASP: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html#51-injection-of-secrets-file-in-memory

Fetch from the secret store (in-memory): A sidecar app/container fetches the secrets it needs directly from a secret manager service without dealing with docker config. This solution allows you to use dynamically constructed secrets without worrying about the secrets being viewable from the file system

@stevenayers and @shcent thanks for sharing your thoughts, super helpful for me to get more context on this.

I just want to reply acking that we've read your comments and are actively discussing this. Will get back to you soon.

Thanks @linzhou-db!

DanyMariaLee · 2024-10-03T01:59:19Z

Can you please merge this and release new version of lib? Our project is waiting for this change🙂

netf · 2024-10-03T07:09:15Z

any news on this one?

Signed-off-by: Steven Ayers <[email protected]>

…`DeltaSharingProfile`. Signed-off-by: Steven Ayers <[email protected]>

Signed-off-by: Steven Ayers <[email protected]>

stevenayers · 2024-10-14T16:11:48Z

@linzhou-db @chakankardb @zsxwing @mateiz any further thoughts on this change? thanks :)

stevenayers · 2024-10-26T11:30:13Z

stevenayers force-pushed the feature/delta-credentials-as-opts branch 3 times, most recently from 81228bf to 6a2e236 Compare September 14, 2024 11:46

stevenayers force-pushed the feature/delta-credentials-as-opts branch 3 times, most recently from 8aef4f7 to 856a392 Compare September 15, 2024 06:33

linzhou-db requested review from linzhou-db, charlenelyu-db, chakankardb, moderakh, zsxwing and mateiz and removed request for moderakh and charlenelyu-db September 15, 2024 16:40

stevenayers added 5 commits October 3, 2024 16:21

Support for delta sharing profile via options

920b997

Signed-off-by: Steven Ayers <[email protected]>

Move Profile options into shareCredentialsOptions map

2b15f73

Signed-off-by: Steven Ayers <[email protected]>

Use share credentials options if any are specified

af70b8c

Signed-off-by: Steven Ayers <[email protected]>

load_as_spark and load_table_changes_as_spark accept an optional …

fbeed2b

…`DeltaSharingProfile`. Signed-off-by: Steven Ayers <[email protected]>

Update exception message to be more generic so it is not misleading

687be0d

Signed-off-by: Steven Ayers <[email protected]>

stevenayers force-pushed the feature/delta-credentials-as-opts branch from c410c53 to 687be0d Compare October 3, 2024 15:21

Merge branch 'main' into feature/delta-credentials-as-opts

b48d05a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for delta sharing profile via options #573

Support for delta sharing profile via options #573

stevenayers commented Sep 14, 2024 •

edited

Loading

stevenayers commented Sep 14, 2024

linzhou-db commented Sep 18, 2024

stevenayers commented Sep 19, 2024 •

edited

Loading

stevenayers commented Sep 19, 2024 •

edited

Loading

shcent commented Sep 19, 2024

linzhou-db commented Sep 19, 2024

stevenayers commented Sep 19, 2024

DanyMariaLee commented Oct 3, 2024

netf commented Oct 3, 2024

stevenayers commented Oct 14, 2024

stevenayers commented Oct 26, 2024

Support for delta sharing profile via options #573

Are you sure you want to change the base?

Support for delta sharing profile via options #573

Conversation

stevenayers commented Sep 14, 2024 • edited Loading

Examples

stevenayers commented Sep 14, 2024

linzhou-db commented Sep 18, 2024

stevenayers commented Sep 19, 2024 • edited Loading

stevenayers commented Sep 19, 2024 • edited Loading

shcent commented Sep 19, 2024

linzhou-db commented Sep 19, 2024

stevenayers commented Sep 19, 2024

DanyMariaLee commented Oct 3, 2024

netf commented Oct 3, 2024

stevenayers commented Oct 14, 2024

stevenayers commented Oct 26, 2024

stevenayers commented Sep 14, 2024 •

edited

Loading

stevenayers commented Sep 19, 2024 •

edited

Loading

stevenayers commented Sep 19, 2024 •

edited

Loading