Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for delta sharing profile via options #573

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

stevenayers
Copy link

@stevenayers stevenayers commented Sep 14, 2024

Add support for delta sharing profile via options (resolves #483).

Task List:

  • Add options to DeltaSharingOptions
  • Unit tests for DeltaSharingOptions
  • DeltaSharingProfile from DeltaSharingOptions
  • Unit tests for DeltaSharingProfileProvider
  • Conditional logic to create from options
  • Unit tests for conditional logic
  • Integration tests
  • Include Examples and Documentation

Examples

// Bearer Token
val df = spark.read.format("deltaSharing")
  .option("shareCredentialsVersion", "1")
  .option("endpoint", "https://example.com/delta-sharing")
  .option("bearerToken", "foobar")
  .option("expirationTime", "2022-01-01T00:00:00Z")
  .load("<share-name>.<schema-name>.<table-name>")
  .limit(10)

// OAuth
val df = spark.read.format("deltaSharing")
  .option("shareCredentialsVersion", "2")
  .option("shareCredentialsType", "oauth_client_credentials")
  .option("endpoint", "https://example.com/delta-sharing")
  .option("tokenEndpoint", "https://example.com/delta-sharing/oauth/token")
  .option("clientId", "abc")
  .option("clientSecret", "xyz")
  .load("<share-name>.<schema-name>.<table-name>")
  .limit(10)
# Bearer Token
df = (
    spark.read.format("deltaSharing")
    .option("shareCredentialsVersion", "1")
    .option("endpoint", "https://example.com/delta-sharing")
    .option("bearerToken", "foobar")
    .option("expirationTime", "2022-01-01T00:00:00Z")
    .load("<share-name>.<schema-name>.<table-name>")
    .limit(10)
)

# OAuth
df = (
    spark.read.format("deltaSharing")
    .option("shareCredentialsVersion", "2")
    .option("shareCredentialsType", "oauth_client_credentials")
    .option("endpoint", "https://example.com/delta-sharing")
    .option("tokenEndpoint", "https://example.com/delta-sharing/oauth/token")
    .option("clientId", "abc")
    .option("clientSecret", "xyz")
    .load("<share-name>.<schema-name>.<table-name>")
    .limit(10)
)

# load_as_spark or load_table_changes_as_spark
import delta_sharing
from delta_sharing.protocol import DeltaSharingProfile

profile = DeltaSharingProfile( # also supports OAuth
    share_credentials_version=1,
    endpoint="https://example.com/delta-sharing",
    bearer_token="foobar",
    expiration_time="2022-01-01T00:00:00Z",
)

df = delta_sharing.load_as_spark(
    f"<share-name>.<schema-name>.<table-name>",
    version=version,
    timestamp=timestamp,
    delta_sharing_profile=profile
)

@stevenayers stevenayers force-pushed the feature/delta-credentials-as-opts branch 3 times, most recently from 81228bf to 6a2e236 Compare September 14, 2024 11:46
@stevenayers
Copy link
Author

@linzhou-db

@stevenayers stevenayers force-pushed the feature/delta-credentials-as-opts branch 3 times, most recently from 8aef4f7 to 856a392 Compare September 15, 2024 06:33
@linzhou-db
Copy link
Collaborator

Hi @stevenayers it seems that you are aware of the previous effort #103, and it seems that we didn't want to proceed with it because it introduces secretes in the code.
I'll discuss this with our team again and get back to you.

@stevenayers
Copy link
Author

stevenayers commented Sep 19, 2024

Hi @stevenayers it seems that you are aware of the previous effort #103, and it seems that we didn't want to proceed with it because it introduces secretes in the code. I'll discuss this with our team again and get back to you.

Hi @linzhou-db, I understand. However, I think some of the feedback provided by the community in #103, which counters that decision, is still very valid. Whether you are reading the secrets from a file or passing them in via options, your library still has "secrets in the code." However, the current method forces us to hard-code our credentials in a plain-text file on our filesystem (which is just as dangerous as storing them in a notebook).

Your colleague's comment here about tempfiles also will not work in Python. You cannot create a tempfile in python which is then used by the underlying Scala code. A true temp file is stored in memory and never flushed to the filesystem, which is required when working across two different processes such as PySpark which communicates with the underlying JVM.

If we look at Databricks' Security Best Practices:

Store and use secrets securely
Integrating with heterogeneous systems requires managing a potentially large set of credentials and safely distributing
them across an organization. Instead of directly entering your credentials into a notebook, use Databricks secrets to store
your credentials and reference them in notebooks and jobs. Databricks secret management allows users to use and share
credentials within Databricks securely. You can also choose to use a third-party secret management service, such as AWS
Secrets Manager or a third-party secret manager.

A dedicated secrets manager should manage Secrets Management. These secret managers exist so that as developers, we do not need to store secrets in plain-text files or hardcode them as variables; instead, they can be stored in memory and passed into the logic that requires them. 😄

Doing this requires an interface such as the one proposed in this PR.

Please reach out to some of the Infrastructure & Security professionals within your organization for their opinions; I think they would share a similar sentiment to myself, @ssimeonov and @shcent.

@stevenayers
Copy link
Author

stevenayers commented Sep 19, 2024

Guidelines from Microsoft's Security Fundamentals: https://learn.microsoft.com/en-us/azure/security/fundamentals/secrets-best-practices#avoid-hardcoding-secrets

Embedding secrets directly into your code or configuration files is a significant security risk. If your codebase is compromised, so are your secrets. Instead, use environment variables or configuration management tools that keep secrets out of your source code. This practice minimizes the risk of accidental exposure and simplifies the process of updating secrets.

OWASP: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html#51-injection-of-secrets-file-in-memory

Fetch from the secret store (in-memory): A sidecar app/container fetches the secrets it needs directly from a secret manager service without dealing with docker config. This solution allows you to use dynamically constructed secrets without worrying about the secrets being viewable from the file system

@shcent
Copy link

shcent commented Sep 19, 2024

Completely agree with @stevenayers. Having to store the authentication token in a file at all is dangerous. AWS, Azure and Databricks (and I'm sure any other cloud provider not mentioned) all come with secrets managers so that this should not be needed.

Using a temp file in Python is still awkward, especially on databricks where the path has a tendency to change from a Notebook into pure pyspark, and there is always the risk that this is not tidied correctly. Also as @stevenayers mentions in memory versions do not work.

@linzhou-db
Copy link
Collaborator

Guidelines from Microsoft's Security Fundamentals: https://learn.microsoft.com/en-us/azure/security/fundamentals/secrets-best-practices#avoid-hardcoding-secrets

Embedding secrets directly into your code or configuration files is a significant security risk. If your codebase is compromised, so are your secrets. Instead, use environment variables or configuration management tools that keep secrets out of your source code. This practice minimizes the risk of accidental exposure and simplifies the process of updating secrets.

OWASP: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html#51-injection-of-secrets-file-in-memory

Fetch from the secret store (in-memory): A sidecar app/container fetches the secrets it needs directly from a secret manager service without dealing with docker config. This solution allows you to use dynamically constructed secrets without worrying about the secrets being viewable from the file system

@stevenayers and @shcent thanks for sharing your thoughts, super helpful for me to get more context on this.

I just want to reply acking that we've read your comments and are actively discussing this. Will get back to you soon.

@stevenayers
Copy link
Author

Guidelines from Microsoft's Security Fundamentals: https://learn.microsoft.com/en-us/azure/security/fundamentals/secrets-best-practices#avoid-hardcoding-secrets

Embedding secrets directly into your code or configuration files is a significant security risk. If your codebase is compromised, so are your secrets. Instead, use environment variables or configuration management tools that keep secrets out of your source code. This practice minimizes the risk of accidental exposure and simplifies the process of updating secrets.

OWASP: https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html#51-injection-of-secrets-file-in-memory

Fetch from the secret store (in-memory): A sidecar app/container fetches the secrets it needs directly from a secret manager service without dealing with docker config. This solution allows you to use dynamically constructed secrets without worrying about the secrets being viewable from the file system

@stevenayers and @shcent thanks for sharing your thoughts, super helpful for me to get more context on this.

I just want to reply acking that we've read your comments and are actively discussing this. Will get back to you soon.

Thanks @linzhou-db!

@DanyMariaLee
Copy link

Can you please merge this and release new version of lib? Our project is waiting for this change🙂

@netf
Copy link

netf commented Oct 3, 2024

any news on this one?

@stevenayers stevenayers force-pushed the feature/delta-credentials-as-opts branch from c410c53 to 687be0d Compare October 3, 2024 15:21
@stevenayers
Copy link
Author

@linzhou-db @chakankardb @zsxwing @mateiz any further thoughts on this change? thanks :)

@stevenayers
Copy link
Author

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

load_as_spark using config as variable instead of file
5 participants