Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spark Client StorageID support #8918

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

itaigilo
Copy link
Contributor

@itaigilo itaigilo commented Apr 1, 2025

Supporting Spark Client usage for lakeFS servers with multiple blockstores.
Also includes GC support.

Required updating lakeFS to the latest version.

Still working on testing it, but this can be reviewed.

@itaigilo itaigilo added include-changelog PR description should be included in next release changelog minor-change Used for PRs that don't require issue attached msb labels Apr 1, 2025
Copy link

github-actions bot commented Apr 1, 2025

E2E Test Results - Quickstart

12 passed

Copy link

github-actions bot commented Apr 1, 2025

E2E Test Results - DynamoDB Local - Local Block Adapter

13 passed, 1 skipped

@itaigilo itaigilo requested review from a team, nopcoder and arielshaqed April 2, 2025 09:26
@itaigilo itaigilo marked this pull request as ready for review April 2, 2025 10:43
Comment on lines +225 to +227
if (storageConfigList.isEmpty || storageConfigList.size() == 1) {
cfg.getStorageConfig
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of a non empty list, we still need to pick the configuration from the list - right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In such case, it'll be returned in the non-list part of the response.
Same handling as in the other clients.

@@ -168,7 +168,7 @@ class ApiClient private (conf: APIConfigurations) {
val storageNamespace = key.storageClientType match {
case StorageClientType.HadoopFS =>
ApiClient
.translateURI(URI.create(repo.getStorageNamespace), getBlockstoreType)
.translateURI(URI.create(repo.getStorageNamespace), getBlockstoreType(repo.getStorageId))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The storage namespace information is cached in this client.
The blockstore type is not cached and it means that we call the server multiple times.
Suggest to add this information to the cache or cache the configucation at the same level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding a getRepo() call to the server, but it's not here, and only for the GC.

What cache are you suggesting here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was looking at the namespace cache and missed the key to namespace cache - can ignore my comment.

Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks like a worthwhile change! Surprised its scope is so low. Let's add how we tested this.

@@ -42,7 +42,7 @@ buildInfoPackage := "io.treeverse.clients"
enablePlugins(S3Plugin, BuildInfoPlugin)

libraryDependencies ++= Seq(
"io.lakefs" % "sdk" % "1.0.0",
"io.lakefs" % "sdk" % "1.53.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have this as a separate PR! It's not that I'm afraid, the lakeFS compatibility guarantees means this is a perfectly safe change. But I'd still rather be careful because I'm actually scared by this change.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd don't mind separating this, but since the CI tests pass here, I'm curious -
What difference will it make?
If merging the lib bump and right after it the MSB changes -
How will it give more confidence?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we squash merges, 2 PRs give a more detailed commit history. That makes it easier to git bisect and/or roll back some changes. But not critical.

cfg.getStorageConfig
val storageConfigList = cfg.getStorageConfigList
if (storageConfigList.isEmpty || storageConfigList.size() == 1) {
cfg.getStorageConfig
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could still check storageID when the list has size 1 and specifies a storage ID, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm complying with the other clients (WebUI, lakectl, Everest) -
Were doing the same check.
It's about making the distinction, and using the list only for actual MSB cases.

Copy link
Contributor Author

@itaigilo itaigilo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @arielshaqed and @nopcoder for your review.

As written, still on to testing this 😐
Will update afterwards.

Meanwhile, addressed your comments.

@@ -42,7 +42,7 @@ buildInfoPackage := "io.treeverse.clients"
enablePlugins(S3Plugin, BuildInfoPlugin)

libraryDependencies ++= Seq(
"io.lakefs" % "sdk" % "1.0.0",
"io.lakefs" % "sdk" % "1.53.1",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd don't mind separating this, but since the CI tests pass here, I'm curious -
What difference will it make?
If merging the lib bump and right after it the MSB changes -
How will it give more confidence?

@@ -168,7 +168,7 @@ class ApiClient private (conf: APIConfigurations) {
val storageNamespace = key.storageClientType match {
case StorageClientType.HadoopFS =>
ApiClient
.translateURI(URI.create(repo.getStorageNamespace), getBlockstoreType)
.translateURI(URI.create(repo.getStorageNamespace), getBlockstoreType(repo.getStorageId))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm adding a getRepo() call to the server, but it's not here, and only for the GC.

What cache are you suggesting here?

cfg.getStorageConfig
val storageConfigList = cfg.getStorageConfigList
if (storageConfigList.isEmpty || storageConfigList.size() == 1) {
cfg.getStorageConfig
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm complying with the other clients (WebUI, lakectl, Everest) -
Were doing the same check.
It's about making the distinction, and using the list only for actual MSB cases.

Comment on lines +225 to +227
if (storageConfigList.isEmpty || storageConfigList.size() == 1) {
cfg.getStorageConfig
} else {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In such case, it'll be returned in the non-list part of the response.
Same handling as in the other clients.

@itaigilo itaigilo requested review from nopcoder and arielshaqed April 3, 2025 10:26
Copy link
Contributor

@arielshaqed arielshaqed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing real to add, and only comment is not critical. StillLGTM.

@@ -42,7 +42,7 @@ buildInfoPackage := "io.treeverse.clients"
enablePlugins(S3Plugin, BuildInfoPlugin)

libraryDependencies ++= Seq(
"io.lakefs" % "sdk" % "1.0.0",
"io.lakefs" % "sdk" % "1.53.1",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we squash merges, 2 PRs give a more detailed commit history. That makes it easier to git bisect and/or roll back some changes. But not critical.

Copy link
Contributor

@nopcoder nopcoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
include-changelog PR description should be included in next release changelog minor-change Used for PRs that don't require issue attached msb
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants