Getting Exception in thread "main" org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: deltaSharing. Please find packages at https://spark.apache.org/third-party-projects.html. while trying to read table as dataframe from a share. #428

Open
mohika-knoldus opened this issue Oct 25, 2023 · 5 comments

@mohika-knoldus

import io.delta.sharing.client
import org.apache.spark.sql.SparkSession

object ReadSharedData extends App {

  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("Read Shared Data")
    .getOrCreate()

  val profilePath = "/home/knoldus/Desktop/Delta Open Sharing/resources/config.share"
  val sharedFiles = client.DeltaSharingRestClient(profilePath).listAllTables()
  sharedFiles.foreach(println) // this works fine and lists all the tables in the share provided by the data provider

  // this line throws SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND]
  val popular_products_df = spark.read.format("deltaSharing").load("/home/knoldus/Desktop/Delta Open Sharing/resources/config.share#checkout_data_products.data_products.popular_products_data")
  popular_products_df.show()
}

@oliverangelil

@mohika-knoldus did you resolve this? I'm having the same issue.

@mohika-knoldus

No, @oliverangelil.

@oliverangelil

oliverangelil commented Mar 14, 2024

@mohika-knoldus

The solution was to install Apache Hadoop.
If you add the following config to your Spark session builder, it will download the required packages automatically:

# import added so the snippet is self-contained
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-azure:3.3.1,io.delta:delta-core_2.12:2.2.0,io.delta:delta-sharing-spark_2.12:0.6.2')
    .config('spark.sql.extensions', 'io.delta.sql.DeltaSparkSessionExtension')
    .config('spark.sql.catalog.spark_catalog', 'org.apache.spark.sql.delta.catalog.DeltaCatalog')
    .getOrCreate()
)

Or you can download the JARs from the website yourself.
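
For the Scala program in the original report, the same fix can be applied directly in the session builder. A minimal sketch, reusing the package coordinates from the Python example above (those versions are an assumption and need to match your Spark and Scala versions):

import org.apache.spark.sql.SparkSession

object ReadSharedDataFixed extends App {
  // Package versions copied from the Python example above (an assumption);
  // pick the delta-sharing-spark and delta-core releases that match your
  // Spark and Scala versions.
  val spark = SparkSession.builder()
    .master("local[1]")
    .appName("Read Shared Data")
    .config("spark.jars.packages",
      "org.apache.hadoop:hadoop-azure:3.3.1," +
        "io.delta:delta-core_2.12:2.2.0," +
        "io.delta:delta-sharing-spark_2.12:0.6.2")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()

  // With delta-sharing-spark on the classpath, the deltaSharing source
  // resolves and the read from the original report no longer fails.
  val profilePath = "/home/knoldus/Desktop/Delta Open Sharing/resources/config.share"
  spark.read
    .format("deltaSharing")
    .load(s"$profilePath#checkout_data_products.data_products.popular_products_data")
    .show()
}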

Then you can read the table like this (delta_sharing here is the delta-sharing Python connector, imported with import delta_sharing):

delta_sharing.load_as_spark(table_url).show()

or like this:

spark.read.format("deltasharing").load(table_url).limit(100)
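
In both snippets, table_url is the profile file path followed by # and the fully qualified table name, the same form used in the original post. For example, a sketch reusing the path from this thread:

// Delta Sharing table URL: <profile-file-path>#<share>.<schema>.<table>
val tableUrl = "/home/knoldus/Desktop/Delta Open Sharing/resources/config.share#checkout_data_products.data_products.popular_products_data"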

Alternatively, you can read the table without Hadoop at all if you use delta_sharing.load_as_pandas(table_url, limit=10).

@mohika-knoldus

So, in the end, there is a dependency on either the Python library or Apache Hadoop?

@mohika-knoldus

Thank you for the solution, @oliverangelil.
