Data transport problem #24

Open
jtlz2 opened this issue Mar 13, 2019 · 1 comment

Comments


jtlz2 commented Mar 13, 2019

Having deployed using your charts, and after running a hello-world pi calculation, I am trying to execute some simple commands in Jupyter, based on https://github.com/jadianes/spark-py-notebooks/tree/master/nb1-rdd-creation

Note that the kernel has to be set manually to python2, since it defaults to python3.
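One reason: urllib.urlretrieve in the code below exists only in Python 2. Under the default python3 kernel, a roughly equivalent download (a sketch, using the relocated urllib.request.urlretrieve) would be:

from urllib.request import urlretrieve  # Python 3 location of urlretrieve

# Fetch the 10% KDD Cup '99 sample to the notebook pod's local disk
urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz",
            "kddcup.data_10_percent.gz")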

from pyspark.sql import SparkSession
import urllib

spark = SparkSession\
      .builder\
      .appName("PythonPi")\
      .config("spark.app.name", "spark-pi")\
      .config("spark.executor.instances", "2")\
      .getOrCreate()

# Download the dataset to the notebook pod's local filesystem (Python 2 urllib)
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

sc = spark.sparkContext

# Relative path: resolved against the driver's working directory as a
# file:// URL, which each executor then looks up on its own local disk
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

# The next line raises an error:
raw_data.count()
Py4JJavaErrorTraceback (most recent call last)
[...]
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 4 times, most recent failure: Lost task 0.3 in stage 24.0 (TID 52, 10.2.0.25, executor 1): java.io.FileNotFoundException: File file:/home/jovyan/kddcup.data_10_percent.gz does not exist

How do I make the data available to all Spark workers in the k8s cluster?

dshirish (Contributor) commented

To make the data file available to the executors as well, you can keep it on an HDFS-compatible file system (for example S3, GCS, or HDFS) and use the appropriate URI in the sc.textFile() call.
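For example, here is a minimal sketch of the same read against an object store, assuming the file has been uploaded to a hypothetical bucket named my-bucket and that the cluster image ships the matching Hadoop connector (hadoop-aws for s3a://, or the GCS connector for gs://):

from pyspark.sql import SparkSession

spark = SparkSession\
      .builder\
      .appName("PythonPi")\
      .config("spark.executor.instances", "2")\
      .getOrCreate()
sc = spark.sparkContext

# Every executor resolves this URI against the shared object store,
# so the file no longer has to exist on each pod's local filesystem.
# "my-bucket" is a placeholder for your own bucket name.
raw_data = sc.textFile("s3a://my-bucket/kddcup.data_10_percent.gz")
print(raw_data.count())

The original failure happens because "./kddcup.data_10_percent.gz" is resolved as a local file:// path on every node, but the download only ran on the notebook pod; a shared store (or a volume mounted into all pods) removes that asymmetry.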
