Having deployed with your charts, and after running the hello-world pi calculation, I am trying to execute some simple commands in Jupyter, based on https://github.com/jadianes/spark-py-notebooks/tree/master/nb1-rdd-creation
Note that the kernel has to be set to python2 manually, since it defaults to python3.
from pyspark.sql import SparkSession
import urllib  # Python 2: urlretrieve lives directly on urllib

spark = SparkSession \
    .builder \
    .appName("PythonPi") \
    .config("spark.app.name", "spark-pi") \
    .config("spark.executor.instances", "2") \
    .getOrCreate()

# Download the dataset to the local filesystem of the driver (Jupyter) pod
f = urllib.urlretrieve("http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz", "kddcup.data_10_percent.gz")

sc = spark.sparkContext
data_file = "./kddcup.data_10_percent.gz"
raw_data = sc.textFile(data_file)

# The next line then raises an error:
raw_data.count()
Py4JJavaErrorTraceback (most recent call last)
[...]
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 24.0 failed 4 times, most recent failure: Lost task 0.3 in stage 24.0 (TID 52, 10.2.0.25, executor 1): java.io.FileNotFoundException: File file:/home/jovyan/kddcup.data_10_percent.gz does not exist
How do I make the data available to all Spark workers in the Kubernetes cluster?
The urlretrieve call saves the file only on the local filesystem of the driver (Jupyter) pod; the executors run in separate pods with their own filesystems, so the local path does not exist there. To make the data file available to the executors as well, keep it on an HDFS-compatible file system (for example S3, GCS, or HDFS) and pass the corresponding URI to the sc.textFile() call.
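For example, here is a minimal sketch assuming the file has been uploaded to a hypothetical S3 bucket named my-bucket, with the hadoop-aws connector and S3 credentials already configured on the cluster:

# Hypothetical bucket and object name; the s3a:// scheme requires the
# hadoop-aws connector and valid credentials on the driver and executors.
raw_data = sc.textFile("s3a://my-bucket/kddcup.data_10_percent.gz")
raw_data.count()

Because every executor pod now reads from the same bucket instead of a path local to the driver pod, the FileNotFoundException no longer occurs.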