
[BUG] - ModuleNotFoundError: No module named 'pyspark' #1782


Closed
tandav opened this issue Sep 7, 2022 · 11 comments
Labels
type:Bug A problem with the definition of one of the docker images maintained here

Comments

@tandav

tandav commented Sep 7, 2022

What docker image(s) are you using?

pyspark-notebook

OS system and architecture running docker image

Ubuntu 22.04 aarch64

What Docker command are you running?

docker run --rm -it  jupyter/pyspark-notebook python
>>> import pyspark
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pyspark'

How to Reproduce the problem?

run the command above

Command output

No response

Expected behavior

pyspark should be available without PATH manipulations and startup scripts.

Actual behavior

pyspark is not available in the python interpreter provided by the jupyter/pyspark-notebook image

Anything else?

No response

@tandav tandav added the type:Bug A problem with the definition of one of the docker images maintained here label Sep 7, 2022
@Bidek56
Contributor

Bidek56 commented Sep 7, 2022

Why are you adding python at the end of the docker run command?

@bjornjorgensen
Contributor

Use docker run -d -p 8888:8888 -p 4040:4040 jupyter/pyspark-notebook

@Bidek56
Contributor

Bidek56 commented Sep 7, 2022

Or use: docker run --rm -it -p 8888:8888 -p 4040:4040 -v "${PWD}":/home/jovyan/work jupyter/pyspark-notebook if you want local storage.

@bjornjorgensen
Contributor

From JupyterLab you can start a terminal.
(screenshot)

@tandav
Author

tandav commented Sep 8, 2022

Why are you adding python at the end of the docker run command?

I want to use jupyter/pyspark-notebook as an image with pre-installed pyspark to run my python scripts, and I don't need the Jupyter notebook for this.

Is such usage out of scope for the docker-stacks project?

@tandav
Author

tandav commented Sep 8, 2022

From JupyterLab you can start a terminal. (screenshot)

Yes, I see that when you run the Jupyter notebook there is a startup script which sets the pyspark environment variables:

ln -s "${SPARK_HOME}/sbin/spark-config.sh" /usr/local/bin/before-notebook.d/spark-config.sh

My question is: is it possible to set these variables not only for the Jupyter notebook but also for a regular python process?
This docker image is very handy when you quickly need to run pyspark but don't need Jupyter.
Currently you have to manually run that spark-config.sh file or pip install pyspark to make pyspark work in regular python.
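
For reference, a minimal sketch of that manual workaround in an interactive container; it assumes the hook only exports environment variables, so it has to be sourced (not executed) to affect the current shell, and the exact paths should be checked inside the image:

docker run --rm -it jupyter/pyspark-notebook bash
# inside the container: source the same hook the notebook startup runs
source "${SPARK_HOME}/sbin/spark-config.sh"
python -c "import pyspark; print(pyspark.__version__)"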

@Bidek56
Contributor

Bidek56 commented Sep 8, 2022

  1. You can extend the jupyter/pyspark-notebook Dockerfile and change the entry point, but the image size may be too big for your purpose (see the sketch after this list).
  2. Use a different image which doesn't have Jupyter included.
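
A minimal sketch of option 1, assuming all you want is a different default command; it relies on the start.sh wrapper that is confirmed to set up pyspark further down in this thread, and on arm64 hosts the tag should be aarch64-latest as noted below:

FROM jupyter/pyspark-notebook:latest
# Override the default command so a plain `docker run <image>` drops into a
# Python shell started through start.sh, which runs the Spark setup hooks.
CMD ["start.sh", "python"]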

@bjornjorgensen
Contributor

bjornjorgensen commented Sep 8, 2022

docker run --rm -it bjornjorgensen/jupyter-spark-master-docker python
Python 3.10.6 | packaged by conda-forge | (main, Aug 22 2022, 20:35:26) [GCC 10.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyspark
>>> 

This is a build with the Spark master branch.

https://github.com/bjornjorgensen/jupyter-spark-master-docker/blob/fcc09efa27c5ac17087ed79a747c272947da76cb/Dockerfile#L68

You can try adding
pip install -e ${SPARK_HOME}/python

EDIT:

@tandav can you try docker run --rm -it bjornjorgensen/pyspark080922_pip:latest python

I did add

RUN pip install -e ${SPARK_HOME}/python
RUN fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"

To line 58
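
For readers who want the same change on top of the stock image rather than the custom build above, a minimal sketch; it assumes SPARK_HOME, CONDA_DIR and NB_USER are already set in the base image, as they are in jupyter/pyspark-notebook, and that the editable install is permitted for the notebook user:

FROM jupyter/pyspark-notebook:latest
# Install Spark's bundled Python package in editable mode so `import pyspark`
# works in a plain python process, not only under the notebook startup scripts.
RUN pip install -e "${SPARK_HOME}/python" && \
    fix-permissions "${CONDA_DIR}" && \
    fix-permissions "/home/${NB_USER}"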

@mathbunnyru
Member

I will add a few details here:

  1. You say you use Ubuntu 22.04 aarch64. If you want to use the latest image, please use it like this: jupyter/pyspark-notebook:aarch64-latest, because we add this prefix to all arm64 image tags (including latest).
  2. When you add python at the end, you don't run our scripts that set the variables properly. You should use start-notebook.sh python. There is already an existing issue proposing that we set up these things in ENTRYPOINT and not CMD: Move environment setup from start.sh to ENTRYPOINT instead of CMD #1528

That being said, you should use docker run --rm -it jupyter/pyspark-notebook:aarch64-latest start-notebook.sh python.

Please tell me if this doesn't answer your question or it still doesn't work.

@tandav
Author

tandav commented Sep 9, 2022

That being said, you should use docker run --rm -it jupyter/pyspark-notebook:aarch64-latest start-notebook.sh python.

Please tell me if this doesn't answer your question or it still doesn't work.

@mathbunnyru

Thank you, I tried this command, but start-notebook.sh runs the Jupyter notebook. I see that start-notebook.sh calls /usr/local/bin/start.sh internally, so I tried:

docker run --rm -it jupyter/pyspark-notebook start.sh python

and it works. I can import pyspark inside this python without a ModuleNotFoundError.
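
As a usage note, the same pattern extends to running a local script; the my_script.py name and the work mount are illustrative, and on arm64 hosts the aarch64-latest tag from the earlier comment applies:

docker run --rm -it -v "${PWD}":/home/jovyan/work jupyter/pyspark-notebook \
    start.sh python /home/jovyan/work/my_script.py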

@mathbunnyru
Member

Yes, sorry, you're right, I meant start.sh of course :)
I'm glad I was able to help you.
