From 477db7272bce021661905be96630937377ab06d7 Mon Sep 17 00:00:00 2001
From: ouwen
Date: Wed, 14 Feb 2018 03:08:54 -0500
Subject: [PATCH 1/2] Organized README.md.

The README.md has been reorganized to make it easy for users to find a clean
Spark standalone Dockerfile. Additionally, we no longer rely on the Spark
hostname and a hard-coded local IP. A docker-compose file will be added later
for a quick one-line start.
---
 README.md | 262 ++++++++++++++++++++++++++----------------------------
 1 file changed, 128 insertions(+), 134 deletions(-)

diff --git a/README.md b/README.md
index 2500d88..713a9c3 100644
--- a/README.md
+++ b/README.md
@@ -1,88 +1,165 @@
 # docker-spark [![](https://images.microbadger.com/badges/version/p7hb/docker-spark.svg)](http://microbadger.com/images/p7hb/docker-spark) ![](https://img.shields.io/docker/automated/p7hb/docker-spark.svg) [![Docker Pulls](https://img.shields.io/docker/pulls/p7hb/docker-spark.svg)](https://hub.docker.com/r/p7hb/docker-spark/) [![Size](https://images.microbadger.com/badges/image/p7hb/docker-spark.svg)](https://microbadger.com/images/p7hb/docker-spark)
-Dockerfiles for ***Apache Spark***.
-Apache Spark Docker image is available directly from [https://index.docker.io](https://hub.docker.com/u/p7hb/ "» Docker Hub").
+This repo contains Dockerfiles for ***Apache Spark*** for running in *standalone* mode. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.
 
-This image contains the following softwares:
+Apache Spark Docker image is available directly from [Docker Hub](https://hub.docker.com/u/p7hb/ "» Docker Hub").
 
-* OpenJDK 64-Bit v1.8.0_131
-* Scala v2.12.2
-* SBT v0.13.15
-* Apache Spark v2.2.0
+# Quickstart
+ - [docker-compose quickstart](#docker-compose-start)
+ - [manual start](#manual-start)
+
+## Docker Compose Start
+ - TODO
+
+## Manual Start
+### Manual Step 1: Get the latest image
+There are 2 ways of getting this image:
+
+1. Build this image using [`Dockerfile`](Dockerfile) OR
+2. Pull the image directly from DockerHub.
+
+#### Build the latest image
+Copy the [`Dockerfile`](Dockerfile) to a folder on your local machine and then invoke the following command.
+
+    git clone https://github.com/Ouwen/docker-spark.git && cd docker-spark
+    docker build -t p7hb/docker-spark .
+
+#### Pull the latest image
+
+    docker pull p7hb/docker-spark
+
+
+### Manual Step 2: Run Spark image
+#### Run the latest image i.e. Apache Spark `2.2.0`
+Spark latest version as of 11th July, 2017 is `2.2.0`. So, `:latest` or `2.2.0` both refer to the same image.
+
+    docker run -it -p 7077:7077 -p 4040:4040 -p 8080:8080 -p 8081:8081 p7hb/docker-spark
+
+The above step will launch the latest image and drop you into its bash shell. We publish a few ports for the following purposes:
+ * `7077` is the port the Spark master process binds to
+ * `8080` is the port for the Spark master web UI
+ * `8081` is the port for the Spark worker web UI
+ * `4040` is the port for the Spark application (driver) web UI
+
+### Sanity Check
+All the required binaries have been added to the `PATH`. Run the following commands inside a running container.
+
+#### Start Spark Master
+
+    start-master.sh
+
+#### Start Spark Slave
+
+    start-slave.sh spark://0.0.0.0:7077
+
+#### Execute Spark job for calculating `Pi` Value
+
+    spark-submit --class org.apache.spark.examples.SparkPi --master spark://0.0.0.0:7077 $SPARK_HOME/examples/jars/spark-examples*.jar 100
+    .......
+    .......
+    Pi is roughly 3.140495114049511
+
+#### Start Spark Shell
+
+    spark-shell --master spark://0.0.0.0:7077
+
+#### View Spark Master WebUI console
+
+[`http://localhost:8080/`](http://localhost:8080/)
+
+#### View Spark Worker WebUI console
+
+[`http://localhost:8081/`](http://localhost:8081/)
+
+#### View Spark WebUI console
+Only available for the duration of the application.
+
+[`http://localhost:4040/`](http://localhost:4040/)
+
+
+# Further documentation
+## Misc Docker commands
+
+### Find IP Address of the Docker machine
+This is the IP address to use when looking up the exposed ports of our Docker container.
+
+    docker-machine ip default
+
+### Find all the running containers
+
+    docker ps
+
+### Find all the running and stopped containers
+
+    docker ps -a
+
+### Show running list of containers
+
+    docker stats --all
+
+### Find IP Address of a specific container
+
+    docker inspect <> | grep IPAddress
+
+### Open new terminal to a Docker container
+We can open a new terminal attached to a running container's shell with the following command.
+
+    docker exec -it <> /bin/bash #by Container ID
+
+OR
+
+    docker exec -it <> /bin/bash #by Container Name
 
 ## Various versions of Spark Images
 Depending on the version of the Spark Image you want, please run the corresponding command.
 Latest image is always the most recent version of Apache Spark available. As of 11th July, 2017 it is v2.2.0.
 
-### Apache Spark latest [i.e. v2.2.0]
+#### Apache Spark latest [i.e. v2.2.0]
 [Dockerfile for Apache Spark v2.2.0](https://github.com/P7h/docker-spark)
 
     docker pull p7hb/docker-spark
 
-### Apache Spark v2.2.0
+#### Apache Spark v2.2.0
 [Dockerfile for Apache Spark v2.2.0](https://github.com/P7h/docker-spark/tree/2.2.0)
 
     docker pull p7hb/docker-spark:2.2.0
 
-### Apache Spark v2.1.1
+#### Apache Spark v2.1.1
 [Dockerfile for Apache Spark v2.1.1](https://github.com/P7h/docker-spark/tree/2.1.1)
 
     docker pull p7hb/docker-spark:2.1.1
 
-### Apache Spark v2.1.0
+#### Apache Spark v2.1.0
 [Dockerfile for Apache Spark v2.1.0](https://github.com/P7h/docker-spark/tree/2.1.0)
 
     docker pull p7hb/docker-spark:2.1.0
 
-### Apache Spark v2.0.2
+#### Apache Spark v2.0.2
 [Dockerfile for Apache Spark v2.0.2](https://github.com/P7h/docker-spark/tree/2.0.2)
 
     docker pull p7hb/docker-spark:2.0.2
 
-### Apache Spark v2.0.1
+#### Apache Spark v2.0.1
 [Dockerfile for Apache Spark v2.0.1](https://github.com/P7h/docker-spark/tree/2.0.1)
 
     docker pull p7hb/docker-spark:2.0.1
 
-### Apache Spark v2.0.0
+#### Apache Spark v2.0.0
 [Dockerfile for Apache Spark v2.0.0](https://github.com/P7h/docker-spark/tree/2.0.0)
 
     docker pull p7hb/docker-spark:2.0.0
 
-### Apache Spark v1.6.3
+#### Apache Spark v1.6.3
 [Dockerfile for Apache Spark v1.6.3](https://github.com/P7h/docker-spark/tree/1.6.3)
 
     docker pull p7hb/docker-spark:1.6.3
 
-### Apache Spark v1.6.2
+#### Apache Spark v1.6.2
 [Dockerfile for Apache Spark v1.6.2](https://github.com/P7h/docker-spark/tree/1.6.2)
 
-    docker pull p7hb/docker-spark:1.6.2
-
-
-## Get the latest image
-There are 2 ways of getting this image:
-
-1. Build this image using [`Dockerfile`](Dockerfile) OR
-2. Pull the image directly from DockerHub.
-
-### Build the latest image
-Copy the [`Dockerfile`](Dockerfile) to a folder on your local machine and then invoke the following command.
-
-    docker build -t p7hb/docker-spark .
-
-### Pull the latest image
-
-    docker pull p7hb/docker-spark
-
-
-## Run Spark image
-### Run the latest image i.e. Apache Spark `2.2.0`
-Spark latest version as on 11th July, 2017 is `2.2.0`. So, `:latest` or `2.2.0` both refer to the same image.
-
-    docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark
+    docker pull p7hb/docker-spark:1.6.2
 
 ### Run images of previous versions
 Other Spark image versions of this repository can be booted by suffixing the image with the Spark version. It can have values of `2.2.0`, `2.1.1`, `2.1.0`, `2.0.2`, `2.0.1`, `2.0.0`, `1.6.3` and `1.6.2`.
@@ -119,20 +196,12 @@ Other Spark image versions of this repository can be booted by suffixing the ima
 
     docker run -it -p 4040:4040 -p 8080:8080 -p 8081:8081 -h spark --name=spark p7hb/docker-spark:1.6.2
 
-The above step will launch and run the image with:
-
-* `root` is the user we logged into.
-  * `spark` is the container name.
-  * `spark` is host name of this container.
-    * This is very important as Spark Slaves are started using this host name as the master.
-  * The container exposes ports 4040, 8080, 8081 for Spark Web UI console(s).
-
 ## Check software and versions
-
-### Host name
-
-    root@spark:~# hostname
-    spark
+This image contains the following software:
+* OpenJDK 64-Bit v1.8.0_131
+* Scala v2.12.2
+* SBT v0.13.15
+* Apache Spark v2.2.0
 
 ### Java
 
@@ -150,11 +219,11 @@ The above step will launch and run the image with:
 
 Running `sbt about` will download and set up SBT on the image.
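+For a quick cross-check of the other bundled tools, the standard version flags can be used
+inside the container (a rough sketch; the exact version strings printed will vary with the
+image tag):
+
+    java -version
+    scala -version
+    spark-submit --version
+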
-### Spark
+### Spark Scala
 
 ```
 root@spark:~# spark-shell
-Spark context Web UI available at http://172.17.0.2:4040
+Spark context Web UI available at http://localhost:4040
 Spark context available as 'sc' (master = local[*], app id = local-1483032227786).
 Spark session available as 'spark'.
 Welcome to
@@ -171,88 +240,13 @@ Type :help for more information.
 scala>
 ```
 
-## Spark commands
-All the required binaries have been added to the `PATH`.
-
-### Start Spark Master
-
-    start-master.sh
-
-### Start Spark Slave
-
-    start-slave.sh spark://spark:7077
-
-### Execute Spark job for calculating `Pi` Value
-
-    spark-submit --class org.apache.spark.examples.SparkPi --master spark://spark:7077 $SPARK_HOME/examples/jars/spark-examples*.jar 100
-    .......
-    .......
-    Pi is roughly 3.140495114049511
-
-
-OR even simpler
-
-    $SPARK_HOME/bin/run-example SparkPi 100
-    .......
-    .......
-    Pi is roughly 3.1413855141385514
-
-Please note the first command above expects Spark Master and Slave to be running. And we can even check the Spark Web UI after executing this command. But with the second command, this is not possible.
-
-### Start Spark Shell
-
-    spark-shell --master spark://spark:7077
-
-### View Spark Master WebUI console
-
-[`http://192.168.99.100:8080/`](http://192.168.99.100:8080/)
-
-### View Spark Worker WebUI console
-
-[`http://192.168.99.100:8081/`](http://192.168.99.100:8081/)
-
-### View Spark WebUI console
-Only available for the duration of the application.
-
-[`http://192.168.99.100:4040/`](http://192.168.99.100:4040/)
-
-## Misc Docker commands
-
-### Find IP Address of the Docker machine
-This is the IP Address which needs to be used to look upto for all the exposed ports of our Docker container.
-
-    docker-machine ip default
-
-### Find all the running containers
-
-    docker ps
-
-### Find all the running and stopped containers
-
-    docker ps -a
-
-### Show running list of containers
-
-    docker stats --all shows a running list of containers.
-
-### Find IP Address of a specific container
-
-    docker inspect <> | grep IPAddress
-
-### Open new terminal to a Docker container
-We can open new terminal with new instance of container's shell with the following command.
-
-    docker exec -it <> /bin/bash #by Container ID
-
-OR
-
-    docker exec -it <> /bin/bash #by Container Name
-
-
 ## Problems? Questions? Contributions? [![Contributions welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg?style=flat)](http://p7h.org/contact/)
 If you find any issues or would like to discuss further, please ping me on my Twitter handle [@P7h](http://twitter.com/P7h "» @P7h") or drop me an [email](http://p7h.org/contact/ "» Contact me").
 
 ## License [![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)
-Copyright © 2016 Prashanth Babu.
+Copyright © 2016 Prashanth Babu.
+
+Modified work Copyright © 2018 Ouwen Huang.
+
 Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
\ No newline at end of file

From 459386536488bb8ca9e2faa454591c84d11c1fd9 Mon Sep 17 00:00:00 2001
From: ouwen
Date: Wed, 14 Feb 2018 05:51:00 -0500
Subject: [PATCH 2/2] Added docker-compose quick start with Jupyter notebook.

---
 README.md          | 42 +++++++++++++++++++++++++++++++++++++-----
 docker-compose.yml | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+), 5 deletions(-)
 create mode 100644 docker-compose.yml

diff --git a/README.md b/README.md
index 713a9c3..c3a4066 100644
--- a/README.md
+++ b/README.md
@@ -6,14 +6,46 @@ This repo contains Dockerfiles for ***Apache Spark*** for running in *standalone
 Apache Spark Docker image is available directly from [Docker Hub](https://hub.docker.com/u/p7hb/ "» Docker Hub").
 
 # Quickstart
- - [docker-compose quickstart](#docker-compose-start)
- - [manual start](#manual-start)
+ - [Docker-Compose Start](#docker-compose-start)
+ - [Manual Start](#manual-start)
 
 ## Docker Compose Start
- - TODO
+Copy the [`docker-compose.yml`](docker-compose.yml) file and run the following command.
+
+    docker-compose up
+
+This should run a Spark cluster on your host machine at `localhost:7077`. You can connect to it remotely
+from any Spark shell. A short PySpark example is provided below that will work with the Jupyter notebook
+running at `localhost:8888`.
+
+
+```
+  from pyspark import SparkConf, SparkContext
+  import random
+
+  conf = SparkConf().setAppName('test').setMaster('spark://master:7077')
+  sc = SparkContext(conf=conf)
+
+  NUM_SAMPLES = 100000
+
+  def inside(p):
+      x, y = random.random(), random.random()
+      return x*x + y*y < 1
+
+  count = sc.parallelize(range(0, NUM_SAMPLES)) \
+      .filter(inside).count()
+  print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
+```
+
+Be sure that your worker is using the desired number of cores and amount of memory. These can be set directly
+in the [`docker-compose.yml`](docker-compose.yml) file.
+
+    SPARK_WORKER_CORES: 4
+    SPARK_WORKER_MEMORY: 2g
+
 ## Manual Start
-### Manual Step 1: Get the latest image
+### Step 1: Get the latest image
 There are 2 ways of getting this image:
 
 1. Build this image using [`Dockerfile`](Dockerfile) OR
 2. Pull the image directly from DockerHub.
@@ -30,7 +62,7 @@ Copy the [`Dockerfile`](Dockerfile) to a folder on your local machine and then i
 
     docker pull p7hb/docker-spark
 
-### Manual Step 2: Run Spark image
+### Step 2: Run Spark image
 #### Run the latest image i.e. Apache Spark `2.2.0`
 Spark latest version as of 11th July, 2017 is `2.2.0`. So, `:latest` or `2.2.0` both refer to the same image.
diff --git a/docker-compose.yml b/docker-compose.yml
new file mode 100644
index 0000000..e0c7e81
--- /dev/null
+++ b/docker-compose.yml
@@ -0,0 +1,32 @@
+version: "3"
+services:
+  master:
+    image: p7hb/docker-spark:latest
+    ports:
+      - 7077:7077
+      - 8080:8080
+    environment:
+      SPARK_LOCAL_DIRS: /root/data
+    volumes:
+      - spark_storage:/root/data
+    command: ["/usr/local/spark/bin/spark-class", "org.apache.spark.deploy.master.Master", "--host", "master"]
+
+  worker:
+    image: p7hb/docker-spark:latest
+    ports:
+      - 8081:8081
+    volumes:
+      - spark_storage:/root/data
+    environment:
+      SPARK_WORKER_CORES: 4
+      SPARK_WORKER_MEMORY: 4g
+      SPARK_LOCAL_DIRS: /root/data
+    command: ["/usr/local/spark/bin/spark-class", "org.apache.spark.deploy.worker.Worker", "spark://master:7077"]
+
+  client:
+    image: ouwen/tensorflow-spark
+    ports:
+      - 8888:8888
+
+volumes:
+  spark_storage:
\ No newline at end of file
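
A brief usage sketch for the compose file above (assuming it is saved as `docker-compose.yml` in the
current working directory): the stack can be brought up, checked, and torn down with the standard
Compose commands below; the Spark master web UI is then published on `localhost:8080` and the Jupyter
client on `localhost:8888`.

    docker-compose up -d          # start master, worker and client in the background
    docker-compose ps             # list the three services and their published ports
    docker-compose logs master    # the master log should report the worker registering
    docker-compose down           # stop and remove the containers; the named volume is kept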