
Updated for Spark v2.3.0 #9

Open · wants to merge 4 commits into master
52 changes: 42 additions & 10 deletions Dockerfile
@@ -3,19 +3,20 @@ FROM openjdk:8
MAINTAINER Prashanth Babu <[email protected]>

# Scala related variables.
ARG SCALA_VERSION=2.12.2
ARG SCALA_VERSION=2.11.12
ARG SCALA_BINARY_ARCHIVE_NAME=scala-${SCALA_VERSION}
ARG SCALA_BINARY_DOWNLOAD_URL=http://downloads.lightbend.com/scala/${SCALA_VERSION}/${SCALA_BINARY_ARCHIVE_NAME}.tgz

# SBT related variables.
ARG SBT_VERSION=0.13.15
ARG SBT_VERSION=1.2.3
ARG SBT_BINARY_ARCHIVE_NAME=sbt-$SBT_VERSION
ARG SBT_BINARY_DOWNLOAD_URL=https://dl.bintray.com/sbt/native-packages/sbt/${SBT_VERSION}/${SBT_BINARY_ARCHIVE_NAME}.tgz
ARG SBT_BINARY_DOWNLOAD_URL=https://github.com/sbt/sbt/releases/download/v${SBT_VERSION}/${SBT_BINARY_ARCHIVE_NAME}.tgz

# Spark related variables.
ARG SPARK_VERSION=2.2.0
ARG SPARK_BINARY_ARCHIVE_NAME=spark-${SPARK_VERSION}-bin-hadoop2.7
ARG SPARK_BINARY_DOWNLOAD_URL=http://d3kbcqa49mib13.cloudfront.net/${SPARK_BINARY_ARCHIVE_NAME}.tgz
ARG SPARK_VERSION=2.3.1
ARG SPARK_BINARY_ARCHIVE_PREFIX=spark-${SPARK_VERSION}
ARG SPARK_BINARY_ARCHIVE_NAME=${SPARK_BINARY_ARCHIVE_PREFIX}-bin-hadoop2.7
ARG SPARK_BINARY_DOWNLOAD_URL=https://apache.org/dist/spark/${SPARK_BINARY_ARCHIVE_PREFIX}/${SPARK_BINARY_ARCHIVE_NAME}.tgz

# Configure env variables for Scala, SBT and Spark.
# Also configure PATH env variable to include binary folders of Java, Scala, SBT and Spark.
@@ -25,31 +26,62 @@ ENV SPARK_HOME /usr/local/spark
ENV PATH $JAVA_HOME/bin:$SCALA_HOME/bin:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin:$PATH

# Download, uncompress and move all the required packages and libraries to their corresponding directories in /usr/local/ folder.
RUN apt-get -yqq update && \
# /etc/init.d/ssh start && \
# apt-get install -yqq vim screen tmux openssh-server && \
RUN echo 'deb http://security.debian.org/debian-security stretch/updates main' >>/etc/apt/sources.list && \
apt-get -yqq update && \
apt-get install -yqq apt-utils && \
apt-get install -yqq vim screen tmux && \
apt-get clean && \
rm -rf /var/lib/apt/lists/* && \
rm -rf /tmp/* && \
wget -qO - ${SCALA_BINARY_DOWNLOAD_URL} | tar -xz -C /usr/local/ && \
wget -qO - ${SBT_BINARY_DOWNLOAD_URL} | tar -xz -C /usr/local/ && \
wget -qO - ${SPARK_BINARY_DOWNLOAD_URL} | tar -xz -C /usr/local/ && \
wget -qO - ${SBT_BINARY_DOWNLOAD_URL} | tar -xz -C /usr/local/ && \
cd /usr/local/ && \
ln -s ${SCALA_BINARY_ARCHIVE_NAME} scala && \
ln -s ${SPARK_BINARY_ARCHIVE_NAME} spark && \
cp spark/conf/log4j.properties.template spark/conf/log4j.properties && \
sed -i -e s/WARN/ERROR/g spark/conf/log4j.properties && \
sed -i -e s/INFO/ERROR/g spark/conf/log4j.properties
sed -i -e s/INFO/ERROR/g spark/conf/log4j.properties && \
printf '%s\n' \
    "name := \"my-spark\"" \
    "version := \"0.1.0\"" \
    "scalaVersion := \"${SCALA_VERSION}\"" \
    "" \
    "libraryDependencies ++= Seq(" \
    "  \"org.apache.spark\" %% \"spark-core\" % \"${SPARK_VERSION}\" withSources()," \
    "  \"org.apache.spark\" %% \"spark-streaming\" % \"${SPARK_VERSION}\" withSources()," \
    "  \"org.apache.spark\" %% \"spark-sql\" % \"${SPARK_VERSION}\" withSources()," \
    "  \"org.apache.spark\" %% \"spark-hive\" % \"${SPARK_VERSION}\" withSources()," \
    "  \"org.apache.spark\" %% \"spark-streaming-twitter\" % \"${SPARK_VERSION}\" withSources()," \
    "  \"org.apache.spark\" %% \"spark-mllib\" % \"${SPARK_VERSION}\" withSources()," \
    "  \"org.apache.spark\" %% \"spark-csv\" % \"${SPARK_VERSION}\" withSources()" \
    ")" > /root/build.sbt

# sshd is turned off; the SSH key setup below is kept for reference.
#RUN ssh-keygen -t RSA -f ~/.ssh/id_rsa -N '' && \
# mv ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys && \
# ssh-keyscan localhost > ~/.ssh/known_hosts && \
# /etc/init.d/ssh start & && \
# NOTE: daemons launched during a RUN step do not persist into the final image.
RUN /usr/local/spark-${SPARK_VERSION}-bin-hadoop2.7/sbin/start-all.sh


# We will be running our Spark jobs as `root` user.
USER root

# Allow jobs to be mounted externally
VOLUME ["/root/data","/root/scripts"]

# Working directory is set to the home folder of `root` user.
WORKDIR /root

# Expose ports for monitoring.
# SparkContext web UI on 4040 -- only available for the duration of the application.
# Spark master’s web UI on 8080.
# Spark worker web UI on 8081.
# EXPOSE 4040 8080 8081 22 #currently running with SSHD off
EXPOSE 4040 8080 8081

CMD ["/bin/bash"]
CMD ["/usr/local/spark/bin/spark-shell"]
18 changes: 13 additions & 5 deletions README.md
@@ -6,17 +6,25 @@ Apache Spark Docker image is available directly from [https://index.docker.io](h

This image contains the following software:

* OpenJDK 64-Bit v1.8.0_131
* Scala v2.12.2
* SBT v0.13.15
* OpenJDK 64-Bit v1.8.0_162
* Scala v2.11.12
* SBT v1.2.3
* Apache Spark v2.3.1



## Various versions of Spark Images
Depending on the version of the Spark image you want, run the corresponding command.<br>
The latest image is always the most recent version of Apache Spark available. As of June 2018, it is v2.3.1.

### Apache Spark latest [i.e. v2.2.0]
### Apache Spark latest [i.e. v2.3.0]
[Dockerfile for Apache Spark v2.3.0](https://github.com/A140233/docker-spark)

docker pull A140233/my-spark:2.3.0
New: the `run-spark` shell script runs, restarts, and rebuilds a standalone container.
See the bundled man page: `man -M . 1 run-spark`.
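
A hedged sketch of the typical workflow with the script, assuming you run it from the repository root (where the Dockerfile, `run-spark`, and `man1/` live):

    ./run-spark             # build the image if needed and start the `spark` container
    ./run-spark --restart   # stop the container and start it again
    ./run-spark --reinit    # remove the container and run a fresh one from the existing image
    ./run-spark --rebuild   # also remove the image and rebuild it from the Dockerfile
    man -M . 1 run-spark    # read the bundled man page
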

### Apache Spark v2.2.0
[Dockerfile for Apache Spark v2.2.0](https://github.com/P7h/docker-spark)

docker pull p7hb/docker-spark
@@ -255,4 +263,4 @@ If you find any issues or would like to discuss further, please ping me on my Tw

## License [![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)
Copyright &copy; 2016 Prashanth Babu.<br>
Licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0).
17 changes: 17 additions & 0 deletions man1/run-spark.1
@@ -0,0 +1,17 @@
.\" Manpage for run-spark
.\" Contact rbailey working at AGL for errors
.TH run-spark 1 "03 Jun 2018" "1.0" "run-spark man page"
.SH NAME
run-spark \- run a spark container in local mode
.SH SYNOPSIS
\./run-spark [--restart|--reinit|--rebuild|-s|-i|-b]
.SH DESCRIPTION
run-spark is a helper script that builds and runs a local Spark instance. Without arguments it does the minimum needed to get the instance working; the options allow you to restart the container, reinitialize (re-run) it, or rebuild the image from the Dockerfile. The script also publishes non-standard host ports to avoid conflicts with an existing instance.
.SH OPTIONS
run-spark provides restart, reinit and rebuild options for when the container is corrupted or needs changes:
\fB\-s\fR/\fB\-\-restart\fR stops the running container and starts it again;
\fB\-i\fR/\fB\-\-reinit\fR removes the container and runs a fresh one from the existing image;
\fB\-b\fR/\fB\-\-rebuild\fR also removes the image and rebuilds it from the Dockerfile before running a new container.
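.\" Illustrative examples; the invocations mirror the options parsed by the run-spark script.
.SH EXAMPLES
Build the image if necessary and start the container:
.PP
.nf
\&./run-spark
.fi
.PP
Rebuild the image from the Dockerfile and run a fresh container:
.PP
.nf
\&./run-spark \-\-rebuild
.fi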
.SH SEE ALSO
docker(1)
.SH BUGS
No known bugs.
.SH AUTHOR
Rupert Bailey at AGL
80 changes: 80 additions & 0 deletions run-spark
@@ -0,0 +1,80 @@
#!/usr/bin/env bash

SPARK_VERSION=`sed -n '/SPARK_VERSION/s/.*=\([0-9]\+\.[0-9]\+\.[0-9]\+\)/\1/p' Dockerfile`

function usage(){
echo "$0 [-s|-i|-b|--restart|--reinit|--rebuild]"
exit 1
}

TEMP=$(getopt -o sib --long restart,reinit,rebuild \
-n "$0" -- "$@")

if [ $? != 0 ] ; then usage >&2 ; exit 1 ; fi

# Note the quotes around `$TEMP': they are essential!
eval set -- "$TEMP"

mystart=false
init=false
build=false

while true ; do
case "$1" in
-s | --restart ) mystart=true ; shift ;;
-i | --reinit ) init=true ; shift ;;
-b | --rebuild ) build=true ; shift ;;
-- ) shift ; break ;;
*) echo "Internal error!" ; exit 1 ;;
esac
done
if [[ $# -gt 0 ]]; then
echo "Remaining arguments:"
for arg do echo '--> '"\`$arg'" ; done
fi

#stop the container if running and asked for restart
declare -a SPARK_CONTAINERS_RUNNING=$(docker ps --quiet --filter name=spark)
if [[ ${SPARK_CONTAINERS_RUNNING[0]} != '' && ( $mystart == true || $init == true || $build == true ) ]]; then docker stop $SPARK_CONTAINERS_RUNNING; fi

#delete the container if existing and asked for reinit
declare -a SPARK_CONTAINERS_STOPPED=$(docker ps --quiet --all --filter name=spark)
if [[ ${SPARK_CONTAINERS_STOPPED[0]} != '' && ( $init == true || $build == true ) ]]; then docker rm $SPARK_CONTAINERS_STOPPED; fi

#drop the image if visible and asked for rebuild
declare -a SPARK_IMAGES=$(docker images --quiet --all my-spark:${SPARK_VERSION})
if [[ ${SPARK_IMAGES[0]} != '' && $build == true ]]; then docker rmi $SPARK_IMAGES; fi

#bail out if there is no Dockerfile to build from
if ! [[ -f 'Dockerfile' ]]; then echo "Dockerfile absent! Exiting..."; exit 1; fi

#and only if image name is available
declare -a SPARK_IMAGES=$(docker images --quiet --all my-spark:${SPARK_VERSION})
if [[ ${SPARK_IMAGES[0]} == '' ]]; then
echo "Docker is absent, building it...";
docker build --tag my-spark:${SPARK_VERSION} .
fi

#run the container if image exists but container does not
declare -a SPARK_IMAGES=$(docker images --quiet --all my-spark:${SPARK_VERSION})
declare -a SPARK_CONTAINERS=$(docker ps --quiet --all --filter name=spark)

#exit if the image doesn't exist
if [[ (${SPARK_IMAGES[0]} == '' ) ]]; then echo "image doesn't exist! exiting..."; exit; fi

#run if container does not exist
if [[ ( ${SPARK_CONTAINERS[0]} == '' ) ]]; then
echo "container doesn't exist, great, will spin it up..." ;
docker run --detach --tty --publish 8022:22 --publish 4041:4040 --publish 8090:8080 --publish 8091:8081 --hostname spark --name=spark my-spark:${SPARK_VERSION}
else
echo "container exists already, will check if it needs starting instead..."
fi

#start the container if it is not already running
declare -a SPARK_CONTAINERS=$(docker ps --quiet --all --filter name=spark)
if [[ $(docker inspect -f {{.State.Running}} ${SPARK_CONTAINERS[0]}) == 'true' ]]; then
echo "...container is running"
else
echo "...starting an existing container instead."
docker start ${SPARK_CONTAINERS[0]}
fi
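
For reference, a hedged sketch of interacting with the container once `run-spark` has it up; the container name and ports come from the script above, and `spark-shell` is assumed to be on the image's PATH via `SPARK_HOME`:

    ./run-spark                         # build my-spark:<SPARK_VERSION> if needed and start the `spark` container
    docker exec -it spark spark-shell   # open an interactive Spark shell inside the running container
    docker port spark                   # list the host ports published by run-spark (4041, 8090, 8091, 8022)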