Try spark-shell with docker image

Blue Marble
4 min read · Jul 30, 2020

Summary

  1. Introduction
  2. Prerequisites
  3. Create docker network
  4. Create docker-compose
  5. Build docker container and start
  6. Start spark-shell
  7. Conclusion

Introduction

Apache Spark with Docker

Apache Spark is an open-source framework for high-speed distributed processing of large amounts of data.
To use it in earnest, it is usually necessary to build a cluster of multiple nodes (for example, a Hadoop/YARN cluster), but you can also try running it in your local environment.
In particular, the tool called spark-shell is a REPL (Read-Eval-Print Loop), which allows you to write a Spark program interactively and check the execution result immediately on the spot.
This article shows how to use a Docker image of Apache Spark and run spark-shell in a local environment.

Prerequisites

Docker and docker-compose are installed.

For example, the versions used here are:

$ docker -v
Docker version 19.03.6, build 369ce74a3c
$ docker-compose -v
docker-compose version 1.17.1, build unknown

Machine spec

The machine used here has 8 GB of memory in total, but it also works fine with about 4 GB.

$ cat /etc/issue
Ubuntu 18.04.2 LTS
$ cat /proc/meminfo | grep Mem
MemTotal: 8168284 kB
MemFree: 6812556 kB
MemAvailable: 7545960 kB

Create docker network

To use a dedicated docker network and fix the IP addresses assigned to the containers, define it with the following command.

$ docker network create spark-nw --subnet=172.30.0.0/16 --gateway=172.30.0.254
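To check that the network was created with the intended settings, you can list and inspect it; the output of docker network inspect should include the subnet 172.30.0.0/16 and gateway 172.30.0.254 specified above.

$ docker network ls --filter name=spark-nw
$ docker network inspect spark-nw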

It is also possible to define the docker network in the docker-compose.yml file (a sketch of that follows below).
But here it is intentionally created in advance, assuming that it will be shared with other docker-compose.yml files in the future.
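For reference only, if you preferred to let docker-compose manage the network instead of creating it in advance, the networks section of docker-compose.yml could look roughly like this (a sketch with the same subnet and gateway; it is not used in the rest of this article):

networks:
  spark-nw:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.0.0/16
          gateway: 172.30.0.254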

Create docker-compose

Prepare the following file (docker-compose.yml).
Here, “blueskyareahm/spark-base:latest” is specified as the Docker image of Apache Spark, but many other images are published and basically any of them will work.
By the way, this image contains Spark version 2.3.2 and is based on alpine:3.10.

version: '2'

services:
  spark-master:
    image: blueskyareahm/spark-base:latest
    hostname: doc-spark-master101.local
    container_name: spark-master101
    ports:
      - 4040:4040
      - 8080:8080
      - 7077:7077
    mem_limit: 1g
    networks:
      spark-nw:
        ipv4_address: 172.30.1.2
    command: /spark/bin/spark-class org.apache.spark.deploy.master.Master --host 0.0.0.0
    extra_hosts:
      - "doc-spark-worker101.local:172.30.1.3"

  spark-worker:
    image: blueskyareahm/spark-base:latest
    hostname: doc-spark-worker101.local
    container_name: spark-worker101
    depends_on:
      - spark-master
    ports:
      - 8081:8081
    mem_limit: 1g
    networks:
      spark-nw:
        ipv4_address: 172.30.1.3
    command: /spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077 --host 172.30.1.3
    extra_hosts:
      - "doc-spark-master101.local:172.30.1.2"

networks:
  spark-nw:
    external: true

From this docker-compose.yml file, one container will be created for each of spark-master and spark-worker.
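Before starting the containers, you can optionally check that the file parses correctly; docker-compose config validates the file and prints the resolved configuration.

$ docker-compose config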

Build docker container and start

Run the following command in the directory where docker-compose.yml is located.

$ docker-compose up --build -d
Pulling spark-master (blueskyareahm/spark-base:latest)...
(omitted)
Creating spark-master101 ...
Creating spark-master101 ... done
Creating spark-worker101 ...
Creating spark-worker101 ... done

You can confirm with the following command that the docker image specified in docker-compose.yml has been downloaded locally.

$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
blueskyareahm/spark-base latest 434f57cb14db 3 days ago 344MB

You can also see that a container has been started for each of spark-master and spark-worker.

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8643575faa0b blueskyareahm/spark-base:latest "/spark/bin/spark-cl…" 7 minutes ago Up 7 minutes 8080/tcp, 0.0.0.0:8081->8081/tcp spark-worker101
1ac9774f5ff9 blueskyareahm/spark-base:latest "/spark/bin/spark-cl…" 8 minutes ago Up 7 minutes 0.0.0.0:4040->4040/tcp, 0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp spark-master101
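To confirm that the worker has registered with the master, one way is to open the master web UI at http://localhost:8080 in a browser, or to look through the container logs; the master log should contain a line like "Registering worker" (the exact wording may vary by Spark version).

$ docker-compose logs spark-master | grep -i "registering worker"
$ docker-compose logs spark-worker | tail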

Start spark-shell

Start spark-shell with the following command.
It runs the spark-shell command inside the spark-master container.

$ docker-compose exec spark-master /spark/bin/spark-shell --master spark://localhost:7077
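If you prefer plain docker over docker-compose, an equivalent command using the container name defined above should be:

$ docker exec -it spark-master101 /spark/bin/spark-shell --master spark://localhost:7077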

The following interactive shell will start.

Spark context Web UI available at http://doc-spark-master101.local:4040
Spark context available as 'sc' (master = spark://localhost:7077, app id = app-20200727141606-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Try executing commands like the following.

scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)

scala> val rdd = sc.parallelize(list)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> rdd.count
res0: Long = 3

scala> rdd.collect
res1: Array[Int] = Array(1, 2, 3)

A list is declared with val list = List(1,2,3).
Then, based on that list, an RDD (Resilient Distributed Dataset), Apache Spark's characteristic data structure, is generated with sc.parallelize.
You can confirm that the size of the list is returned by calling count on the generated RDD.
You can also check that the contents of the list are returned by calling collect.
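As a small next step, you can chain transformations such as map and filter on the RDD before calling an action; the transformations only build a new RDD, and the computation runs when an action such as collect is called. Continuing the same session (the res numbers and <console> positions in the output may differ):

scala> val doubled = rdd.map(_ * 2)
doubled: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25

scala> doubled.filter(_ > 2).collect
res2: Array[Int] = Array(4, 6)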

Conclusion

Apache Spark's characteristic data type, the RDD, provides APIs that are roughly classified into “transformations” and “actions”.
* rdd.count and rdd.collect are classified as “actions”; transformations such as map and filter return a new RDD instead.
This distinction is a feature of Apache Spark, and it is one of the important points for using it properly: transformations are lazy and only describe the computation, while actions actually trigger it.

The provided APIs can be executed from spark-shell, which makes it very convenient to check how they behave.
You can also try the Streaming or Machine Learning APIs; a minimal Spark Streaming sketch is shown below.
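For example, a word count over a socket stream could be typed into the same spark-shell roughly as follows. This is only a sketch: it assumes some text source (such as netcat) is listening on port 9999 inside the container, and the worker needs at least two cores, since the socket receiver occupies one.

scala> import org.apache.spark.streaming.{Seconds, StreamingContext}
scala> val ssc = new StreamingContext(sc, Seconds(5))
scala> val lines = ssc.socketTextStream("localhost", 9999)
scala> lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()
scala> ssc.start()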
By using a Docker image, you can prepare a spark-shell execution environment in as little as five minutes.
