Try spark-shell with docker image

Blue Marble
4 min read · Jul 30, 2020

Summary

  1. Introduction
  2. Prerequisites
  3. Create docker network
  4. Create docker-compose
  5. Build docker container and start
  6. Start spark-shell
  7. Conclusion

Introduction

Apache Spark with Docker

Apache Spark is an open-source framework for high-speed distributed processing of large amounts of data.
To use it in earnest, it is usually necessary to build a cluster of multiple nodes (for example, a Hadoop/YARN cluster), but you can also try running it in your local environment.
In particular, the tool called spark-shell is a REPL (Read-Eval-Print Loop), which allows you to write a Spark program interactively and check the execution result immediately on the spot.
This article shows how to use a Docker image of Apache Spark and run spark-shell in a local environment.

Prerequisites

Docker and docker-compose are installed.

For example, the versions used here are:

$ docker -v
Docker version 19.03.6, build 369ce74a3c
$ docker-compose -v
docker-compose version 1.17.1, build unknown

Machine spec

The machine used here has 8 GB of memory in total, but it also works fine with about 4 GB.

$ cat /etc/issue
Ubuntu 18.04.2 LTS
$ cat /proc/meminfo | grep Mem
MemTotal: 8168284 kB
MemFree: 6812556 kB
MemAvailable: 7545960 kB

Create docker network

To use a dedicated docker network and fix the IP addresses assigned to the containers, define it with the following command.

$ docker network create spark-nw --subnet=172.30.0.0/16 --gateway=172.30.0.254
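To check that the network was created with the intended settings, you can list and inspect it; the output of docker network inspect should include the subnet 172.30.0.0/16 and gateway 172.30.0.254 specified above.

$ docker network ls --filter name=spark-nw
$ docker network inspect spark-nw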

It is also possible to define the docker network in the docker-compose.yml file (a sketch of that follows below).
But here it is intentionally created in advance, assuming that it will be shared with other docker-compose.yml files in the future.
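For reference only, if you preferred to let docker-compose manage the network instead of creating it in advance, the networks section of docker-compose.yml could look roughly like this (a sketch with the same subnet and gateway; it is not used in the rest of this article):

networks:
  spark-nw:
    driver: bridge
    ipam:
      config:
        - subnet: 172.30.0.0/16
          gateway: 172.30.0.254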

Create docker-compose

Prepare the following file (docker-compose.yml).
Here, “blueskyareahm/spark-base:latest” is specified as the Docker image of Apache Spark, but many other images are published and basically any of them will work.
By the way, this image contains Spark version 2.3.2 and is based on alpine:3.10.

version: '2'

services:
  spark-master:
    image: blueskyareahm/spark-base:latest
    hostname: doc-spark-master101.local
    container_name: spark-master101
    ports:
      - 4040:4040
      - 8080:8080
      - 7077:7077
    mem_limit: 1g
    networks:
      spark-nw:
        ipv4_address: 172.30.1.2
    command: /spark/bin/spark-class org.apache.spark.deploy.master.Master --host 0.0.0.0
    extra_hosts:
      - "doc-spark-worker101.local:172.30.1.3"

  spark-worker:
    image: blueskyareahm/spark-base:latest
    hostname: doc-spark-worker101.local
    container_name: spark-worker101
    depends_on:
      - spark-master
    ports:
      - 8081:8081
    mem_limit: 1g
    networks:
      spark-nw:
        ipv4_address: 172.30.1.3
    command: /spark/bin/spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077 --host 172.30.1.3
    extra_hosts:
      - "doc-spark-master101.local:172.30.1.2"

networks:
  spark-nw:
    external: true

From this docker-compose.yml file, one container will be created for each of spark-master and spark-worker.
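Before starting the containers, you can optionally check that the file parses correctly; docker-compose config validates the file and prints the resolved configuration.

$ docker-compose config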

Build docker container and start

Run the following command in the directory where docker-compose.yml is located.

$ docker-compose up --build -d
Pulling spark-master (blueskyareahm/spark-base:latest)...
(omitted)
Creating spark-master101 ...
Creating spark-master101 ... done
Creating spark-worker101 ...
Creating spark-worker101 ... done

You can confirm with the following command that the docker image specified in docker-compose.yml has been downloaded locally.

$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
blueskyareahm/spark-base latest 434f57cb14db 3 days ago 344MB

You can also see that a container has been started for each of spark-master and spark-worker.

$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8643575faa0b blueskyareahm/spark-base:latest "/spark/bin/spark-cl…" 7 minutes ago Up 7 minutes 8080/tcp, 0.0.0.0:8081->8081/tcp spark-worker101
1ac9774f5ff9 blueskyareahm/spark-base:latest "/spark/bin/spark-cl…" 8 minutes ago Up 7 minutes 0.0.0.0:4040->4040/tcp, 0.0.0.0:7077->7077/tcp, 0.0.0.0:8080->8080/tcp spark-master101
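To confirm that the worker has registered with the master, one way is to open the master web UI at http://localhost:8080 in a browser, or to look through the container logs; the master log should contain a line like "Registering worker" (the exact wording may vary by Spark version).

$ docker-compose logs spark-master | grep -i "registering worker"
$ docker-compose logs spark-worker | tail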

Start spark-shell

Start spark-shell with the following command.
It runs the spark-shell command inside the spark-master container.

$ docker-compose exec spark-master /spark/bin/spark-shell --master spark://localhost:7077
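If you prefer plain docker over docker-compose, an equivalent command using the container name defined above should be:

$ docker exec -it spark-master101 /spark/bin/spark-shell --master spark://localhost:7077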

The following interactive shell will start.

Spark context Web UI available at http://doc-spark-master101.local:4040
Spark context available as 'sc' (master = spark://localhost:7077, app id = app-20200727141606-0000).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_242)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Try executing commands like the following.

scala> val list = List(1,2,3)
list: List[Int] = List(1, 2, 3)

scala> val rdd = sc.parallelize(list)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26

scala> rdd.count
res0: Long = 3

scala> rdd.collect
res1: Array[Int] = Array(1, 2, 3)

A list is declared with val list = List(1,2,3).
Then, based on that list, an RDD (Resilient Distributed Dataset), Apache Spark's characteristic data structure, is generated with sc.parallelize.
You can confirm that the size of the list is returned by calling count on the generated RDD.
You can also check that the contents of the list are returned by calling collect.
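As a small next step, you can chain transformations such as map and filter on the RDD before calling an action; the transformations only build a new RDD, and the computation runs when an action such as collect is called. Continuing the same session (the res numbers and <console> positions in the output may differ):

scala> val doubled = rdd.map(_ * 2)
doubled: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25

scala> doubled.filter(_ > 2).collect
res2: Array[Int] = Array(4, 6)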

Conclusion

Apache Spark's characteristic data type, the RDD, provides APIs that are roughly classified into “transformations” and “actions”.
* rdd.count and rdd.collect are classified as “actions”; transformations such as map and filter return a new RDD instead.
This distinction is a feature of Apache Spark, and it is one of the important points for using it properly: transformations are lazy and only describe the computation, while actions actually trigger it.

The provided APIs can be executed from spark-shell, which makes it very convenient to check how they behave.
You can also try the Streaming or Machine Learning APIs; a minimal Spark Streaming sketch is shown below.
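For example, a word count over a socket stream could be typed into the same spark-shell roughly as follows. This is only a sketch: it assumes some text source (such as netcat) is listening on port 9999 inside the container, and the worker needs at least two cores, since the socket receiver occupies one.

scala> import org.apache.spark.streaming.{Seconds, StreamingContext}
scala> val ssc = new StreamingContext(sc, Seconds(5))
scala> val lines = ssc.socketTextStream("localhost", 9999)
scala> lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).print()
scala> ssc.start()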
By using a Docker image, you can prepare a spark-shell execution environment in as little as five minutes.
