Super simple example of Apache Spark Streaming DStreams (Discretized Streams)

Blue Marble
4 min read · Oct 29, 2020

In this article, I introduce a super simple Spark Streaming application.

Two Versions of Streaming

  1. DStreams : the traditional model, based on the RDD API
  2. Structured Streaming : the newer model, based on the Spark SQL engine

Structured Streaming is newer and is said to offer a more elegant programming model.
In practice, however, most implementations still seem to be done with the DStreams model.

In this article, let’s look at the DStreams model.
The programming language is Java 8.

DStreams model

Let’s walk through the DStream process step by step.

Source: https://spark.apache.org/docs/latest/streaming-programming-guide.html

  1. We have some kind of input data stream, which is continuous.
    The data represents events arriving continuously at random intervals; log messages are an easy example to picture.
  2. Spark Streaming receives this input data and batches it into blocks.
    These batches of input data simply become RDDs.
    If we configure the streaming interval as, for example, one second, then every second Spark Streaming creates an RDD from the data received during that last second.
  3. These RDDs are fed into the Spark engine, and this is where we write our code to operate on them.
    In fact, the unbounded series of RDDs is gathered together into an abstraction called a DStream.
    A DStream is all of these batches of input data combined; in other words, Spark Streaming converts a continuous data stream into a series of discrete batches of input data. A minimal sketch of this idea follows below.
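
As a rough illustration of the batching model (the class name, host, port, and 1-second interval here are just placeholders, not from any particular application), foreachRDD exposes the one RDD that Spark Streaming creates per interval:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class DStreamConcept {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("DStreamConcept").setMaster("local[*]");
        // A 1-second interval: Spark Streaming cuts the input into one RDD per second
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(1));

        // Hypothetical source; any text stream would do
        JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // The DStream is the series of these per-second RDDs; foreachRDD hands us each one
        lines.foreachRDD((rdd, time) ->
                System.out.println("Batch at " + time + " holds " + rdd.count() + " records"));

        ssc.start();
        ssc.awaitTermination();
    }
}
```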

Simulating the streaming input

It just picks the log level “INFO” or “ERROR” randomly and outputs dummy log lines via port 8889.
Let’s call this application “StreamLogGenerator”.
Note that this is not a Spark application.

StreamLogGenerator.java
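
A minimal sketch of such a generator, assuming a simple message format and a random sleep interval (both are illustrative choices, not taken from the original source):

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Random;

public class StreamLogGenerator {
    public static void main(String[] args) throws IOException, InterruptedException {
        String[] levels = {"INFO", "ERROR"};
        Random random = new Random();

        // Wait for a client (the Spark receiver) to connect on port 8889
        try (ServerSocket serverSocket = new ServerSocket(8889);
             Socket socket = serverSocket.accept();
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {

            while (true) {
                // Pick "INFO" or "ERROR" randomly and emit one dummy log line
                String level = levels[random.nextInt(levels.length)];
                out.println(level + " " + System.currentTimeMillis() + " dummy log message");

                // Sleep a random interval (< 1 sec) to simulate irregular event arrival
                Thread.sleep(random.nextInt(1000));
            }
        }
    }
}
```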

Spark Streaming application

First, we add the dependency artifacts for Spark and logging.
For example, with Maven they go into pom.xml.

pom.xml
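
The exact dependencies depend on the environment; as a sketch, a Spark 3.x / Scala 2.12 setup would need something like the following (the versions are assumptions, so match them to your cluster). Spark itself pulls in its logging libraries transitively:

```xml
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.12</artifactId>
        <version>3.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_2.12</artifactId>
        <version>3.0.1</version>
    </dependency>
</dependencies>
```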

Next is the Spark Streaming application itself.
Let’s call this application “LogStreamProcessor”.

LogStreamProcessor.java
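
A minimal sketch of the processor, following the five steps explained below (the variable names are illustrative, except sc, which matches the step descriptions):

```java
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class LogStreamProcessor {
    public static void main(String[] args) throws InterruptedException {
        // 1. Set logger: suppress the many INFO lines from the Spark libraries
        Logger.getRootLogger().setLevel(Level.ERROR);

        // 2. Create the Spark streaming context with a 5-second batch interval
        SparkConf conf = new SparkConf()
                .setAppName("LogStreamProcessor")
                .setMaster("local[*]");
        JavaStreamingContext sc = new JavaStreamingContext(conf, Durations.seconds(5));

        // 3. Create the receiver: socketTextStream needs only hostname and port
        JavaDStream<String> lines = sc.socketTextStream("localhost", 8889);

        // 4. Receive data: collect each batch to the driver and print line by line
        //    (collect() is fine here because each 5-second demo batch is tiny)
        lines.foreachRDD(rdd -> rdd.collect().forEach(System.out::println));

        // 5. Start streaming and wait for the computation to terminate
        sc.start();
        sc.awaitTermination();
    }
}
```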

  1. Set logger
    To minimize the output on the console, set the log level to “ERROR”; otherwise the Spark libraries print many log lines.
  2. Create Spark streaming context
    As with a Spark core application, a Spark Streaming application also has to prepare a context.
    The Spark streaming context takes a Durations parameter, which specifies the interval of the streaming process.
  3. Create receiver
    The Spark streaming context provides several kinds of streams, for example fileStream, queueStream, and socketStream.
    To receive data via a socket we could use socketStream, but socketStream requires a converter that defines how to convert the input data.
    socketTextStream requires just a hostname and a port, which is simpler.
  4. Receive data
    The receiver picks up data whenever there is input on the specified port.
    In this example, we just take the lines one by one and print them.
  5. Start streaming
    When we call sc.start(), Spark Streaming starts the computation.
    sc.awaitTermination() then waits for the computation to terminate.

Try to execute the application

First, start “StreamLogGenerator”, which will output dummy logs to the specified port (8889).

Next, start “LogStreamProcessor”, which will receive the dummy logs from the specified port (8889) via Spark Streaming.

We will see the received messages in our console, roughly like below.
The output should be updated every 5 seconds, because we specified a Duration of 5 seconds in the Spark streaming context.
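
Assuming the generator sketch above, each batch would print something like:

```
INFO 1603948800123 dummy log message
ERROR 1603948800541 dummy log message
INFO 1603948801207 dummy log message
```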

(Note) If either “StreamLogGenerator” or “LogStreamProcessor” is stopped, we have to restart both applications to resume the process.

Now we are able to implement a super simple Spark Streaming application.
Whenever we learn a new technology, we can start from a simple example like this.
As a next step, we can try transforming or aggregating the input data.
