The digital era produces an endless amount of data that engineers have to process efficiently. Walk through the Spark Streaming with Kafka example in this guide to learn how to deal with streams of data.
Why Spark Streaming And Kafka Integration?
Spark Streaming is an extension of the core Spark API. It lets you ingest data from many sources, such as TCP sockets, Kinesis, and Kafka, and process it with complex algorithms built from high-level operations like window, join, reduce, and map.
Spark Streaming represents such a flow of data as a DStream (discretized stream), a high-level abstraction over a continuous stream. This guide will help you learn the basics.
Internally, Spark receives the input stream and splits it into small batches, which the Spark engine then processes to produce the final stream of results. This micro-batch mechanism enables a wide variety of integrations with different types of data sources, including Apache Kafka.
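To make the micro-batch model concrete, here is a minimal DStream sketch in PySpark; the host, port, batch interval, and app name are illustrative, and it assumes a text source writing lines to a TCP socket:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicroBatchSketch")  # app name is illustrative
ssc = StreamingContext(sc, batchDuration=5)    # cut the stream into 5-second batches

# Each batch arrives as an RDD of lines; count words per batch.
lines = ssc.socketTextStream("localhost", 9999)
counts = lines.flatMap(lambda line: line.split()) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.pprint()  # print the result of every micro-batch

ssc.start()
ssc.awaitTermination()
Every five seconds, Spark turns the data received in that window into an RDD and runs the word count on it, which is exactly the batching behavior described above.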
Kafka itself is also a stream-processing platform offering end-to-end solutions for mission-critical applications that need high-performance data integration, streaming analytics, and data pipelines.
It is built around the concepts of producers and consumers. Producers publish messages to topics of their choice, to which consumers can subscribe to receive them. Consumers can act alone or belong to a larger consumer group.
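As a rough illustration of this model, here is a minimal producer/consumer sketch using the third-party kafka-python package rather than Spark; the broker address, topic, and group name are placeholders:
from kafka import KafkaProducer, KafkaConsumer

# Producer: publish a message to a topic (broker address is a placeholder).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("topic1", b"hello from a producer")
producer.flush()

# Consumer: subscribe to the topic; consumers sharing a group_id split the partitions.
consumer = KafkaConsumer(
    "topic1",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)
    break  # read a single message for the sake of the example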
Kafka stores streams of data durably for the long term while handling a high volume of messages per second.
By integrating Kafka with Spark, you can keep data loss to a minimum while retaining all the received data.
Spark ensures you can process a huge amount of data, including real-time streams of data. Meanwhile, Kafka makes it possible to ingest events from producers to consumers.
This combination keeps your applications in sync and enables easy recovery. You can read messages from one or multiple topics in Kafka. This degree of flexibility makes the integration highly scalable and fault tolerant.
Spark Streaming With Kafka Example
To create the integration between Spark Streaming and Kafka, you will need a build definition file named build.sbt describing details of the application, such as its library dependencies. sbt will use it later to download the modules required to compile and package your application.
Your build.sbt file may look something like this:
name := "Spark-Kafka Integration"
version := "0.1"
scalaVersion := "2.13.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10" % "3.2.0"
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.2.0"
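Note that the plain spark-streaming-kafka artifact only exists for older Spark releases; for Spark 3.x the Kafka connectors are spark-streaming-kafka-0-10 (for the DStream API) and spark-sql-kafka-0-10, which provides the "kafka" data source used by the DataFrame examples below. With the file in place, run sbt package from the project root to resolve the dependencies and build the application.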
To read a streaming Dataset from Kafka in PySpark, call readStream on a SparkSession and set the source format to "kafka". In the snippet below, the SparkSession is created explicitly with an illustrative app name; the host and port are placeholders for your Kafka brokers.
from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession entry point; the app name is illustrative.
spark = SparkSession.builder.appName("SparkKafkaIntegration").getOrCreate()

df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "yourhost:yourport") \
.option("subscribe", "topic1") \
.load()
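The resulting DataFrame delivers Kafka's key and value columns as raw binary. A minimal sketch of consuming the stream casts them to strings and writes each micro-batch to the console:
query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .writeStream \
    .format("console") \
    .start()

query.awaitTermination()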
If you want to subscribe to one topic with headers, add this option to the function call above:
option("includeHeaders", "true")
Spark also supports subscribing to a comma-separated list of topics or to a topic pattern (only one of these options can be set per query):
.option("subscribe", "topic1,topic2")
.option("subscribePattern", "topic.*")
If batch processing fits your use case better, you can create a DataFrame or Dataset for a fixed range of Kafka offsets. This PySpark statement subscribes to one topic and, for batch queries, reads from the earliest to the latest offsets by default:
df = spark \
.read \
.format("kafka") \
.option("kafka.bootstrap.servers", "yourhost:yourport") \
.option("subscribe", "topic1") \
.load()
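A batch query returns an ordinary DataFrame, so the usual actions apply once the binary key and value are cast to strings:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
    .show(truncate=False)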
Change the options when you need to subscribe to a pattern instead, or to bound a batch read with explicit starting and ending offsets:
.option("subscribePattern", "topic.*") \
.option("startingOffsets", "earliest") \
.option("endingOffsets", "latest") \
Conclusion
We hope this Spark Streaming with Kafka example gives you a clear idea of how the integration works. The combination can help you process a growing amount of data coming from numerous sources.