Visualization of Kafka Streams
When I started working with Kafka Streams, it was not easy for me to visualize the concepts, but the official documentation helped me a lot. I have since successfully deployed a streams application using Mesos and Marathon (a production-proven Apache Mesos framework for container orchestration, providing a REST API for starting, stopping, and scaling applications).
While writing that application, I thought I should share my knowledge with everyone.
So, let’s understand Kafka Streams first. If you know Java Streams (Java 8), it will be easy to visualize the Kafka Streams architecture. The word “stream” itself means A CONTINUOUS FLOW OF DATA in the world of information technology. Now, if you want a continuous flow of data, you need a pipeline that carries your data continuously. So picture a pipeline with data flowing in and out, where the pipeline can perform different actions while processing the data, after consuming it, or before publishing it out.
I hope you now have a broad picture of a stream (a Kafka stream) in your mind. Consider the data source to be Kafka data (events/messages), and the data target to be some kind of database, Kafka again, or some other data storage system.
Now we have to think about which actions can be performed while data is in the pipeline. You can get all of this information from the official website; here I will try to provide a visualization, starting with the data source.
Please pick up some basic knowledge of Kafka first, so that we can talk about partitions, replication, brokers, and Kafka threads. Let’s consider a scenario with 3 Kafka brokers and, for each topic, 4 partitions and 3 replicas. Here, P-0, P-1, P-2, and P-3 are the partitions of a topic “XYZ”, and P-0-R-0 and P-0-R-1 are replicas of partition P-0 (and similarly for the other partitions).
In Kafka Streams, the number of stream tasks equals the number of input topic partitions. So, in our case, there will be 4 tasks, which means a total of 4 pipelines.
We have to understand one thing about Kafka Streams: it is a distributed, parallel-processing library, and we cannot achieve parallel processing without threads. So Kafka Streams lets you decide how many threads you want to run your tasks on.
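As a rough sketch of what that looks like in practice (the application id and broker address below are placeholders I made up, not from a real deployment), the thread count is just a configuration property, `num.stream.threads`:

```java
import java.util.Properties;

public class StreamsConfigSketch {
    public static Properties buildConfig() {
        Properties props = new Properties();
        // Identifies this streams application (also used as the consumer group id)
        props.put("application.id", "xyz-stream-app");  // hypothetical name
        props.put("bootstrap.servers", "broker1:9092"); // placeholder broker
        // Run 2 stream threads in this instance; our 4 tasks are
        // distributed across them, i.e. roughly 2 tasks per thread
        props.put("num.stream.threads", "2");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildConfig().getProperty("num.stream.threads"));
    }
}
```

If you start a second instance of the same application, the tasks are rebalanced across all threads of all instances, which is how you scale out.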
What do we mean by a Kafka Streams task?
A task is nothing but a process/program that achieves your desired result.
How do you write a Kafka Streams task?
Kafka Streams lets you define your task using a topology. Think of a topology as a graph of connected data points.
Each node of the topology/graph performs an action that takes you toward your desired result, which is why the topology is known as a processor topology.
The following actions can be performed:
- Filter & FilterNot
- Branch & Process
- SelectKey
- Map & MapValues
- Through & To
- FlatMap & FlatMapValues
- Transform & TransformValues
- Merge & Peek
- GroupBy & GroupByKey
- ForEach
- Join, Left Join & Outer Join
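To make the “node” idea concrete, here is a minimal sketch of a processor topology using the Kafka Streams DSL (the topic names are made up, and it assumes the kafka-streams library is on the classpath). Each chained call adds one processor node to the graph:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class TopologySketch {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        // Source node: consume from topic "XYZ"
        KStream<String, String> source =
                builder.stream("XYZ", Consumed.with(Serdes.String(), Serdes.String()));
        source.filter((key, value) -> value != null)           // Filter node
              .mapValues(value -> value.toUpperCase())         // MapValues node
              .peek((key, value) -> System.out.println(value)) // Peek node (side effect only)
              .to("XYZ-upper", Produced.with(Serdes.String(), Serdes.String())); // Sink node
        return builder.build();
    }
}
```

Reading the chain top to bottom gives you exactly the graph we described: source → filter → mapValues → peek → sink.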
We will try to understand the above-mentioned actions later in the series; what is important to understand right now is that each action is nothing but a node of the processor topology.
Now a very important concept comes into the picture: the storage of pipeline data. Kafka Streams provides a way to store your intermediate or transformed data, known as a State Store. By default, state stores are implemented on top of RocksDB (an embedded key-value store developed at Facebook), but you can create your own state store backed by another fast, efficient database such as Redis, Aerospike, or MongoDB. We will explore the depth of the State Store later in the series. For now, we should figure out how a Kafka Streams thread consumes data from the provided topic and performs different actions to achieve the desired result.
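As a sketch of where a state store shows up in the DSL (again assuming the kafka-streams library; the store name “counts-store” and topic names are hypothetical), a `count()` aggregation materializes its running totals into a named store, backed by RocksDB by default:

```java
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class StateStoreSketch {
    public static void build(StreamsBuilder builder) {
        KStream<String, String> source = builder.stream("XYZ");
        // groupByKey + count keeps a running total per key; the intermediate
        // state lives in a local state store named "counts-store"
        KTable<String, Long> counts = source
                .groupByKey()
                .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
        counts.toStream().to("XYZ-counts"); // hypothetical output topic
    }
}
```

The store is local to each task, and Kafka Streams also backs it up to an internal changelog topic so the state can be rebuilt if a task moves to another instance.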
Let’s consider the above example of brokers, partitions, and replicas again. We had a total of 4 tasks to execute, so we will start 2 stream threads to run them. One interesting point: all 4 tasks aim at the same result and share the same topology, meaning they perform exactly the same actions. Every thread consumes events/data/messages from the topic, performs all the actions, and forwards/pushes the result to some database, Kafka, or a file.
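Putting the pieces together, starting those threads is just a matter of handing a topology plus configuration to a `KafkaStreams` instance. A minimal runnable sketch (placeholder topic, application id, and broker address; requires the kafka-streams library):

```java
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;

public class RunSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("application.id", "xyz-stream-app");  // hypothetical
        props.put("bootstrap.servers", "broker1:9092"); // placeholder
        props.put("num.stream.threads", "2");           // 2 threads share the 4 tasks

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("XYZ");
        source.to("XYZ-out"); // trivial pass-through topology

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start(); // spins up the stream threads, which claim tasks
        // Close cleanly on JVM shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Each of the 2 threads picks up 2 of the 4 tasks and runs the same topology over its own partitions.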
I think you now have a pretty good idea about Kafka Streams. Wait for my next episode, which will explain the different types of actions and state stores. We will also try to understand security, memory management, and the deployment process. There are some very important concepts that we must visualize, such as KStream, KTable, GlobalKTable, windowing, and aggregations.
Thanks for your support. I was overwhelmed by your responses to my last blog. Thanks again.
Follow me on Twitter for more blogs.