This is the first post in a blog series dedicated to Apache Kafka and its usage for solving problems in the Big Data domain. Using a hands-on approach and exploring the performance characteristics and limits of Kafka-based Big Data solutions, the series will draw parallels with road racing. The reason for this is twofold. First, racing is a highly technical discipline where every tiny detail has to be mastered in order to win a race, and in a highly distributed and scalable system that deals with high data volume and/or velocity, the situation is pretty much the same. Second, well, it’s catchy and fun.
In racing, knowing the circuit and its corners is a must. It is not enough to be a highly skilled driver with a shiny engine - without mastering the circuit, you are doomed. The better you know the problem you’re solving, the higher the chance that you’ll solve it, and solve it in an efficient and robust way. The purpose of this post is to explore the cuts and corners that a Kafka racing vehicle needs to conquer on a Big Data circuit.
It is not rare for people who haven’t had a chance to explore Kafka and its potential use cases thoroughly to put Kafka in the first box (use case) they learned about and stop there. Like “Yeah, I know about Kafka - it’s for stream processing. Period.” Or “Yes, we know and use Kafka. It’s for implementing the ingress part of the system.”
Kafka is a performant implementation of the general concept of a distributed commit log and, as such, can be used in many places where a log is required or where a log helps to implement an efficient solution. As it turns out, the commit log abstraction is present in the internals of distributed systems and can be applied in far more cases than we tend to see at first. With Kafka, this simple structure, which has been known and used for so many years, is going through its renaissance.
To understand the nature of the distributed commit log, one can start by reading Jay Kreps’ famous post.
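To make the abstraction concrete, here is a minimal, single-process sketch of an append-only log with offsets. It is only an illustration of the idea, not Kafka's implementation: the `CommitLog` class and its method names are made up for this example, and there is no distribution, replication or persistence in it.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CommitLog:
    """Append-only sequence of records; each record gets an increasing offset."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        """Return up to max_records starting at offset; reading never removes data."""
        return self._records[offset:offset + max_records]


log = CommitLog()
for event in ["created", "updated", "deleted"]:
    log.append(event)

# Two readers keep their own positions and do not affect each other.
fast_reader = log.read(0)       # reads everything available
slow_reader = log.read(0, 1)    # reads only the first record for now
```

The important property is that records are only ever appended and are addressed by position, so any number of readers can go through the same data independently.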
But don’t forget: despite the fact that Kafka can offer great performance and efficient problem solutions, it is far from being a golden hammer - you cannot expect to win all circuits driving the same vehicle… although you may try… Sometimes you simply need another technology to solve the problem at hand.
Kafka Use Case Patterns
Kafka is built from the ground up as a distributed system, natively handling replication, fault-tolerance and partitioning. Kafka does a good job of persistence. Data written to Kafka is persisted to disk and can be re-read for as long as it is retained. Kafka should be viewed as a cluster, not just a collection of individual brokers. Such an approach has a substantial impact on everything from how you manage it to how Kafka-based applications behave.
The following few examples are a list of use case patterns rather than use cases themselves. It is an attempt to group some of the most prominent use cases according to the place and function that the parts implemented with Kafka have in the system architecture.
Such a classification is certainly not a perfect one, and inevitably there are overlaps between concrete use cases, but I believe it can be useful to illustrate the types of Kafka usage and their respective requirements and demands.
Kafka for Input/Output plumbing
In many instances there is a need to build an application or a system that needs to accept a large number of input messages. For example, there is a fire hose of external events, usually coming from multiple sources, that need to be accepted and processed in a reliable and performant way. The part of the system architecture responsible for this is usually said to implement the data ingestion or ingress path.
Fig 1 - Using Kafka as Ingress Layer
The most recognized use cases for this are applications from the IoT domain. One of the primary requirements of these applications is to accept messages coming from a large number of devices in real time. Each message is usually small (e.g. up to a few kilobytes), but messages can vary significantly in format, frequency and number.
Web applications that need to collect a huge number of user events are another very common use case. For example, collecting click events generated by an ever-increasing number of web page visitors.
In fact, I would argue that practically any modern, large scale application has a need for a well implemented ingestion part. There are several reasons for this claim.
First, there are the non-functional requirements that these applications need to satisfy. For example, a high availability (high uptime or no downtime) requirement practically means that the application needs to be able to accept input data virtually without loss (IoT devices usually cannot re-send event messages, all web page users’ actions have to be captured, etc.) even when there are internal system problems (e.g. servers are down because of maintenance or a malfunction).
Another reason for the ingestion part would be the need to simplify the application implementation. This practically means avoiding direct processing of input messages by the application’s business logic handlers. Instead of having handlers responsible for accepting input data, delegating sub-requests to other internal services and sending the response back, the whole process is decomposed into steps by applying the Command and Query Responsibility Segregation (CQRS) pattern, where the first step, command acceptance, becomes the ingestion part of the system.
A Kafka topic can be efficiently used for the data ingestion implementation. The scalable nature of Kafka brokers supports use cases with a high throughput (a large number of events/messages to be accepted) and the need to increase it. The concept of durable Kafka topics (persisted messages) and the isolation between message publishers and consumers implicitly eliminate the back-pressure issue, as Kafka can buffer sudden spikes in the incoming message rate while consumers maintain their own processing pace. If consumers are down for any reason, durable topics keep accepting incoming messages without loss, and the consumers, once they are back online, can continue processing data where they left off.
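The buffering behaviour described above can be sketched with a toy in-memory model. This is not Kafka client code: `BufferedTopic`, `publish`, `poll` and `lag` are hypothetical names, and a real topic persists messages on disk across brokers, while this stand-in just keeps a Python list and one committed offset.

```python
from typing import List


class BufferedTopic:
    """Hypothetical stand-in for a durable topic: a list plus one committed offset."""

    def __init__(self) -> None:
        self.messages: List[str] = []
        self.committed = 0  # offset of the next message the consumer has not processed

    def publish(self, batch: List[str]) -> None:
        # Producers only append; they never wait for the consumer.
        self.messages.extend(batch)

    def poll(self, max_messages: int) -> List[str]:
        # The consumer pulls at its own pace and then commits its new position.
        batch = self.messages[self.committed:self.committed + max_messages]
        self.committed += len(batch)
        return batch

    def lag(self) -> int:
        # Number of messages published but not yet processed.
        return len(self.messages) - self.committed


topic = BufferedTopic()
topic.publish([f"event-{i}" for i in range(5)])       # normal traffic
first = topic.poll(max_messages=2)                    # consumer handles 2, then goes offline
topic.publish([f"event-{i}" for i in range(5, 12)])   # a spike arrives while it is down
backlog = topic.lag()                                 # messages waiting in the buffer
resumed = topic.poll(max_messages=3)                  # consumer restarts where it left off
```

The producer side never blocks, the backlog simply grows during the outage, and the consumer resumes from its committed offset rather than from the newest message.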
Similarly to the data ingress path, the data egress (output) path is prominently present in certain types of applications. The need to isolate a system’s inner workings (or parts) from the outside clients/consumers makes the egress path a standalone/distinctive component of the system.
Fig 2 - Using Kafka as Egress Layer
For example, clients that process output messages at different and variable speeds can cause a back-pressure issue for the application. Clients that start consuming data at different times, which means that data has to be stored for later consumption, can be another issue for the application.
The way Kafka topics implement the publish/subscribe messaging pattern allows a number of clients (consumers) to consume the output messages. The important thing here is that different clients consume messages independently and at their own pace. They can start at different times. If clients lose some messages, they can ask Kafka to replay them. Durable Kafka topics can solve the problem of slow clients by acting as a message buffer, which also solves the back-pressure problem.
Kafka as Data Backbone
Modern and complex systems usually have to cope with large amounts of data but at the same time provide scalability, short downtime and failure resilience, while their architecture has to remain flexible in order to support an easy evolution (e.g. new business requirements, application of new technologies).
One intrinsic characteristic of such systems is that the same data is used through many access patterns. For example, one process accesses data in a serial manner (as time series) while another one needs to index data and access it randomly. This implies that different technologies need to be used at the same time to support different access patterns. An example of this would be the use of different databases - a relational database (e.g. Postgres) for ad-hoc BI, a NoSQL storage (e.g. Apache Cassandra) for time series, and Elasticsearch for indexing and random access. Having multiple storage subsystems that operate on the same data and keeping them all in sync is a tough challenge.
The next important characteristic is that all future data handling strategies usually cannot be foreseen and, therefore, planned in the beginning. New data handling techniques, algorithms and/or business use cases appear after the system architecture is already implemented and data accepted and saved. So the original data/information has to be preserved and there has to be a mechanism that allows data reprocessing.
The third characteristic is that, over time, new system components are developed, using new technologies. They have to be tested and used to replace old versions, without breaking the production system. In case of an error, a rollback path has to be ensured. Moreover, all of this is usually expected without service downtime.
The system layer that is capable of supporting the above demands is usually called a Central Data Pipeline. The SSOT (Single Source of Truth) pattern can also be implemented using such an approach. Arguably, it can also be seen as a variant of a Data Lake (Data Pond?) implementation.
Fig 3 - Using Kafka as Central Data Pipeline
Apache Kafka was originally created at LinkedIn as an implementation of a central data pipeline. That is why Kafka is a natural fit for this kind of use case.
Kafka topics are designed to persist all data and allow for later data re-processing (data replay on the consumer side). At the same time, a single Kafka topic can be consumed by multiple consumers independently. This means that consumers can consume the topic data at their own pace and at any time.
A system can use Kafka topics to store input (original/raw) data and then use connectors (as Kafka consumers) that can process and store the data in multiple warehouse systems/databases. If a new database is added later, it can simply replay the data from the beginning and catch up with the rest of the system.
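As a rough sketch of this idea, the toy model below feeds the same raw records into several stores, each through its own consumer with its own offset, and then adds a new store later that replays the topic from the beginning. Everything here is an assumption made for illustration: the `Sink` class, the `(device_id, value)` record shape and the in-memory "stores" stand in for real connectors and databases.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Reading = Tuple[str, int]  # (device_id, value), a made-up record shape

raw_topic: List[Reading] = [("dev-1", 20), ("dev-2", 35), ("dev-1", 22)]


class Sink:
    """A consumer that applies every record to its own store and tracks its offset."""

    def __init__(self, apply: Callable[[Reading], None]) -> None:
        self.apply = apply
        self.offset = 0

    def catch_up(self, topic: List[Reading]) -> None:
        # Process everything published since this sink's last position.
        for record in topic[self.offset:]:
            self.apply(record)
        self.offset = len(topic)


time_series: List[Reading] = []                        # serial access
by_device: Dict[str, List[int]] = defaultdict(list)    # indexed, random access
latest: Dict[str, int] = {}                            # store introduced later


def index_record(record: Reading) -> None:
    by_device[record[0]].append(record[1])


def update_latest(record: Reading) -> None:
    latest[record[0]] = record[1]


series_sink = Sink(time_series.append)
index_sink = Sink(index_record)
series_sink.catch_up(raw_topic)
index_sink.catch_up(raw_topic)

raw_topic.append(("dev-2", 40))  # more raw data arrives

# The new sink starts at offset 0, so it replays the full history and catches up.
latest_sink = Sink(update_latest)
for sink in (series_sink, index_sink, latest_sink):
    sink.catch_up(raw_topic)
```

The existing sinks only process the one new record, while the late arrival processes all four, and none of them needs to know about the others.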
If a new data strategy is to be introduced, again, the new process implementation only has to be connected to the respective Kafka topics. It can then replay the data and apply new algorithms without any impact on the existing infrastructure.
The same applies to introducing new versions of the existing systems. They can run in parallel with the existing infrastructure in a test phase and later on, if everything goes well, they can take over from the old implementation.
As with everything in the real world, there are some drawbacks. Using Kafka topics for huge files is not a viable approach. There has to be an external storage (e.g. AWS S3) that can accept binary data, while a Kafka topic can be used to store references to it.
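A minimal sketch of this "store the file elsewhere, publish a reference" approach could look as follows. The `blob_store` dictionary stands in for an object store such as S3 and the `topic` list for a Kafka topic; the function names and the JSON message layout are assumptions made for this example, not an existing API.

```python
import hashlib
import json
from typing import Dict, List

blob_store: Dict[str, bytes] = {}   # stand-in for an external object store
topic: List[str] = []               # stand-in for a topic carrying small JSON messages


def publish_large(payload: bytes, content_type: str) -> str:
    """Store the payload externally and publish only a reference to it."""
    key = hashlib.sha256(payload).hexdigest()  # content hash used as the storage key
    blob_store[key] = payload
    topic.append(json.dumps({"blob_key": key, "size": len(payload), "type": content_type}))
    return key


def consume(message: str) -> bytes:
    """Resolve the reference from a topic message back into the original payload."""
    ref = json.loads(message)
    return blob_store[ref["blob_key"]]


video = b"\x00" * 5_000_000   # a 5 MB payload, well above a typical message size
publish_large(video, "video/mp4")
```

The message that travels through the topic stays at roughly a hundred bytes regardless of the file size, and consumers fetch the binary data only when they actually need it.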
Another issue is related to scaling. Kafka splits logical topics into partitions, so partitions are the basic unit of scaling. One logical topic can be spread across many Kafka brokers, where each broker handles one or more topic partitions, but reads and writes for a single partition cannot be handled by two or more brokers at the same time. This puts a limit on scaling, as there is no point in having more brokers than topic partitions from the standpoint of scaling out a single topic. So solid capacity planning is needed. Repartitioning data is certainly possible, and using Kafka topics and data re-processing can be helpful for that, but it is still a costly operation.
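The scaling limit can be illustrated with a small simulation. It is a simplification under stated assumptions: `partition_for` uses CRC32 purely for illustration (Kafka's default partitioner uses a different hash), `assign_leaders` spreads partitions round-robin, and replicas are ignored entirely. Both function names are made up for this sketch.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition, so the same key always lands in the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_leaders(num_partitions: int, brokers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions round-robin across brokers (a simplified placement)."""
    assignment: Dict[str, List[int]] = {broker: [] for broker in brokers}
    for partition in range(num_partitions):
        assignment[brokers[partition % len(brokers)]].append(partition)
    return assignment


small_cluster = assign_leaders(4, ["b1", "b2"])
large_cluster = assign_leaders(4, ["b1", "b2", "b3", "b4", "b5", "b6"])

# Brokers that received no partition of this topic add no capacity for it.
idle = [broker for broker, parts in large_cluster.items() if not parts]
```

With four partitions, going from two to four brokers spreads the load, but the fifth and sixth broker end up with nothing to do for this topic, which is why the partition count has to be planned ahead.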
Kafka to bind them all
Splitting complex systems into smaller, loosely-coupled building blocks, microservices, is a popular architectural approach, known for its many benefits. Every microservice can be seen as an independent, encapsulated unit with well defined and limited responsibility. It accepts some input data, over its one or more input APIs, it processes data/messages, according to its business logic, maintains its internal states, and produces some output data/messages via one or more output APIs. Sounds like a perfect world, doesn’t it?
But there’s a catch. In order to implement the logic of a complex system, the number of microservices can be pretty high. In some cases, an intent to limit a microservice's responsibility to a minimum can lead to something called ‘nanoservice architecture’, with a large number of tiny building blocks. Having such a large number of interconnected components that communicate with each other creates a mesh, or spaghetti, if you will, that is nearly impossible to maintain and debug.
Fig 4 - Using Kafka to decouple microservices
The usual remedy to this spaghetti situation is the introduction of a messaging system that decouples microservices by using messages (instead of direct API calls) to implement input and output APIs.
The concept of durable Kafka topics with guaranteed message delivery semantics is a perfect fit for this use case. A Kafka topic can be used as a message pipe that connects two or more microservices in different topologies, combining publish/subscribe and fan-out message patterns.
Its durability, delivery semantics and guarantees, and message replayability make it possible to run connected microservices independently. That is, both the message consumer and the message publisher can be stopped at any time without any problem. This also means that one microservice, or a group of them, can be tested easily, simply by publishing test messages to input topics and consuming output results from output topics.
Replicated Kafka topics, combined with the fan-out message pattern, can be used to easily scale the system capacity (throughput) simply by increasing the number of microservice instances.
Moreover, Kafka topics with log compaction and/or the event-sourcing pattern can be used efficiently for saving microservices’ internal states. The states of one microservice, or even of a whole subtree of connected microservices, can be recreated by replaying messages.
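The relation between replaying a changelog and compacting it can be sketched as follows. This is a simplified in-memory model, not how Kafka performs compaction internally: the `(key, value)` record shape, the use of `None` as a deletion marker and the function names are assumptions for this example.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, Optional[int]]  # (key, value); a None value marks a deletion


def rebuild_state(log: List[Record]) -> Dict[str, int]:
    """Replay the log in order; the last record for each key wins."""
    state: Dict[str, int] = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


def compact(log: List[Record]) -> List[Record]:
    """Keep only the latest record per key, preserving the order of the survivors."""
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [record for i, record in enumerate(log) if last_index[record[0]] == i]


changelog: List[Record] = [
    ("cart-1", 1), ("cart-2", 5), ("cart-1", 3), ("cart-2", None), ("cart-3", 2),
]
compacted = compact(changelog)
```

Replaying the compacted log yields the same state as replaying the full one, only with fewer records to read, which is what makes a compacted topic a practical place to keep a service's state.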
Kafka to stream it all
Stream processing of real-time events has become an essential part of many modern applications. The need to react to real-life events in real time (e.g. a fast reaction to user actions or a response to changes in system metrics) is what makes stream processing so important.
Fig 5 - Using Kafka for stream processing applications
From its foundation, Kafka has been designed to support distributed stream processing as a layer on top of its core primitives. Basically, Kafka provides the necessary infrastructure (logistics, if you will) for implementing data streams that can be processed in a distributed and scalable manner, providing a highly-available, performant and fault-tolerant solution. This is the reason why it is used with so many popular stream-processing frameworks and libraries such as Apache Samza, Apache Flink, Apache Spark Streaming and Apache Storm. There is also a Kafka native library (or API), Kafka Streams, that is part of the Apache Kafka project.
Many use cases that can be considered a fit for Kafka consist of stage-wise processing of data, where stream data is consumed from one Kafka topic (e.g. raw data) and then aggregated, enriched, or otherwise transformed into new Kafka topics for further consumption. The Kafka Streams library is particularly suitable for stateful and staged implementations like these.
Although the end-to-end latency in Kafka-based solutions is relatively low, for example in comparison to Amazon Kinesis, it is not a solution for really low latency scenarios where an event has to be processed within a few milliseconds. Typical end-to-end latency ranges from a few dozen up to a few hundred milliseconds.
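The shape of such a staged pipeline can be sketched in plain Python. This is not Kafka Streams code: the topics are ordinary lists, the event fields and the `user_country` lookup are hypothetical, and the two functions only mimic a stateless enrichment stage followed by a stateful aggregation stage.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List

raw_topic: List[Dict[str, str]] = [
    {"user": "u1", "page": "/home"},
    {"user": "u2", "page": "/pricing"},
    {"user": "u1", "page": "/pricing"},
]
user_country = {"u1": "DE", "u2": "RS"}  # hypothetical lookup data used for enrichment


def enrich(events: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Stage 1: stateless transformation that adds the user's country to every event."""
    for event in events:
        yield {**event, "country": user_country.get(event["user"], "unknown")}


def count_by_page(events: Iterable[Dict[str, str]]) -> Dict[str, int]:
    """Stage 2: stateful aggregation that keeps a running count per page."""
    counts: Counter = Counter()
    for event in events:
        counts[event["page"]] += 1
    return dict(counts)


enriched_topic = list(enrich(raw_topic))       # would be written to an intermediate topic
page_counts = count_by_page(enriched_topic)    # would be written to an output topic
```

In a real deployment each stage would read from and write to its own topic, so the stages can be scaled, restarted and replayed independently of each other.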
Here we have presented some of the most prominent use cases where Kafka is the right tool. They illustrate the claim from the beginning of this post that Kafka is not a tool for a single use case, and that applying it successfully to one problem does not exclude other applications of it, even within the same system/application. Certainly, this is not an exhaustive list of cases where Kafka can be used.
In the upcoming posts, we will cover the presented use cases with a more hands-on approach, in order to address implementation details and to display real-case metrics.
And to conclude, although Kafka is a great tool, it is just a tool. To solve a problem, the paramount requirement is that you understand it deeply. Even if that reveals the fact that Kafka is not a solution - especially then. And if it is, just as in racing, to master a circuit, one simply must know all its cuts and corners by heart.