Apache Kafka is a well-known open-source stream processing platform that aims to provide high-throughput, low-latency, and fault-tolerant handling of real-time data.
So what makes Apache Kafka the go-to platform for real-time data processing? Among the many perks Kafka provides, speed is one of the most important. Let us see how Kafka is built to be so fast.
1. Low-Latency I/O: Data can be stored and cached in one of two places: Random Access Memory (RAM) or disk.
- The conventional way to achieve low latency when delivering messages is to use RAM. It is preferred over disk because disks have high seek times, which makes them slower.
- The downside of this approach is that RAM becomes expensive when the data flowing through your system is around 10 to 500 GB per second or even more.
- Kafka instead relies on the file system and uses a data structure called a 'log', which is an append-only sequence of records ordered by time. The log is essentially a queue: producers append to its end, and subscribers process the messages at their own pace by maintaining their own pointers.
- The first record published gets an offset of 0, the second gets an offset of 1 and so on.
- Consumers read data by accessing the position specified by an offset, and they periodically save their position in a log.
- This also makes Kafka fault-tolerant: if the current consumer instance fails, another consumer can use the stored offsets to pick up the new records. Because the data is laid out sequentially, this approach removes the need for disk seeks, as depicted below:

- Databases typically index their data with tree structures such as B-trees. Kafka, on the other hand, is not a database but a messaging system, and hence it experiences more read/write operations than a database does.
- Using a tree for this workload may lead to random I/O, which results in disk seeks, and those are catastrophic in terms of performance.
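
To make the log and its offsets concrete, here is a minimal in-memory sketch. It is not Kafka's implementation or its client API: the `Log` and `Consumer` classes, the `committed` dictionary, and the `"billing"` group name are all hypothetical, and a plain Python list stands in for the on-disk file. It only illustrates the ideas above: records are appended at the end, each record's offset is its position, and a replacement consumer can resume from the last saved offset.

```python
class Log:
    """Hypothetical append-only log; a record's offset is its index."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Add a record at the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        """Return up to max_records records, starting at the given offset."""
        return self._records[offset:offset + max_records]


class Consumer:
    """Hypothetical consumer that tracks its own position in the log."""

    def __init__(self, log, committed, group):
        self.log = log
        self.committed = committed  # shared store: group name -> next offset
        self.group = group
        # Resume from the last saved offset, or start at 0 if none exists.
        self.position = committed.get(group, 0)

    def poll(self, max_records=10):
        """Read the next batch sequentially and advance the local position."""
        batch = self.log.read(self.position, max_records)
        self.position += len(batch)
        return batch

    def commit(self):
        """Save the current position so another consumer can take over."""
        self.committed[self.group] = self.position


# A producer appends five records; offsets are assigned in order from 0.
log = Log()
offsets = [log.append(f"event-{i}") for i in range(5)]

# The first consumer reads three records and saves its position.
committed = {}
first = Consumer(log, committed, "billing")
seen_by_first = first.poll(max_records=3)
first.commit()

# Suppose the first consumer fails here. A replacement in the same group
# starts from the saved offset instead of re-reading from the beginning.
replacement = Consumer(log, committed, "billing")
seen_by_replacement = replacement.poll()
```

Note that both `append` and `read` only ever touch the end of the log or a contiguous slice of it. That access pattern is what allows a disk-backed log to be written and read sequentially rather than through random seeks.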