As the amount of data available for systems to analyze is increasing by the day, the need for newer faster ways to capture all this data in continuous streams is also arising. Apache Hadoop is possibly one of the most widely used frameworks for distributed storage and processing of Big Data data sets. And with the help of various ingestion tools for Hadoop, it is now possible to capture raw sensor data as binary streams.
Three of the most popular Hadoop ingestion tools include Flume, Kafka, and Kinesis. This post aims at discussing the pros and cons of using each tool – from initial capturing of data to monitoring and scaling.
Before we dive into this further, let us understand what a binary stream is. Most data that becomes available – user logs, logs from IoT devices, etc are streams of text events that are generated by some user action. This data can be broken into chunks based on the event that happened – the user clicks on a button, a setting change, and so on. A binary data stream is one in which instead of breaking down the data stream by events, the data is collected in a continuous stream at a specific rate. The ingest tools in question capture this data and then push out the serialized data to Hadoop.
Flume vs. Kafka vs. Kinesis:
Now, back to the ingestion tools. Both Flume and Kafka are provided by Apache whereas Kinesis is a fully managed service provided by Amazon.
Flume provides many pre-implemented sources for ingestion and also allows custom stream implementations. It provides two implementation patterns, Pollable source, and Event-Driven source. Which one you select depends on what best describes your use case. For scalability a Flume source hands-off messages to a channel. Multiple channels allow horizontal scaling as well.
Flume also allows configuring multiple collector hosts for continued availability in case of a collector failure.
Kafka is gaining popularity in the enterprise space as the ingestion tool to use. A streaming interface on Kafka is called a producer. Kafka also provides many producer implementations and also lets you implement your own interface. With Kafka, you need to build your consumer’s ability to plug into the data – there is no default monitoring implementation.
Scalability on Kafka is achieved by using partitions configured right inside the producer. Data is distributed across nodes in the cluster. A higher throughput requires more number of partitions. The tricky part of this could be selecting the right partition scheme. Generally, metadata from the source is used to partition the streams in a logical manner.
The best thing about Kafka is resiliency via distributed replicas. These replicas do not affect the throughput in any way. Kafka is also a hot favorite among most enterprises.
Kinesis is similar to Kafka in many ways. It is a fully managed service that integrates really well with other AWS services. This makes it easy to scale and process incoming information. Kinesis, unlike Flume and Kafka, only provides example implementations, there are no default producers available.
The one disadvantage Kinesis has over Kafka is that it is a cloud service. This introduces a latency when communicating with an on-premise source compared to the Kafka on-premise implementation.
So Which to Choose – Flume or Kafka of Kinesis:
The final choice of the ingestion tool really depends on your use case. If you want a highly fault-tolerant, DIY solution and can have developers for supporting it, Kafka is definitely the way to go. If you need something which is more out-of-the-box, use Kinesis or Flume. There again, choose wisely depending on how the data will be consumed. Kafka and Kinesis pull data whereas Flume pushes it out using something called data sinks.
There are other players as well like:
Apache Storm – also for data streaming but generally used for shorter terms, maybe an add-on to your existing Hadoop environment
Chukwa (a Hadoop subproject) – devoted to large scale log collection and analysis. It is built on top of HDFS and MapReduce and is highly scalable. It also includes a powerful monitoring toolkit
Streaming data gives a business the opportunity to identify real-time business value. Knowing the big players and which one works best for your use case is a great enabler for you to make the right architectural decisions.