Apache Flink

Flink as a framework overcomes these limitations of Spark Streaming and also supports exactly-once processing, which gives good consistency guarantees. It processes data iteratively, row by row, and is not limited by the constraints of micro-batching as in the case of Spark Streaming. It also supports time-based windowing functions that are very helpful while performing event correlations, all while keeping the processing pipeline very flexible and scalable.
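To make the record-at-a-time model concrete, here is a minimal sketch of a Flink DataStream pipeline in Java. The socket source host and port are hypothetical; the point is that each event flows through the map operator individually instead of waiting for a micro-batch to fill:

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RowByRowDemo {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Each record is transformed the moment it arrives; there is no
            // micro-batch boundary to wait for.
            env.socketTextStream("localhost", 9999)   // hypothetical source
               .map(line -> line.toUpperCase())
               .print();

            env.execute("row-by-row-demo");
        }
    }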

The primary feature that makes Flink different, and very suitable for iterative processing, is its near-real-time processing capability. However, it also supports batch processing. Some of the important features of Flink are as follows:

  1. Exactly-once processing makes it a reliable candidate for performing accurate aggregations while streams are processed. This is generally not the case with Flume. It also supports a checkpointing mechanism that keeps it tolerant to failures.
  2. Out-of-order processing is supported, which gives the streaming layer the ability to process events in the expected order with respect to event time. In a typical multi-threaded environment, events can easily arrive out of order at downstream systems (the sketch after this list shows how Flink handles this with watermarks). This is further elaborated here:

Figure 10: Out of order scenario

  3. Provides out-of-the-box windowing capability for streamed events, not only on the basis of event time but also on the basis of counts and sessions. This is particularly useful when such events need to be categorized or grouped together.
  4. Failure recovery is supported with no data loss, with very lightweight process state management, such that processing resumes from the point of failure:

Figure 11: Apache Flink failure recovery

  5. Flink is proven for high-throughput, low-latency processing. As mentioned earlier, since it is not bound by micro-batching constraints, its latency is very low compared to Spark Streaming, making it one of the most appropriate near-real-time event processing frameworks.
  6. It works with YARN and Mesos as resource managers for scheduling event processing on the available nodes and for failure recovery.
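The first three features above can be seen together in a short sketch, assuming Flink 1.12 or later: it enables exactly-once checkpointing, assigns event-time timestamps with a watermark that tolerates five seconds of out-of-orderness, and counts events per key in tumbling five-second event-time windows. The sensor IDs, timestamps, and interval values are illustrative only:

    import java.time.Duration;

    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.api.common.typeinfo.Types;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    public class EventTimeDemo {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Feature 1: checkpoint every 10 seconds with exactly-once guarantees.
            env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

            // Events are (sensorId, timestampMillis); the third arrives out of order.
            env.fromElements(
                    Tuple2.of("sensor-1", 1_000L),
                    Tuple2.of("sensor-1", 3_500L),
                    Tuple2.of("sensor-1", 2_000L))
               // Feature 2: tolerate up to 5 seconds of out-of-orderness.
               .assignTimestampsAndWatermarks(
                    WatermarkStrategy
                        .<Tuple2<String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((event, ts) -> event.f1))
               .map(event -> Tuple2.of(event.f0, 1L))
               .returns(Types.TUPLE(Types.STRING, Types.LONG))
               .keyBy(event -> event.f0)
               // Feature 3: 5-second tumbling windows based on event time.
               .window(TumblingEventTimeWindows.of(Time.seconds(5)))
               .sum(1)   // counts events per sensor per window
               .print(); // prints (sensor-1,3) once the window fires

            env.execute("event-time-demo");
        }
    }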

Flink is designed and implemented to run on large clusters of nodes. It also supports standalone cluster deployment with dynamic pipeline processing, as shown in this sample execution:

Figure 12: Flink stream processing

If we look at the overall Flink architecture, it is built to support both bounded and unbounded dataset processing, with APIs for both modes. An architecture layout, as depicted on flink.apache.org, can be seen here:

Figure 13: Apache Flink architecture

When we refer to bounded and unbounded datasets, we are typically referring to batch and stream processing respectively. Since Flink is fundamentally a stream processing framework, it can also perform batch operations effectively, as batch data is nothing but a bounded subset of streaming data. Any near-real-time framework can, in general, be leveraged for batch processing. The reverse is not true, however: a pure batch process, such as a Hadoop MapReduce job, cannot play the role of a stream processing framework, since its capabilities are built for batch processing. Even if we minimize the interval between successive batch jobs, there will always be a lag to prepare, process, and load the results of each batch.
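As a small illustration of the "batch is just a bounded stream" view, newer Flink versions (1.12 and later) let the very same DataStream pipeline run in batch execution mode when the input is finite; the input elements here are placeholders:

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BoundedAsStream {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // A bounded dataset is a stream that eventually ends, so the same
            // pipeline can be executed as a batch job by switching the mode.
            env.setRuntimeMode(RuntimeExecutionMode.BATCH);

            env.fromElements("alpha", "beta", "gamma")   // finite (bounded) input
               .map(String::toUpperCase)
               .print();

            env.execute("bounded-as-stream");
        }
    }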

Here are the key differences among the three frameworks that we have discussed in this chapter, namely Flume, Spark Streaming, and Apache Flink:

In addition to parallel processing, complex event processing has also been used very effectively in near-real-time processing, along with Natural Language Processing and machine learning. Some of these algorithms are appropriate for near-real-time execution, while many are not. Any latency in processing affects the overall processing time frames, since such processing is only as fast as the slowest component in the orchestration.

One of the other areas that greatly influences throughput is data compression, which plays a vital role in processing speeds. While compression in a Remote Procedure Call (RPC) may seem to be an overhead from the processing perspective, it saves on costly I/O operations and can provide considerable performance gains across near-real-time processing. It is important to have the right compression codec for such processing. For instance, a simple ZIP-based compression may introduce more lag than performance gain, since it is a transformation-based compression and does not support parallel compression techniques. In this space, Snappy and LZO are more suitable compression codecs that can provide the required performance gains. However, the choice of codec also depends on the support provided by the parallel processing framework being used.
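To see why codec choice matters, here is a small round-trip sketch using the snappy-java library (org.xerial.snappy), assuming Java 11 or later; the payload is fabricated purely for illustration:

    import java.nio.charset.StandardCharsets;

    import org.xerial.snappy.Snappy;

    public class SnappyDemo {
        public static void main(String[] args) throws Exception {
            // A fabricated, highly repetitive event payload.
            byte[] payload = "sensor-1,2017-01-01T00:00:00Z,42.0\n".repeat(1_000)
                                 .getBytes(StandardCharsets.UTF_8);

            long start = System.nanoTime();
            byte[] compressed = Snappy.compress(payload);
            long micros = (System.nanoTime() - start) / 1_000;

            System.out.printf("raw=%d bytes, snappy=%d bytes, %d us%n",
                    payload.length, compressed.length, micros);

            // Round trip to confirm the compression is lossless.
            byte[] restored = Snappy.uncompress(compressed);
            System.out.println("lossless: " + (restored.length == payload.length));
        }
    }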

The output of the Speed Layer is generally captured in the serving layer, which hosts high-performance data repositories; HBase, Elasticsearch, and in-memory caches are some examples. Since this layer serves near-real-time processing, these data technologies also provide a viable means for quick lookups and for reference data purposes.
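As an example of such a serving-layer lookup, the sketch below reads a single row from HBase with the standard client API; the table name, row key, column family, and qualifier are hypothetical and would match whatever the Speed Layer writes:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ServingLayerLookup {
        public static void main(String[] args) throws Exception {
            // Table, row key, column family, and qualifier are hypothetical.
            try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
                 Table table = conn.getTable(TableName.valueOf("event_aggregates"))) {

                Get get = new Get(Bytes.toBytes("sensor-1"));           // point lookup by row key
                Result result = table.get(get);
                byte[] value = result.getValue(Bytes.toBytes("cf"),     // column family
                                               Bytes.toBytes("count")); // qualifier
                System.out.println("count = "
                        + (value == null ? "n/a" : Bytes.toLong(value)));
            }
        }
    }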