Data storage nodes (DataNode)

A DataNode's primary role in a Hadoop cluster is to store data, and jobs are executed as tasks on these same nodes. Tasks are scheduled so that batch processing happens close to the data: wherever possible, a task is allocated to a node that most likely already holds the data it needs. This data locality keeps batch jobs efficient, since processing near the data avoids moving large volumes of it across the network.
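To make the locality idea concrete, the sketch below uses HDFS's FileSystem.getFileBlockLocations API to list which DataNodes hold each block of a file; this is the same placement information the scheduler consults when assigning a task near its data. The class name and input path are illustrative, not part of any Hadoop distribution:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: prints which hosts (DataNodes) hold each block
// of an HDFS file -- the placement data behind data-local scheduling.
public class BlockLocality {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);  // HDFS path supplied by the caller
        FileStatus status = fs.getFileStatus(path);
        BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
    }
}
```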

The following figure shows the inner workings of a typical Hadoop batch process:

Figure 06: MapReduce in action

Here, we see that the job, when initiated, is divided into a number of mapper tasks. How many mappers are spawned typically depends on the block size and the amount of data to be processed: the input is divided into splits, and one mapper is created per split. From a job configuration perspective, one can also cap the number of map tasks that run concurrently, so the job never exceeds that cap. This is very helpful when we want to limit the number of cores that batch jobs can occupy.

As stated before, the block size plays a vital role in the batch process: the unit of work for a mapper is an input split, which by default corresponds to a block, so the job is distributed across mappers with blocks of data as their inputs.
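As a rough sketch of how these knobs are exposed, the snippet below uses FileInputFormat's split-size setters to bound the split size, and hence the mapper count. The concurrent-mapper cap shown relies on the mapreduce.job.running.map.limit property, which to our knowledge is available on newer Hadoop releases; the class name and the specific values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Illustrative sketch: tuning split size (and so the number of mappers)
// and capping concurrent map tasks for a batch job.
public class SplitTuning {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-tuning-demo");
        // One map task is spawned per input split; by default the split
        // size tracks the HDFS block size, so these bounds directly
        // influence how many mappers the job creates.
        FileInputFormat.setMinInputSplitSize(job, 64L << 20);   // 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 256L << 20);  // 256 MB
        // Cap how many map tasks may run at once (assumed available on
        // newer Hadoop versions), bounding the cores the job occupies.
        job.getConfiguration().setInt("mapreduce.job.running.map.limit", 8);
    }
}
```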

At a high level, a typical batch job is executed in the following sequence:

  1. The job driver program, which sets the job context in terms of mapper, reducer, and data format classes, is executed (see the sketch after this list).
  2. The mapper tasks are fed with blocks of data, read from HDFS according to the input splits computed when the job is submitted.
  3. The output produced by the mappers is sorted and shuffled before it is fed to the reducers.
  4. The reducer performs the reduce function on the intermediate data produced by the mappers and stores the output back on HDFS, as per the data formats defined in the job driver program.
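The following is a minimal, self-contained driver sketch in the spirit of the classic word-count example, annotated to show where each of the four steps above occurs. Class and job names are our own; the structure follows the standard org.apache.hadoop.mapreduce API:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // Step 2: each mapper receives a block-sized split of the input
    // and emits (word, 1) pairs.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    ctx.write(word, ONE);   // step 3 happens between map and reduce
                }
            }
        }
    }

    // Step 4: the reducer sums the sorted/shuffled intermediate pairs
    // and writes the result back to HDFS.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    // Step 1: the driver sets the job context -- mapper, reducer,
    // and input/output data formats.
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```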

While this may seem very simple and straightforward, the actual job execution consists of multiple stages. At this point, we only want to provide context for Hadoop batch processing, so we are limiting the information to the level required to understand the concept.

We will be discussing this subject again in much greater detail in later chapters on the batch layer.

The overall expectation of the batch layer in a Lambda Architecture is to provide high-quality, processed data that can be correlated with the near-real-time output of the speed layer, resulting in dependable, consistent information reflected in near real time.