- Data Lake for Enterprises
- Tomcy John Pankaj Misra
- 393 words
- 2021-07-02 22:46:58
Speed layer
The speed layer is the near-real-time processing layer of the Lambda Architecture, where messages/data are processed as soon as they are ingested and the processed data is stored in the storage layer.
Since the primary need for the speed layer is to make data available in near real time, one has to ensure that processing, storage, and data availability all meet the near-real-time expectation.
This is possible only if the processing layer, storage layer, and serving layer all operate at comparable velocities, so that data does not stall at any point in the flow.
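The requirement that all layers keep pace can be illustrated with a little arithmetic. The following is a minimal sketch, not a capacity-planning tool: the stage names and the rates (in messages per second) are hypothetical values chosen for illustration. It assumes the flow can sustain no more than its slowest stage, so any ingest rate above that builds a backlog.

```python
def sustained_throughput(stage_rates):
    """Return the rate of the slowest stage, which bounds the whole flow."""
    return min(stage_rates.values())


def backlog_after(ingest_rate, stage_rates, seconds):
    """Estimate how many messages pile up after `seconds` of steady ingest.

    Assumes constant rates; real systems see bursts and variable latency.
    """
    surplus = ingest_rate - sustained_throughput(stage_rates)
    return max(0, surplus) * seconds


if __name__ == "__main__":
    # Hypothetical per-layer capacities in messages per second.
    rates = {"processing": 5000, "storage": 2000, "serving": 4000}
    print(sustained_throughput(rates))
    print(backlog_after(3000, rates, 60))
```

In this sketch the storage layer is the bottleneck, so speeding up processing alone would not reduce the backlog.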
One of the earliest streaming approaches was Flume with HDFS, which solved part of the problem; however, it constrained the overall solution to converting data into logs, which were then ingested into HDFS with almost no processing. The processing was ultimately done by batch processes that were not real time in nature.
It was soon realized that reliance on Hadoop batch processing would not meet the expectation of near-real-time processing; hence, separate frameworks were built that specialized in near-real-time processing, and these frameworks constitute the speed layer in Lambda Architecture.
The initial frameworks were standalone and did not integrate well into the Hadoop ecosystem; however, as usage grew and these capabilities matured, it became evident that the systems needed to be integrated so that operability and manageability were simplified.
Figure 09: Near Real-Time Processing Pipeline
The mechanism implemented by these frameworks was the same as that of Hadoop, that is, the Map-Reduce paradigm; however, it was implemented for real-time processing. Every framework had its own way of handling streaming data and resource management. Many of these frameworks were built on fast in-memory messaging capabilities, which were a very effective way to decouple one component from another in real-time processing while keeping latency minimal.
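To make the idea concrete, here is a minimal sketch of Map-Reduce applied to a stream, with an in-memory queue decoupling the ingesting component from the processing component. It does not reproduce any particular framework: the function names, the word-count workload, the micro-batch size, and the end-of-stream sentinel are all assumptions made for illustration, and Python's standard `queue.Queue` stands in for the in-memory messaging layer.

```python
import queue
import threading
from collections import Counter

_SENTINEL = object()  # hypothetical end-of-stream marker


def produce(messages, channel):
    """Ingestion side: push each message onto the in-memory channel."""
    for msg in messages:
        channel.put(msg)  # blocks while the bounded channel is full
    channel.put(_SENTINEL)


def map_phase(message):
    """Map: emit a (word, 1) pair for every word in the message."""
    return [(word.lower(), 1) for word in message.split()]


def reduce_phase(pairs, totals):
    """Reduce: fold the emitted pairs into the running totals."""
    for key, value in pairs:
        totals[key] += value


def consume(channel, batch_size=2):
    """Processing side: drain the channel in small micro-batches."""
    totals = Counter()
    batch = []
    while True:
        msg = channel.get()
        if msg is _SENTINEL:
            break
        batch.append(msg)
        if len(batch) == batch_size:
            for item in batch:
                reduce_phase(map_phase(item), totals)
            batch = []
    for item in batch:  # flush the final partial batch
        reduce_phase(map_phase(item), totals)
    return totals


def run_pipeline(messages, batch_size=2):
    """Run producer and consumer concurrently over a bounded channel."""
    channel = queue.Queue(maxsize=4)
    producer = threading.Thread(target=produce, args=(messages, channel))
    producer.start()
    totals = consume(channel, batch_size)
    producer.join()
    return dict(totals)


if __name__ == "__main__":
    print(run_pipeline(["error disk full", "warn disk slow", "error net down"]))
```

The producer and consumer share nothing except the channel, so either side can be replaced or scaled without changing the other. The bounded queue also means a slow consumer eventually blocks the producer rather than letting memory grow without limit.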
Real-time processing often depended on look-up and reference data; hence, a very fast data layer was needed so that fetching look-up or reference data did not adversely impact the real-time nature of the processing. Some of the NoSQL technologies played that role well; we will discuss them in later sections and chapters.
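The role of a fast look-up layer can be sketched as follows. This is an illustration only: `ReferenceStore` is a hypothetical in-memory stand-in for a low-latency key-value (NoSQL) store, and the event fields (`device_id`, `region`) are invented for the example. The point is that each event is enriched with a single keyed read, so the look-up cost stays small and predictable.

```python
class ReferenceStore:
    """Hypothetical stand-in for a low-latency key-value (NoSQL) store."""

    def __init__(self, records):
        self._records = dict(records)

    def get(self, key, default=None):
        """Single keyed read, mirroring a typical key-value store lookup."""
        return self._records.get(key, default)


def enrich(event, store):
    """Return a copy of the event with reference data attached.

    Unknown devices fall back to a placeholder so the stream never stalls
    on missing reference data.
    """
    reference = store.get(event["device_id"], {"region": "unknown"})
    return {**event, "region": reference["region"]}


if __name__ == "__main__":
    store = ReferenceStore({"d1": {"region": "eu-west"}})
    print(enrich({"device_id": "d1", "temp": 21}, store))
    print(enrich({"device_id": "d9", "temp": 30}, store))
```

In a real deployment the dictionary would be replaced by a client for the chosen store; the enrichment function itself would not need to change as long as the store offers a keyed read.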