Book: Data Lake for Enterprises
Authors: Tomcy John, Pankaj Misra

Data storage layer - store all data

The data storage layer plays a central role in the Lambda Architecture pattern because it determines how quickly the overall solution can react to incoming event/data streams. A chain of connected systems is only as fast as its slowest link. Hence, if the storage layer is not fast enough, the operations performed by the near-real-time processing layer will be slow, undermining the near-real-time nature of the architecture.
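The slowest-link argument can be illustrated with a minimal sketch. The stage names and rates below are hypothetical, chosen only for illustration; they are not measurements from any real system.

```python
def pipeline_throughput(stage_rates):
    """Sustained throughput of a serial pipeline is bounded by its slowest stage."""
    return min(stage_rates.values())


# Hypothetical rates in events per second (illustrative assumptions only).
rates = {
    "ingest": 50_000,
    "storage_write": 8_000,
    "stream_processing": 30_000,
}

# The stage with the lowest rate is the bottleneck for the whole chain.
bottleneck = min(rates, key=rates.get)
print(bottleneck, pipeline_throughput(rates))
```

Under these assumed numbers the storage write stage caps the whole chain, no matter how fast ingestion or stream processing are.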

In the overall Lambda Architecture, there are broadly two kinds of active operations on the ingested data: batch processing and near-real-time processing. Their data needs are very different. Batch processing, in most cases, needs sequential read and write operations, for which a Hadoop storage layer may suffice. Near-real-time processing, however, needs quick lookups and quick writes, for which Hadoop storage may not be the right fit. To support near-real-time processing, the data layer must provide some kind of indexed data storage.
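The difference between the two access patterns can be sketched in a few lines. This is a simplified, in-memory illustration under stated assumptions: `AppendOnlyLog` and `IndexedStore` are hypothetical classes standing in for a sequential store and an indexed store, and they are not the API of Hadoop or of any other product.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Record = Tuple[str, dict]


class AppendOnlyLog:
    """Batch-style storage: cheap sequential appends, lookups require a scan."""

    def __init__(self) -> None:
        self._records: List[Record] = []

    def append(self, key: str, value: dict) -> None:
        self._records.append((key, value))

    def scan(self) -> Iterator[Record]:
        # Sequential read of every record, in write order.
        yield from self._records

    def find(self, key: str) -> Optional[dict]:
        # O(n): the whole log is read; the latest value for the key wins.
        result = None
        for k, v in self._records:
            if k == key:
                result = v
        return result


class IndexedStore:
    """Near-real-time style storage: keyed writes and direct lookups."""

    def __init__(self) -> None:
        self._index: Dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        self._index[key] = value

    def get(self, key: str) -> Optional[dict]:
        # Average O(1): the hash index avoids scanning.
        return self._index.get(key)


log, store = AppendOnlyLog(), IndexedStore()
for i in range(1000):
    event = {"seq": i}
    log.append(f"device-{i % 10}", event)
    store.put(f"device-{i % 10}", event)

# Both return the latest event for the key, at very different cost.
print(log.find("device-3"), store.get("device-3"))
```

Both calls return the same record, but the log reads all 1000 entries to find it while the indexed store goes straight to the key. That cost gap is what makes a purely sequential store a poor fit for lookups on the near-real-time path.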

Typical specifications for a storage layer in a Lambda Architecture can be summarized as follows:

Must support both sequential and random access operations
Must be tiered according to usage pattern, with an appropriate data solution for each tier
Must be able to handle large volumes of data for both batch and near-real-time processing
Must be flexible and scalable enough to store multiple data structures
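The specifications above can be sketched as a small tiered store. This is a toy, in-memory illustration: `TieredStorage`, its method names, and the sample keys are assumptions made for this example, and a real deployment would back each tier with a suitable storage product.

```python
class TieredStorage:
    """Toy two-tier store: a sequential batch tier plus an indexed serving tier."""

    def __init__(self):
        self.batch_tier = []    # append-only, read sequentially by batch jobs
        self.serving_tier = {}  # keyed, random access for near-real-time use

    def write(self, key, record):
        # Every record lands in both tiers; records may have any structure.
        self.batch_tier.append((key, record))
        self.serving_tier[key] = record

    def lookup(self, key):
        # Random access: latest record for a key.
        return self.serving_tier.get(key)

    def scan(self):
        # Sequential access: full history in write order.
        return iter(self.batch_tier)


store = TieredStorage()
store.write("user:1", {"name": "Ann", "clicks": 3})
store.write("sensor:7", {"temp_c": 21.5})  # a differently shaped record
store.write("user:1", {"name": "Ann", "clicks": 4})

latest = store.lookup("user:1")
history = [rec for key, rec in store.scan() if key == "user:1"]
print(latest, len(history))
```

The serving tier answers lookups with the latest value, while the batch tier keeps the full history for sequential processing. Scaling to large volumes is outside what an in-memory sketch can show.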