Data storage layer - store all data

The data storage layer is central to the Lambda Architecture pattern because it determines how quickly the overall solution can react to incoming event/data streams. A chain of connected systems is only as fast as its slowest member. Hence, if the storage layer is not fast enough, the operations performed by the near-real-time processing layer slow down, which undermines the near-real-time nature of the architecture.
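The slowest-link argument can be illustrated with a minimal sketch. The stage names and event rates below are made-up figures for illustration only, not measurements of any real system; the point is simply that the sustained rate of stages connected in series is bounded by the minimum stage rate.

```python
def pipeline_throughput(stage_rates):
    """Return the sustained rate (events/second) of stages connected in series.

    A serial chain cannot move events faster than its slowest stage,
    so the result is the minimum of the individual stage rates.
    """
    if not stage_rates:
        raise ValueError("at least one stage is required")
    return min(stage_rates.values())


def bottleneck(stage_rates):
    """Return the name of the stage that limits the whole chain."""
    return min(stage_rates, key=stage_rates.get)


# Hypothetical rates in events/second (assumed values, for illustration only).
rates = {
    "ingestion": 50_000,
    "storage": 8_000,
    "near_real_time_processing": 30_000,
}

print(pipeline_throughput(rates))
print(bottleneck(rates))
```

With these assumed numbers the storage stage caps the whole chain, however fast the processing layer is on its own.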

In the overall Lambda Architecture, there are broadly two kinds of active operations on the ingested data: batch processing and near-real-time processing. Their data needs are very different. For instance, batch processing, in most cases, needs serial read and serial write operations, for which a Hadoop storage layer may suffice. Near-real-time processing, however, needs quick lookups and quick writes, so Hadoop storage may not be the right fit. To support near-real-time processing, the data layer must provide some form of indexed data storage.
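The difference between the two access patterns can be sketched in a few lines. The classes below are hypothetical in-memory stand-ins, not the API of Hadoop or of any real datastore: `AppendOnlyLog` models serial writes and serial reads, while `IndexedStore` models keyed storage with direct lookups.

```python
class AppendOnlyLog:
    """Hypothetical stand-in for serial storage: append and full scan only."""

    def __init__(self):
        self._records = []

    def append(self, key, value):
        # Serial write: new records always go to the end.
        self._records.append((key, value))

    def scan(self):
        # Serial read: visit every record in write order.
        yield from self._records

    def find_latest(self, key):
        # A lookup has to scan everything, so its cost grows with data size.
        latest = None
        for k, v in self._records:
            if k == key:
                latest = v
        return latest


class IndexedStore:
    """Hypothetical stand-in for indexed storage: direct access by key."""

    def __init__(self):
        self._index = {}

    def put(self, key, value):
        # Random write: update the entry for this key in place.
        self._index[key] = value

    def get(self, key):
        # Random read: a hash lookup, independent of how many keys exist.
        return self._index.get(key)


log, store = AppendOnlyLog(), IndexedStore()
for i in range(1000):
    log.append(f"sensor-{i % 10}", i)
    store.put(f"sensor-{i % 10}", i)

# Both return the same answer; only the amount of work differs.
print(log.find_latest("sensor-3"), store.get("sensor-3"))
```

A batch job that aggregates over `scan()` is well served by the first structure, whereas a near-real-time lookup of the latest value per key is the case the second structure is meant for.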

Typical requirements for a storage layer in a Lambda Architecture can be summarized as follows:

  • Must support both serial and random operations
  • Must be tiered by usage pattern, with an appropriate data solution for each tier
  • Must be able to handle large volumes of data for both batch and near-real-time processing
  • Must be flexible and scalable enough to store multiple data structures
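One way to picture these requirements together is a small two-tier store. This is only a sketch under stated assumptions: `TieredStore`, its method names, and the `hot_capacity` parameter are hypothetical, and a real deployment would back each tier with a dedicated storage system rather than in-memory structures.

```python
from collections import OrderedDict


class TieredStore:
    """Hypothetical two-tier store, for illustration only.

    batch_tier: append-only list, suited to serial writes and full scans.
    hot_tier:   bounded keyed index, suited to quick random reads and writes.
    """

    def __init__(self, hot_capacity=1000):
        self.batch_tier = []
        self.hot_tier = OrderedDict()
        self.hot_capacity = hot_capacity

    def write(self, key, record):
        # Every record lands in the batch tier, so all data is retained.
        self.batch_tier.append((key, record))
        # The hot tier keeps only the most recently written keys.
        self.hot_tier[key] = record
        self.hot_tier.move_to_end(key)
        if len(self.hot_tier) > self.hot_capacity:
            self.hot_tier.popitem(last=False)  # evict the oldest key

    def read_recent(self, key):
        # Random read served by the indexed tier; None if evicted or unknown.
        return self.hot_tier.get(key)

    def read_all(self):
        # Serial read over everything, as a batch job would do.
        return iter(self.batch_tier)


tiered = TieredStore(hot_capacity=2)
# Records may have different shapes; the store does not impose a schema.
tiered.write("a", {"temperature": 21.5})
tiered.write("b", ["click", "scroll"])
tiered.write("c", "raw log line")

print(tiered.read_recent("a"))       # evicted from the hot tier
print(tiered.read_recent("c"))       # still in the hot tier
print(len(list(tiered.read_all())))  # batch tier keeps every record
```

The design choice here is that the batch tier is the complete record of all data, while the hot tier is a disposable index sized for the near-real-time access pattern.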