Lambda architecture

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate views of batch data, while simultaneously using real-time stream processing to provide views of online data. The two view outputs may be joined before presentation. The rise of lambda architecture is correlated with the growth of big data, real-time analytics, and the drive to mitigate the latencies of map-reduce 

Lambda architecture consists of three levels:

  1. The batch layer is used for: (i) managing the master dataset (an immutable, append-only set of raw data), and (ii) to pre-compute the batch views.
  2. The serving layer indexes the batch views so that they can be queried in low-latency, ad-hoc way.
  3. The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.

Each of these layers can be realized using various big data technologies. For instance, the batch layer datasets can be in a distributed filesystem, while MapReduce can be used to create batch views that can be fed to the serving layer. The serving layer can be implemented using NoSQL technologies such as Casandra, while querying can be implemented by technologies such as Apache Drill or Impala. Finally, the speed layer can be realized with data streaming technologies such as Apache Storm or Spark Streaming.

How and where does it work?

Lambda Architecture is a useful framework to think about designing big data applications, for example distributed data processing systems at Twitter.

Unified Lambda

The downside of λ is its inherent complexity. Keeping  in sync two already complex distributed systems is quite an implementation and maintenance challenge. People have started to look for simpler alternatives that would bring just about the same benefits and handle the full problem set. There are basically three approaches:

1) Adopt a pure streaming approach, and use a flexible framework such as Apache Samza to provide some type of batch processing. Although its distributed streaming layer is pluggable, Samza typically relies on Apache Kafka. Samza’s streams are replayable, ordered partitions. Samza can be configured for batching, i.e. consume several messages from the same stream partition in sequence.

2) Take the opposite approach, and choose a flexible Batch framework that would also allow micro-batches, small enough to be close to real-time, with Apache Spark/Spark Streaming or Storm’s Trident. Spark streaming is essentially a sequence of small batch processes that can reach latency as low as one second.Trident is a high-level abstraction on top of Storm that can process streams as small batches as well as do batch aggregation.

3) Use a technology stack already combining batch and real-time, such as Spring “XD”, Summingbirdor Lambdoop. Summingbird (“Streaming MapReduce”) is a hybrid system where both batch/real-time workflows can be run at the same time and the results merged automatically.The Speed layer runs on Storm and the Batch layer on Hadoop, Lambdoop (Lambda-Hadoop, with HBase, Storm and Redis) also combines batch/real-time by offering a single API for both processing paradigms:

lambdoop

The integrated approach (unified λ) seeks to handle Big Data’s Volume and Velocity by featuring a hybrid computation model, where both batch and real-time data processing are combined transparently. And with a unified framework, there would be only one system to learn, and one system to maintain.