
In big-data circles, the Lambda Architecture has become a well-established practice. In short, when faced with a data stream, the Lambda Architecture prescribes that we process the data using two parallel paths:

  • A “store the data and batch process it later” path – the slow lane
  • A near-realtime processing path – the fast lane

It’s usually represented something like this:

 

The driving idea here is that your full analytics processing is probably too much to accomplish in near-realtime, so you implement a lightweight/simplified set of processing in the fast lane, and your full application logic goes in the slow lane.
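To make the two lanes concrete, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: the `LambdaPipeline` class, the `user` and `amount` fields, and the choice of "event count" as the light fast-lane work and "total spend" as the full slow-lane logic. A real system would put a message queue and a durable store where the Python list and counter are.

```python
from collections import Counter


class LambdaPipeline:
    """Toy Lambda setup (hypothetical): every event goes down both lanes."""

    def __init__(self):
        self.raw_log = []             # slow lane: store now, process later
        self.fast_counts = Counter()  # fast lane: cheap running aggregate

    def ingest(self, event):
        self.raw_log.append(event)            # slow lane keeps the full event
        self.fast_counts[event["user"]] += 1  # fast lane does only the light work

    def run_batch(self):
        """Full logic, recomputed over the whole stored dataset."""
        totals = {}
        for event in self.raw_log:
            user = event["user"]
            totals[user] = totals.get(user, 0.0) + event["amount"]
        return totals


pipeline = LambdaPipeline()
for ev in [
    {"user": "a", "amount": 5.0},
    {"user": "b", "amount": 2.5},
    {"user": "a", "amount": 1.5},
]:
    pipeline.ingest(ev)
```

Note that the same events are handled by two separate code paths, which is where the maintenance cost of this style comes from.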

There are certainly some cases where this is necessary. For example, some Machine Learning algorithms are amenable to online learning, where the model is updated incrementally as new data arrives, but many are not. In those cases where the model must be rebuilt in batch against the whole dataset, a Lambda-style architecture is unavoidable.
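The difference between the two kinds of computation can be sketched with plain statistics standing in for a model (an assumption for illustration; neither is a real ML algorithm). A running mean and variance can be updated one value at a time, here using Welford's method, while an exact median needs the whole dataset in hand. `RunningStats` and `batch_median` are hypothetical names.

```python
class RunningStats:
    """Incrementally updatable: each new value adjusts the state in O(1)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's online update for mean and variance
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 before any data has arrived
        return self._m2 / self.n if self.n else 0.0


def batch_median(values):
    """Not incrementally updatable in exact form: needs every value, sorted."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
stats = RunningStats()
for value in data:
    stats.update(value)  # fits the fast lane: one value at a time
median = batch_median(data)  # fits the slow lane: whole dataset at once
```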

What’s flawed about all this, at least from my POV, is the presumption that a slow lane is necessary. Streaming data engines have gotten much more capable over the years. Starting with Apache Storm, CEP-style stream processing got much more practical. Then with Spark Streaming, it became possible to share code between batch and streaming processing, which only added fuel to the Lambda fire.

In practice I like an attitude I’d call Streaming-First. First, to the extent that it’s practical, implement all of your data processing in the fast lane. I believe that more can be achieved on a streaming basis than people tend to think.

Second, it should almost never be necessary to build a “light” and a “complete” version of the same processing, as prescribed in some versions of the Lambda story. Instead, try this: process, enrich, and store the data in near-realtime, and if there is work that is prohibitively slow to do while streaming, move only that part to workflow-triggered batch jobs that follow behind and operate on the already-enriched data. This reduces complexity and prepares you well for a future where pretty much everything will be done on a streaming basis.
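Here is a rough sketch of that shape, under stated assumptions: the event fields, the `GEO_LOOKUP` table, and the "distinct users per country" job are all made up for illustration, and a Python list stands in for the data store. The point is that enrichment happens exactly once, in the stream, and the follow-up job reads the enriched records rather than reprocessing raw events.

```python
# Hypothetical reference data used for enrichment
GEO_LOOKUP = {"10.0.0.1": "US", "10.0.0.2": "DE"}


def enrich(event, geo_lookup):
    """Streaming step: attach a country to the event without mutating it."""
    enriched = dict(event)
    enriched["country"] = geo_lookup.get(event["ip"], "unknown")
    return enriched


def on_event(event, store, geo_lookup):
    """Fast lane: process, enrich, and store each event as it arrives."""
    store.append(enrich(event, geo_lookup))


def followup_batch(store):
    """Only the slow part, run later against already-enriched data.

    Stand-in for an expensive job: distinct users per country.
    """
    users_by_country = {}
    for record in store:
        users_by_country.setdefault(record["country"], set()).add(record["user"])
    return {country: len(users) for country, users in users_by_country.items()}


store = []  # stand-in for the real data store
for ev in [
    {"user": "u1", "ip": "10.0.0.1"},
    {"user": "u2", "ip": "10.0.0.1"},
    {"user": "u1", "ip": "10.0.0.2"},
    {"user": "u3", "ip": "10.9.9.9"},
]:
    on_event(ev, store, GEO_LOOKUP)

report = followup_batch(store)
```

In a real deployment the batch function would be kicked off by a workflow scheduler once a window of enriched data has landed; that trigger is left out here.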

Perhaps I should introduce myself, mmmmh?

I’m Jeff Nadler, and I have been developing software for companies in the Pacific Northwest for 20 years now. In the early days of my career I did applications for traditional relational DBs like Oracle and Sybase, with programming in C / C++ / Java, and some data warehousing projects using ROLAP star-schema and MOLAP tools.

For the past 7 years or so I’ve been focused on scalable distributed systems based on NoSQL data stores and “big data” systems. I wish there were a better term than “big data” that adequately communicated the concept, but it’s reasonably easy to define: Big Data is any dataset that is hopelessly too large to process on a single server (or cloud instance).

As a result, you will need a distributed system to process the data, and that brings some new challenges but also the opportunity to architect a system that can grow with your business by adding servers (‘horizontal scaling’).
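One common way to spread work across servers is to partition records by key, so that each worker handles its own slice of the data. The sketch below is only an illustration of that idea: `partition_for` and `distribute` are hypothetical helpers, and the `key` field is assumed. It uses a stable hash from `hashlib` because Python's built-in `hash()` for strings varies between processes.

```python
import hashlib


def partition_for(key, num_workers):
    """Map a string key to a worker index in a way that is stable across runs."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def distribute(records, num_workers):
    """Group records into one bucket per worker, based on each record's key."""
    buckets = [[] for _ in range(num_workers)]
    for record in records:
        buckets[partition_for(record["key"], num_workers)].append(record)
    return buckets


records = [{"key": f"user-{i % 10}", "value": i} for i in range(100)]
buckets = distribute(records, num_workers=4)
```

All records with the same key land on the same worker, which is what lets per-key state stay local. Plain modulo hashing reshuffles most keys when the worker count changes, so systems that resize often tend to use a fixed partition count or consistent hashing instead.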

Lately I’ve been working on projects that emphasize streaming big data or “fast data”, where a nonstop stream of data arrives on a scalable message queue like Kafka or AWS’s Kinesis. From there, the data is processed using applications that run on Storm, Spark, Flink, or Google Cloud Dataflow, and finally the data is persisted to a suitable data store.

Here’s what the architecture looks like in general:

 

 

Of course, even with the right architecture for the project, there are still lots of problems to solve. And lots of posts to come.