This post is about Apache Storm and How to install Apache Storm 0.9.4 (latest version) on Windows 7 & Ubuntu 14.04.2 with an Hello World program.

Overview of Apache Storm

Apache Storm is a free and open source distributed system for real-time computation. Storm is implemented in Clojure & Storm APIs are written in Java by Nathan Marz & team at BackType. Storm is best for distributed processing where unbounded streams of real time data computation requires. In contrast to batch systems like Hadoop, which often introduce delay of hours, Storm allows us to process online data.Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.

About Nathan Marz

Nathan Marz was the lead engineer on Twitter's Publisher Analytics team. Previously Nathan was the lead engineer of BackType which was acquired by Twitter in July of 2011. On march 2013 he left Twitter to start-up his own company He is a major believer in the power of open source and has authored some significant open source projects, including Cascalog, ElephantDB, and Storm. He writes a blog at http://nathanmarz.com.

A Brief History of Apache Storm

In 2010 Apache Storm was a nothing than an idea of Nathan Marz, But now storm has been adopted by many of the world's largest companies including Yahoo!, Twitter, Microsoft and many more. Before Storm, BackType built an analytic product to help businesses understand their impact on social media both historically and real time using Queues and Workers approach. Oftentimes these workers would send messages through another set of queues to another set of workers for further processing. Where most portion of code are for sending/receiving messages and serializing/deserializing messages and with small portion of actual business logic.
In December 2010, Nathan Marz realized the complexity of product and come up with idea of "Stream" as distributed abstraction with Top-level abstraction "Topology", a network of spouts and bolts. Spout produces brand new streams and Bolt takes in streams as input by subscribe to whatever streams they need to process and produces streams as output. The key insight is that spout and bolt are inherently parallel. Then he tested this abstraction against his test case and he wanted to validate large set of use cases, so he tweeted as
"I'm working on new kind of stream processing system. If this sounds interesting to you, ping me. I want to learn your use cases"
Most of people responded with their use-cases, Then he Initially started designing Storm with RabbitMQ for Intermediate messages. Then without intermediate messages, Nathan Marz developed an algorithm based on random numbers and XORs that would only require 20bytes to track each spout tuple to solve massage processing guarantee. Unlike Hadoop Zombie processes which would not shut down at idle, Storm never has this Zombie process issue and Storm has been designed to be "Process Fault-Tolerant". Storm has bean open sourced on July 2011 and graduated to a top-level Apache project on September 17th, 2014.

Storm Application

A Storm application is designed as a topology in the shape of DAG(Directed ACyclic Graph) with Spouts and Bolts acting as graph vertices. Edges on the graph are data stream between nodes.Here are some typical “prevent” and “optimize” use cases for Storm.

“Prevent” Use Cases “Optimize” Use Cases
Financial   Services
  • Securities fraud
  • Operational risks & compliance violations
  • Order routing
  • Pricing
Telecom
  • Security breaches
  • Network outages
  • Bandwidth allocation
  • Customer service
Retail
  • Shrinkage
  • Stock outs
  • Offers
  • Pricing
Manufacturing
  • Preventative maintenance
  • Quality assurance
  • Supply chain optimization
  • Reduced plant downtime
Transportation
  • Driver monitoring
  • Predictive maintenance
  • Routes
  • Pricing
Web
  • Application failures
  • Operational issues
  • Personalized content

Characteristics of Storm

  • Fast – bench-marked as processing one million 100 byte messages per second per node
  • Scaleable – with parallel calculations that run across a cluster of machines
  • Fault-tolerant – when workers die, Storm will automatically restart them. If a node dies, the worker will be restarted on another node.
  • Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once. Messages are only replayed when there are failures.
  • Easy to operate – standard configurations are suitable for production on day one. Once deployed, Storm is easy to operate

Use cases of Storm

  • Processing Streams (No need of intermediate Queues)
  • Continuous Computation
  • Distributed RPC

Spout

A Spout is a source of streams in a computation. Spout can read from queue like Kafka , can generate its own stream or read Twitter streaming.

Bolt

A bolt processes any number of input streams and produces any number of new output streams.Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.

Topology

  • A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. A topology is an arbitrarily complex multi-stage stream computation. Topologies run indefinitely when deployed.
  • A topology can have one spout and multiple bolt
  • A topology is arrangement of spout and bolt

Understanding Storm Architecture

Storm cluster has 3 sets of nodes

1. Nimbus Node

  • Uploads computations for execution
  • Distributes code across the cluster
  • Launches workers across the cluster
  • Monitors computations and reallocates workers as needed

2. Zookeeper nodes

  • Coordinates the storm cluster
  • Zookeeper is a distributed coordination service going to installed on each and every machine
  • Job tracker
  • When any failure, zookeeper will assist to nimbus and nimbus will re invoke to supervisor

3. Supervisor nodes

  • Communicates with nimbus through zookeeper, starts and stops workers according to signals from nimbus.
  • Supervisor actual execution happens at supervisor and only start and stop signals one set by nimbus
  • Task Tracker

Storm topology

  • The work is delegated to different types of components that are each responsible for a simple specific processing task
  • Input stream of a storm cluster is handled by a component called a spout
  • The spout pass the data to a component called  a bolt which transforms it in some way
  • A boult either persists data in some sort of storage or pass it to some other bolt
  • Only one nimbus in a cluster, Single point of failure

Five Key abstractions help to understand how storm processes data

  • Tuple: An ordered list of elements eg: (7,5,6,9)
  • Stream: An unbounded sequence of tuples
  • Spouts: Source of streams in a computations
  • Bolts:Process input stream and produce output streams they can run functions filters aggregates or join data or talk to databases
  • Topology:overall calculation, represented visually as a network of spouts and bolts (as in the following diagram)

Nimbus

  • Distribute the code & lunch workers across the cluster
  • Real-time Analytics by Storm in Hadoop Framework
  • We can process the real time data with the help of queue streaming and Distributed Processing

How Apache Storm Works?

Consider we having a task like processing 2,00,000 lines text file on a machine, here to process a single record, it takes 6 seconds, so it will take(2,00,000*6) 12,00,000 seconds. Now we can see the real problem of long execution time.
How can we reduce the processing time? By creating two 1,00,000 lines text file and executing 2 processes, we can reduce the processing time to 6,00,000. So by parallel processing we can acheive this. But by doing more parallel processing on the single machine will not be effective. That is how distributed processing given solution to this problem by Running 10,000 lines text file on 20 machines, we can process the entire data in effecient time.

What is Lambda Architecture

Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. Find more details at http://lambda-architecture.net

Installing Apache Storm on Windows 7

Storm installation is not actually required to run on Local standalone mode of Storm submission
  1. Install Java, where note the java path does not contain any spaces
  2. Install Zookeeper by downloading from its official website
  3. Extract to c:\zookeeper
  4. set PATH environment variable for zookeeper bin path
  5. configure zoo.cfg
  6. start zookeeper
  7. Download storm and extract to storm at C:/storm
  8. set PATH environment variable for Storm bin path
  9. configure storm.yaml
For more reference on setting up storm cluster or installing storm on Ubuntu refer: https://storm.apache.org/documentation/Setting-up-a-Storm-cluster.html

Different Spouts

  • From DB
  • Queue
  • API
  • From text file

Different Tuple States

  • ASK-Storm fully processed
  • ACTIVATE-Deactivation to Activation
  • CLOSE-Spout Shutdown
  • DEACTIVATE-Deactivation
  • FAIL-Failed to process fully
  • NEXT TUPLE-request for next tuple emit to output collector
  • OPEN-Initialization