This post is about Apache Storm and how to install Apache Storm 0.9.4 (the latest version at the time of writing) on Windows 7 and Ubuntu 14.04.2, with a Hello World program.
Overview of Apache Storm
Apache Storm is a free and open-source distributed system for real-time computation. Storm is implemented in Clojure, and its APIs were written in Java by Nathan Marz and his team at BackType. Storm is best suited for distributed processing of unbounded streams of real-time data. In contrast to batch systems like Hadoop, which often introduce delays of hours, Storm lets us process data as it arrives. Storm is extremely fast, with the ability to process over a million records per second per node on a cluster of modest size.

About Nathan Marz
Nathan Marz was the lead engineer on Twitter's Publisher Analytics team. Previously he was the lead engineer at BackType, which was acquired by Twitter in July 2011. In March 2013 he left Twitter to found his own company. He is a major believer in the power of open source and has authored several significant open-source projects, including Cascalog, ElephantDB, and Storm. He writes a blog at http://nathanmarz.com.
A Brief History of Apache Storm
In 2010 Apache Storm was nothing more than an idea of Nathan Marz's, but Storm has since been adopted by many of the world's largest companies, including Yahoo!, Twitter, Microsoft, and many more. Before Storm, BackType built an analytics product to help businesses understand their impact on social media, both historically and in real time, using a queues-and-workers approach. Often these workers would send messages through another set of queues to another set of workers for further processing. Most of the code dealt with sending/receiving and serializing/deserializing messages, with only a small portion devoted to actual business logic.
In December 2010, Nathan Marz recognized the complexity of the product and came up with the idea of a "stream" as a distributed abstraction, with the top-level abstraction being a "topology": a network of spouts and bolts. A spout produces brand-new streams, while a bolt takes streams as input by subscribing to whatever streams it needs to process, and produces streams as output. The key insight is that spouts and bolts are inherently parallel. He tested this abstraction against his own use case, and to validate it against a larger set of use cases, he tweeted:
"I'm working on new kind of stream processing system. If this sounds interesting to you, ping me. I want to learn your use cases"
Many people responded with their use cases. He initially designed Storm around RabbitMQ for intermediate messages, then eliminated intermediate queues altogether by developing an algorithm based on random numbers and XORs that requires only 20 bytes to track each spout tuple, solving the message-processing guarantee. Unlike Hadoop, whose zombie processes would not shut down when idle, Storm was designed to be process fault-tolerant and has no such zombie-process issue. Storm was open-sourced in July 2011 and graduated to a top-level Apache project on September 17th, 2014.
Storm Application
A Storm application is designed as a topology in the shape of a DAG (Directed Acyclic Graph), with spouts and bolts acting as graph vertices and edges representing data streams between nodes. Here are some industries with typical "prevent" and "optimize" use cases for Storm.
- Financial Services
- Telecom
- Retail
- Manufacturing
- Transportation
- Web
Characteristics of Storm
- Fast – benchmarked as processing one million 100-byte messages per second per node
- Scalable – with parallel calculations that run across a cluster of machines
- Fault-tolerant – when workers die, Storm automatically restarts them; if a node dies, its workers are restarted on another node
- Reliable – Storm guarantees that each unit of data (tuple) will be processed at least once or exactly once; messages are only replayed when there are failures
- Easy to operate – standard configurations are suitable for production on day one, and once deployed, Storm is easy to operate
Use cases of Storm
- Stream processing (no need for intermediate queues)
- Continuous computation
- Distributed RPC
Spout
A spout is a source of streams in a computation. A spout can read from a queue such as Kafka, generate its own stream, or consume the Twitter streaming API.
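A minimal spout sketch, assuming Storm 0.9.x on the classpath (that era used the `backtype.storm` packages); the class and field names here are illustrative, not from the post:

```java
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

// Hypothetical spout that generates its own stream by cycling through sentences.
public class SentenceSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private final String[] sentences = {"hello storm", "hello world"};
    private int index = 0;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;               // keep the collector for nextTuple()
    }

    @Override
    public void nextTuple() {
        // Called repeatedly by Storm; emit one tuple per call.
        collector.emit(new Values(sentences[index]));
        index = (index + 1) % sentences.length;
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("sentence")); // name of the emitted field
    }
}
```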
Bolt
A bolt processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into bolts: functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.
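A minimal bolt sketch in the same vein (Storm 0.9.x `backtype.storm` API assumed; the class name and the "sentence" field it consumes are illustrative):

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical bolt that appends "!!!" to each incoming sentence.
public class ExclaimBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Read the field declared by the upstream spout and transform it.
        String sentence = input.getStringByField("sentence");
        collector.emit(new Values(sentence + "!!!"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("exclaimed")); // name of the transformed field
    }
}
```

`BaseBasicBolt` handles acking automatically after `execute` returns, which keeps simple transformation bolts like this one short.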
Topology
- A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. A topology is an arbitrarily complex multi-stage stream computation. Topologies run indefinitely when deployed.
- A topology can have one spout and multiple bolts
- A topology is an arrangement of spouts and bolts
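As a sketch of the Hello World program this post promises, the following wires one spout into one bolt and runs the topology in local mode (Storm 0.9.x `backtype.storm` API; all class names and component ids are my own):

```java
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class HelloTopology {

    // Spout: endlessly emits the same greeting, once a second.
    public static class HelloSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) {
            this.collector = c;
        }
        public void nextTuple() {
            Utils.sleep(1000);                       // throttle emission
            collector.emit(new Values("Hello World"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("greeting"));
        }
    }

    // Bolt: prints each greeting it receives.
    public static class PrintBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("greeting"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("hello-spout", new HelloSpout());
        builder.setBolt("print-bolt", new PrintBolt()).shuffleGrouping("hello-spout");

        LocalCluster cluster = new LocalCluster();   // in-process cluster, no install needed
        cluster.submitTopology("hello-topology", new Config(), builder.createTopology());
        Utils.sleep(10000);                          // let it run for ten seconds
        cluster.killTopology("hello-topology");
        cluster.shutdown();
    }
}
```

`shuffleGrouping("hello-spout")` is the edge of the DAG: it subscribes the bolt to the spout's output stream and distributes tuples randomly across bolt instances.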
Understanding Storm Architecture
A Storm cluster has three sets of nodes:
1. Nimbus node
- Uploads computations for execution
- Distributes code across the cluster
- Launches workers across the cluster
- Monitors computations and reallocates workers as needed
- Analogous to Hadoop's JobTracker
2. ZooKeeper nodes
- Coordinate the Storm cluster
- ZooKeeper is a distributed coordination service installed on the coordination machines of the cluster
- On any failure, the state kept in ZooKeeper lets Nimbus detect it and reassign work to the supervisors
3. Supervisor nodes
- Communicate with Nimbus through ZooKeeper; start and stop workers according to signals from Nimbus
- Actual execution happens on the supervisor nodes; Nimbus only sends the start and stop signals
- Analogous to Hadoop's TaskTracker
Storm topology
- The work is delegated to different types of components, each responsible for a simple, specific processing task
- The input stream of a Storm cluster is handled by a component called a spout
- The spout passes the data to a component called a bolt, which transforms it in some way
- A bolt either persists the data in some sort of storage or passes it on to another bolt
- There is only one Nimbus in a cluster, so it is a single point of failure
Five key abstractions help to understand how Storm processes data:
- Tuple: an ordered list of elements, e.g. (7, 5, 6, 9)
- Stream: an unbounded sequence of tuples
- Spouts: sources of streams in a computation
- Bolts: process input streams and produce output streams; they can run functions, filters, aggregations, or joins, or talk to databases
- Topology: the overall calculation, represented visually as a network of spouts and bolts (as in the following diagram)
Nimbus
- Distributes the code and launches workers across the cluster
Real-time analytics with Storm in the Hadoop ecosystem
- We can process real-time data with the help of queue streaming and distributed processing
How Apache Storm Works
Suppose we have to process a 200,000-line text file on one machine, and processing a single record takes 6 seconds; the whole file then takes 200,000 × 6 = 1,200,000 seconds. Now we can see the real problem: long execution time.
How can we reduce the processing time? By splitting the input into two 100,000-line files and running 2 processes, we cut the time to 600,000 seconds; parallel processing achieves this. But adding ever more parallelism on a single machine is not effective. This is where distributed processing provides the solution: by running a 10,000-line chunk on each of 20 machines, we can process the entire file in 60,000 seconds.
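The arithmetic above can be sketched as a quick back-of-the-envelope calculation (the line counts and 6-second per-record time are the post's illustrative figures):

```java
// Ideal-speedup model: each machine processes lines/machines records
// sequentially, and all machines run in parallel.
public class SpeedupSketch {
    public static long seconds(long lines, long machines, long secondsPerLine) {
        return (lines / machines) * secondsPerLine;
    }

    public static void main(String[] args) {
        System.out.println(seconds(200_000, 1, 6));   // 1200000 s on one machine
        System.out.println(seconds(200_000, 2, 6));   // 600000 s with two processes
        System.out.println(seconds(200_000, 20, 6));  // 60000 s on twenty machines
    }
}
```

This ignores coordination and data-distribution overhead, which is exactly the bookkeeping Storm takes off your hands.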
What is Lambda Architecture
Lambda architecture describes a system consisting of three layers: batch processing, speed (or real-time) processing, and a serving layer for responding to queries. Find more details at http://lambda-architecture.net
Installing Apache Storm on Windows 7
A full Storm installation is not actually required to run topologies in local (standalone) mode, but the steps below set up a working single-machine environment.
- Install Java, making sure the Java path contains no spaces
- Install ZooKeeper by downloading it from its official website
- Extract it to C:\zookeeper
- Add the ZooKeeper bin directory to the PATH environment variable
- Configure zoo.cfg
- Start ZooKeeper
- Download Storm and extract it to C:\storm
- Add the Storm bin directory to the PATH environment variable
- Configure storm.yaml
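Minimal single-machine configuration sketches for the two files above (paths and ports are illustrative; adjust to your layout):

```
# zoo.cfg – minimal standalone ZooKeeper configuration
tickTime=2000
dataDir=C:/zookeeper/data
clientPort=2181
```

```yaml
# storm.yaml – minimal single-machine Storm 0.9.x configuration
storm.zookeeper.servers:
  - "127.0.0.1"
nimbus.host: "127.0.0.1"
storm.local.dir: "C:/storm/data"
supervisor.slots.ports:
  - 6700
  - 6701
```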
Different Spouts
- From a database
- From a queue
- From an API
- From a text file
Different Tuple States
- ACK – the tuple was fully processed by Storm
- ACTIVATE – the spout moves from deactivated to activated
- CLOSE – the spout is shut down
- DEACTIVATE – the spout is deactivated
- FAIL – the tuple failed to be fully processed
- NEXT TUPLE – a request for the spout to emit the next tuple to the output collector
- OPEN – spout initialization
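These states correspond to the callbacks of Storm's spout interface. A skeleton showing where each one fires, assuming Storm 0.9.x's `backtype.storm` packages (the class name and emitted values are my own):

```java
import java.util.Map;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class LifecycleSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private long id = 0;

    @Override
    public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
        this.collector = collector;                   // OPEN: initialization
    }
    @Override
    public void close() { }                           // CLOSE: spout shutdown
    @Override
    public void activate() { }                        // ACTIVATE: deactivated -> activated
    @Override
    public void deactivate() { }                      // DEACTIVATE: stop emitting
    @Override
    public void nextTuple() {                         // NEXT TUPLE: emit the next tuple
        collector.emit(new Values("msg-" + id), id);  // second arg = message id for acking
        id++;
    }
    @Override
    public void ack(Object msgId) { }                 // ACK: tuple fully processed
    @Override
    public void fail(Object msgId) { }                // FAIL: replay or log the tuple
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("msg"));
    }
}
```

Passing a message id when emitting is what enables the `ack`/`fail` callbacks; tuples emitted without an id are not tracked.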