<> Application scenarios
1. Stream processing
Storm can process new data and update databases in real time, with fault tolerance and scalability. In other words, Storm handles a continuous flow of messages and writes the results to a store after processing.
2. Continuous computation
Storm can run a continuous query and stream the results back to the client immediately. For example, it can push trending topics on Twitter to the browser.
3. Distributed remote procedure call (RPC)
Storm can be used to parallelize intensive queries. The Storm topology acts as a distributed function that waits for invocation messages. When it receives one, it computes the query and returns the results. For instance, distributed RPC can run a search in parallel or process a large data set.
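The fan-out and merge idea behind distributed RPC can be sketched in plain Python. This is a toy model under stated assumptions, not Storm's DRPC API: `PARTITIONS`, `search_partition`, and `distributed_search` are hypothetical names, and a local thread pool stands in for the bolts that would handle the query in a real topology.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data set, split into partitions as if spread over several bolts.
PARTITIONS = [
    ["storm cluster", "hadoop job"],
    ["storm topology", "zookeeper node"],
    ["spout and bolt", "storm worker"],
]

def search_partition(partition, term):
    """Return the entries of one partition that contain the search term."""
    return [entry for entry in partition if term in entry]

def distributed_search(term):
    """Fan the query out to every partition in parallel, then merge the results."""
    with ThreadPoolExecutor(max_workers=len(PARTITIONS)) as pool:
        # map() keeps partition order, so the merged list is deterministic.
        partial_results = list(pool.map(lambda p: search_partition(p, term), PARTITIONS))
    return [hit for hits in partial_results for hit in hits]

print(distributed_search("storm"))
# prints ['storm cluster', 'storm topology', 'storm worker']
```

The caller sees a single function call, while the work is split across partitions and merged before the result is returned.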
<> Operation steps
<>1. Storm overview
Storm is a real-time, distributed, reliable stream processing system. It works by delegating simple tasks to components that handle them independently. In a Storm cluster, the input stream is handled by the Spout component, which passes the data it reads to Bolt components. A Bolt processes the data tuples it receives and may pass them on to the next Bolt. A Storm cluster can be pictured as a set of chains of Bolt components: data travels along the chains, and each Bolt processes the data as a node in the chain.
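The Spout-to-Bolt chain described above can be modeled with Python generators. This is a minimal sketch, not Storm code: `sentence_spout`, `split_bolt`, `count_bolt`, and `run_chain` are hypothetical names, and a fixed list stands in for a live input stream.

```python
def sentence_spout():
    """Spout: emits input tuples (a fixed list here instead of a live stream)."""
    for sentence in ["storm processes streams", "storm bolts process tuples"]:
        yield (sentence,)

def split_bolt(tuples):
    """Bolt: splits each sentence tuple into one tuple per word."""
    for (sentence,) in tuples:
        for word in sentence.split():
            yield (word,)

def count_bolt(tuples):
    """Bolt: keeps a running count per word and emits (word, count)."""
    counts = {}
    for (word,) in tuples:
        counts[word] = counts.get(word, 0) + 1
        yield (word, counts[word])

def run_chain():
    """Wire the chain: spout -> split bolt -> count bolt."""
    return list(count_bolt(split_bolt(sentence_spout())))

for word, count in run_chain():
    print(word, count)
```

Each stage only knows about the tuples it receives, which mirrors how a Bolt works as one node in the chain.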
Storm guarantees that every message is processed, and it is fast: a small cluster can process millions of messages per second. In tests, each node processed one million data tuples per second. Its main application areas are real-time analytics, online machine learning, continuous computation, distributed RPC (remote procedure call, a protocol for requesting a service from a program on a remote computer over a network without needing to understand the underlying network technology), and ETL (extract, transform, load).
Storm and Hadoop clusters look similar on the surface. However, Hadoop runs MapReduce jobs, while Storm runs topologies, and the two are very different. The key difference is that a MapReduce job eventually finishes, whereas a topology runs forever (unless you kill it manually). Put another way, Storm is oriented toward real-time data analysis and Hadoop toward offline data analysis. The position of Storm in HDP is shown in the following figure.
<>2. Storm cluster architecture
A Storm cluster consists of one master node and multiple worker nodes. The master node runs a daemon named "Nimbus", and each worker node runs a daemon named "Supervisor". Coordination between the two is handled by ZooKeeper, which manages the different components in the cluster. The Storm cluster architecture is shown in the figure below.
<>2.1 Master node Nimbus
The master node runs a daemon, Nimbus, which responds to the nodes distributed across the cluster, assigns tasks, and monitors for failures. If a Supervisor node goes down or a Worker process aborts, Nimbus reassigns the abnormally terminated Worker process to another Supervisor node, where it continues running. This is similar to the JobTracker in Hadoop.
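The reassignment step can be illustrated with a small sketch. It assumes a simple least-loaded policy, which is an illustration only and not how Nimbus actually schedules; `reassign_workers` and the supervisor and worker names are hypothetical.

```python
def reassign_workers(assignments, failed_supervisor):
    """Move the workers of a failed supervisor to the least-loaded survivors.

    assignments maps a supervisor name to the list of workers it runs.
    The least-loaded policy is an assumption made for this sketch.
    """
    survivors = {s: list(w) for s, w in assignments.items() if s != failed_supervisor}
    if not survivors:
        raise RuntimeError("no supervisor left to take over the workers")
    for worker in assignments.get(failed_supervisor, []):
        # Pick the survivor with the fewest workers; break ties by name.
        target = min(survivors, key=lambda s: (len(survivors[s]), s))
        survivors[target].append(worker)
    return survivors

cluster = {"sup-1": ["w1", "w2"], "sup-2": ["w3"], "sup-3": ["w4"]}
print(reassign_workers(cluster, "sup-1"))
# prints {'sup-2': ['w3', 'w1'], 'sup-3': ['w4', 'w2']}
```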
<>2.2 Worker node Supervisor
Each worker node runs a process called Supervisor. The Supervisor listens for tasks that Nimbus assigns to it. It also keeps its Worker processes running normally and can restart a Worker when needed. Coordination between Nimbus and the Supervisors goes through ZooKeeper.
<>2.3 Coordination service component ZooKeeper
ZooKeeper is the service that coordinates Nimbus and the Supervisors. The real-time logic of an application is encapsulated in a Storm "topology". A topology is a graph of Spouts (data sources) and Bolts (data processing) connected by Stream Groupings.
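A stream grouping decides which task of the next Bolt receives each tuple. The sketch below models two common groupings in plain Python. It is not Storm's implementation: `fields_grouping` and `shuffle_grouping` are hypothetical helper names, CRC32 is an arbitrary choice of stable hash, and round-robin stands in for the random but even distribution of a shuffle grouping.

```python
import zlib
from itertools import cycle

def fields_grouping(tuples, num_tasks, field_index=0):
    """Route each tuple by a hash of one field, so equal values share a task."""
    tasks = [[] for _ in range(num_tasks)]
    for t in tuples:
        key = str(t[field_index]).encode("utf-8")
        tasks[zlib.crc32(key) % num_tasks].append(t)
    return tasks

def shuffle_grouping(tuples, num_tasks):
    """Spread tuples evenly over tasks (round-robin as a stand-in for shuffling)."""
    tasks = [[] for _ in range(num_tasks)]
    for t, task_id in zip(tuples, cycle(range(num_tasks))):
        tasks[task_id].append(t)
    return tasks

words = [("storm",), ("bolt",), ("storm",), ("spout",), ("storm",)]
print(fields_grouping(words, 3))
print(shuffle_grouping(words, 2))
```

Grouping by field suits stateful steps such as word counting, because every occurrence of a word reaches the same task. An even spread suits stateless steps where only load balance matters.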
<>2.4 Worker process
A Worker is a Java process that executes part of a topology. Each Worker process runs a subset of one topology and starts one or more Executor threads to run that topology's components (Spouts or Bolts), as shown in the following figure.
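The relationship between a Worker and its Executor threads can be sketched as follows. This is only an analogy in Python: a real Worker is a JVM process, and `run_worker` and the component names are hypothetical.

```python
import threading

def run_worker(component_tasks):
    """Run each assigned component task in its own executor thread.

    component_tasks maps a component name to a zero-argument callable.
    Returns a dict of component name -> result once all threads finish.
    """
    results = {}
    lock = threading.Lock()

    def executor(name, task):
        outcome = task()
        with lock:  # protect the shared results dict
            results[name] = outcome

    threads = [
        threading.Thread(target=executor, args=(name, task))
        for name, task in component_tasks.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results

# This worker is assigned one spout and one bolt of a hypothetical topology.
print(run_worker({
    "word-spout": lambda: ["storm", "bolt"],
    "count-bolt": lambda: len(["storm", "bolt"]),
}))
```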
<>3. Using Storm
Once a Storm task is started, it keeps running until it is terminated manually.