This post describes how Apache Flume works, component by component.
How Flume works
Flume’s high-level architecture is built on a streamlined codebase that is easy to use and extend. The project is designed for reliable delivery: events move between components inside channel transactions, which minimizes the risk of data loss.
Flume also supports dynamic reconfiguration without requiring a restart, which reduces downtime for its agents.
The following components make up Apache Flume:
| Component | Definition |
| --- | --- |
| Event | A singular unit of data that is transported by Flume (typically a single log entry) |
| Source | The entity through which data enters Flume. Sources either actively poll for data or passively wait for data to be delivered to them. A variety of sources allow data to be collected, such as log4j logs and syslog. |
| Sink | The entity that delivers the data to its destination. A variety of sinks allow data to be streamed to a range of destinations. One example is the HDFS sink, which writes events to HDFS. |
| Channel | The conduit between the Source and the Sink. Sources ingest events into the channel and sinks drain the channel. |
| Agent | Any physical Java virtual machine running Flume; it hosts a collection of sources, sinks, and channels. |
| Client | The entity that produces and transmits the Event to the Source operating within the Agent. |
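To make the wiring concrete, here is a minimal sketch of an agent definition in Flume's properties-file configuration format. The names (agent1, r1, c1, k1) and the netcat/memory/logger component choices are illustrative, not required by Flume:

```properties
# name this agent's components (names are arbitrary)
agent1.sources = r1
agent1.channels = c1
agent1.sinks = k1

# source: accept newline-separated text on localhost:44444
agent1.sources.r1.type = netcat
agent1.sources.r1.bind = localhost
agent1.sources.r1.port = 44444
agent1.sources.r1.channels = c1

# channel: buffer up to 1000 events in memory
agent1.channels.c1.type = memory
agent1.channels.c1.capacity = 1000

# sink: log each event (handy for testing a new flow)
agent1.sinks.k1.type = logger
agent1.sinks.k1.channel = c1
```

Note the plural `channels` on the source (a source can fan out to several channels) and the singular `channel` on the sink (a sink drains exactly one channel).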
Flume components interact in the following way:
1. a flow in Flume starts from the Client;
2. the Client transmits the Event to a Source operating within the Agent;
3. the Source receiving this Event then delivers it to one or more Channels;
4. one or more Sinks operating within the same Agent drain these Channels;
5. channels decouple the ingestion rate from the drain rate using the familiar producer-consumer model of data exchange;
6. when spikes in client-side activity cause data to be generated faster than the provisioned destination capacity can handle, the Channel size increases. This allows sources to continue their normal operation for the duration of the spike;
7. the Sink of one Agent can be chained to the Source of another Agent. This chaining enables the creation of complex data flow topologies, as sketched in the configuration after this list.
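As a hedged sketch of step 7, the usual way to chain agents is an Avro sink on the upstream agent pointed at an Avro source on the downstream agent; the host name, port, and all component names here are illustrative:

```properties
# upstream agent: forward events to the collector over Avro RPC
agent1.sinks = avroSink
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.hostname = collector-host
agent1.sinks.avroSink.port = 4141
agent1.sinks.avroSink.channel = c1

# downstream agent: receive events on port 4141 and hand them to its own channel
agent2.sources = avroSrc
agent2.sources.avroSrc.type = avro
agent2.sources.avroSrc.bind = 0.0.0.0
agent2.sources.avroSrc.port = 4141
agent2.sources.avroSrc.channels = c2
```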
Flume’s distributed architecture requires no central coordination point. Each agent runs independently of the others, there is no inherent single point of failure, and the system scales horizontally with ease.
Flume structure
1. The external structure of Flume:
Data produced by data generators (e.g., Facebook, Twitter) is collected by an individual agent running on the server where the data is generated. A data collector then gathers the data from each of these agents and aggregates it, and the data is stored in HDFS or HBase.
2. Flume Event
The event is the most basic unit of Flume's internal data transfer. It consists of an optional header (a map of string key/value pairs) and a body (a byte array carrying the payload), and it travels from the data source access point, through the channel, to the destination (i.e., HDFS/HBase).
A typical Flume event therefore has the structure: headers (key/value map) + body (byte array).
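As a minimal sketch, this is how an event can be built programmatically with Flume's EventBuilder API; the header keys and the log line below are made-up examples:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.event.EventBuilder;

public class EventExample {
    public static void main(String[] args) {
        // optional headers: string key/value pairs carrying routing metadata
        // (the keys below are arbitrary examples, not required by Flume)
        Map<String, String> headers = new HashMap<>();
        headers.put("host", "web-01");
        headers.put("timestamp", String.valueOf(System.currentTimeMillis()));

        // the body is an opaque byte array, typically one log line
        Event event = EventBuilder.withBody(
                "GET /index.html 200".getBytes(StandardCharsets.UTF_8), headers);

        System.out.println("headers = " + event.getHeaders());
        System.out.println("body    = " + new String(event.getBody(), StandardCharsets.UTF_8));
    }
}
```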
When we want to write a custom plugin, for example an HBase sink in the style of flume-hbase-sink, the plugin takes an event, parses it, filters it as the situation requires, and then writes it on to HBase or HDFS.
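The skeleton below is a hedged sketch of such a custom sink built on Flume's AbstractSink; the class name FilteringSink and the filtering step are hypothetical, and a real implementation would write to HBase/HDFS where the comment indicates:

```java
import org.apache.flume.Channel;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.Transaction;
import org.apache.flume.conf.Configurable;
import org.apache.flume.sink.AbstractSink;

public class FilteringSink extends AbstractSink implements Configurable {
    @Override
    public void configure(Context context) {
        // read this sink's properties from the agent configuration here
    }

    @Override
    public Status process() throws EventDeliveryException {
        Channel channel = getChannel();
        Transaction tx = channel.getTransaction();
        tx.begin();
        try {
            Event event = channel.take();
            if (event == null) {
                // channel is empty: commit the no-op and ask Flume to back off
                tx.commit();
                return Status.BACKOFF;
            }
            // parse/filter the event body, then deliver it downstream
            // (a real sink would write to HBase or HDFS here)
            tx.commit();
            return Status.READY;
        } catch (Exception e) {
            tx.rollback();
            throw new EventDeliveryException("failed to deliver event", e);
        } finally {
            tx.close();
        }
    }
}
```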
3. Flume Agent
Having seen Flume's external structure, we know that a Flume deployment consists of one or more Agents. Each Agent is a separate daemon process (JVM) that receives data from clients or from other Agents, and then quickly forwards the acquired data to its next destination: a sink or another Agent. The basic model of Flume is the source-channel-sink pipeline described below.
The Agent is mainly composed of three components: source, channel and sink.
Source
The source receives data from the data generator and passes it to one or more channels in Flume's event format. Flume provides multiple ways to receive data, such as Avro, Thrift, and an experimental Twitter 1% firehose source.
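For instance, a source is defined with a few properties; this sketch uses the exec source to tail a log file, and the file path and names are illustrative:

```properties
# hypothetical source: tail a log file and feed channel c1
agent1.sources = tailSrc
agent1.sources.tailSrc.type = exec
agent1.sources.tailSrc.command = tail -F /var/log/app/app.log
agent1.sources.tailSrc.channels = c1
```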
Channel
The channel is a transient storage container: it buffers the data, in event format, that it receives from the source until the sinks consume it, acting as a bridge between the sources and the sinks. Channels are transactional: every put and take happens inside a transaction, which guarantees the consistency of the data when it is sent and received. A channel can be linked with any number of sources and sinks. Supported types include the JDBC channel, the file channel, and the memory channel.
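A durable channel is configured in the same properties style; in this sketch the directories and capacities are illustrative, and the file channel is chosen because it survives agent restarts:

```properties
# hypothetical file channel: durable, disk-backed buffering
agent1.channels = c1
agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /var/flume/checkpoint
agent1.channels.c1.dataDirs = /var/flume/data
agent1.channels.c1.capacity = 100000
agent1.channels.c1.transactionCapacity = 1000
```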
Sink
The sink consumes events from the channels and delivers them to centralized storage such as HBase and HDFS. The destination may be a terminal store like HDFS or HBase, or the next agent in a multi-hop flow.
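As a sketch of a terminal sink, the configuration below writes events into an HBase table; the table name, column family, and serializer choice are illustrative:

```properties
# hypothetical HBase sink: write each event into a table's column family
agent1.sinks = hbaseSink
agent1.sinks.hbaseSink.type = hbase
agent1.sinks.hbaseSink.table = flume_events
agent1.sinks.hbaseSink.columnFamily = cf
agent1.sinks.hbaseSink.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
agent1.sinks.hbaseSink.channel = c1
```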
As an example of how the three components combine, see the end-to-end configuration below.
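This is a hedged, minimal end-to-end agent in which every name and path is illustrative: a tailed log file flows through a durable file channel into date-partitioned HDFS files:

```properties
# hypothetical end-to-end agent: tail a log file -> file channel -> HDFS
agent1.sources = tailSrc
agent1.channels = c1
agent1.sinks = hdfsSink

agent1.sources.tailSrc.type = exec
agent1.sources.tailSrc.command = tail -F /var/log/app/app.log
agent1.sources.tailSrc.channels = c1

agent1.channels.c1.type = file
agent1.channels.c1.checkpointDir = /var/flume/checkpoint
agent1.channels.c1.dataDirs = /var/flume/data

agent1.sinks.hdfsSink.type = hdfs
agent1.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
agent1.sinks.hdfsSink.hdfs.fileType = DataStream
agent1.sinks.hdfsSink.hdfs.useLocalTimeStamp = true
agent1.sinks.hdfsSink.channel = c1
```

Saved as, say, example.conf, such an agent is started with `bin/flume-ng agent --conf conf --conf-file example.conf --name agent1`.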
That's all, thanks!