Hello, everyone!
In this post, I want to show the principles and architecture of Flink.
What is Flink?
Apache Flink is an open-source framework for distributed, high-performance stream processing applications. Flink not only supports real-time computing with high throughput and exactly-once semantics, but also provides batch data processing.
Among the data processing engines on the market, both Flink and Spark support stream processing and batch processing, but their technical philosophies differ: Spark simulates stream computing on top of batch computing, whereas Flink does the opposite and treats batch computing as a special case of stream computing.
Key Concepts
Continuous processing of streaming data, event time, stateful stream processing, and state snapshots.
Core Ideas
The biggest difference between Flink and other stream computing engines is state management.
Flink provides built-in state management. You can store states in Flink instead of storing them in an external system. Doing this can:
Reduce the dependency of the computing engine on external systems, simplifying deployment and O&M.
Significantly improve performance.
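For illustration, here is a minimal sketch of Flink's built-in keyed state using the ValueState API (the class name and the counting logic are made up for this example):

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Counts events per key, keeping the running count in Flink-managed state
// instead of an external store such as Redis.
public class CountPerKey extends KeyedProcessFunction<String, String, Long> {
    private transient ValueState<Long> countState;

    @Override
    public void open(Configuration parameters) {
        countState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("count", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<Long> out) throws Exception {
        Long current = countState.value();  // null on first access for a key
        long updated = (current == null) ? 1L : current + 1L;
        countState.update(updated);         // snapshotted by Flink checkpoints automatically
        out.collect(updated);
    }
}
```

You would apply it with something like `events.keyBy(e -> e).process(new CountPerKey())`; the state lives inside Flink, is scoped per key, and is checkpointed together with the job.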
Overall Architecture of Flink Runtime
DataStream and DataSet
DataStream
DataStream: represents streaming data. You can think of a DataStream as an immutable collection of data that can contain duplicates. The number of elements in a DataStream is unlimited.
This section describes the operations between DataStreams.
1. WindowedStream aggregates the data in each window using functions such as reduce, fold, sum, and max.
2. ConnectedStream connects two streams and processes them using the map and flatMap functions.
3. JoinedStream joins streams using the join function, which is similar to the join operation in a database.
4. CoGroupedStream associates streams using the coGroup function, which is similar to the group operation in a relational database.
5. KeyedStream partitions data streams by key using the keyBy function (see the sketch after this list).
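As a hedged sketch of items 1 and 5 above (the input values and window size are purely illustrative):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class KeyedWindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Illustrative input: (region, clicks) pairs.
        DataStream<Tuple2<String, Integer>> clicks = env.fromElements(
                Tuple2.of("east", 1), Tuple2.of("west", 1), Tuple2.of("east", 1));

        clicks.keyBy(0)                   // 5. KeyedStream: partition by the region field
              .timeWindow(Time.hours(1))  // WindowedStream: tumbling 1-hour windows
              .sum(1)                     // 1. window aggregation with sum
              .print();

        env.execute("keyed window sketch");
    }
}
```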
DataSet
DataSet programs in Flink are regular programs that implement transformations on data sets (for example, filtering, mapping, joining, and grouping). The data sets are initially created by reading files or from local collections. Results are returned via sinks, which can write the data to (distributed) files, or to standard output (for example, the command line terminal).
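A minimal DataSet sketch of that pattern, a word count built from a local collection (the input strings are placeholders):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class DataSetWordCount {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Initial data set created from a local collection.
        DataSet<String> text = env.fromElements("to be", "or not to be");

        DataSet<Tuple2<String, Integer>> counts = text
                // transformation: split lines into (word, 1) pairs
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split(" ")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .groupBy(0)   // grouping by the word field
                .sum(1);      // aggregation: count per word

        counts.print();       // sink: standard output
    }
}
```

Note that print() on a DataSet triggers execution itself, so no explicit execute() call is needed here.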
Flink Program
Flink programs consist of three parts: Source, Transformation, and Sink. The Source reads data from data sources such as HDFS, Kafka, and text files. The Transformation performs data transformation operations. The Sink writes the final output, for example to HDFS, Kafka, or text files. The data that flows between these parts is called a stream.
Flink Program Composition
1. Obtain the execution environment. The StreamExecutionEnvironment is the basis of all Flink programs. You can create an execution environment in any of the following ways:
StreamExecutionEnvironment.getExecutionEnvironment
StreamExecutionEnvironment.createLocalEnvironment
StreamExecutionEnvironment.createRemoteEnvironment
2. Load/create the raw data.
3. Specify transformations on this data.
4. Specify where to put the results of your computations.
5. Trigger the program execution.
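A minimal sketch that walks through these five steps (the socket host and port are placeholders; run `nc -lk 9999` first to provide input):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FiveSteps {
    public static void main(String[] args) throws Exception {
        // 1. Obtain the execution environment.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 2. Load/create the raw data.
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // 3. Specify transformations on this data.
        DataStream<String> upper = lines.map(String::toUpperCase);

        // 4. Specify where to put the results of your computations.
        upper.print();

        // 5. Trigger the program execution.
        env.execute("five-step sketch");
    }
}
```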
Flink Data Sources
1. Data sources for batch processing
File-based data: files can be read from HDFS or a local file system, in formats such as Text, CSV, Avro, and Hadoop input formats.
JDBC-based data
HBase-based data
Collection-based data
2. Data sources for stream processing
File-based data source: readTextFile(path) reads a text file line by line according to TextInputFormat and returns each line as a record.
Collection-based data source: fromCollection() creates a data stream from a collection. All elements in the collection must be of the same type.
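A hedged sketch of both source types (the HDFS path is a placeholder):

```java
import java.util.Arrays;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // File-based source: reads the file line by line using TextInputFormat.
        DataStream<String> fileStream = env.readTextFile("hdfs:///path/to/input.txt");

        // Collection-based source: all elements must be of the same type.
        DataStream<Integer> collectionStream = env.fromCollection(Arrays.asList(1, 2, 3));

        fileStream.print();
        collectionStream.print();
        env.execute("source sketch");
    }
}
```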
Flink Program Running
Flink adopts a master-slave architecture. When a Flink cluster starts, one JobManager process and at least one TaskManager process are started. After a Flink program is submitted, a client is created for preprocessing: the program is converted into a DAG that represents the complete job and is submitted to the JobManager, which assigns each task in the job to a TaskManager.
Computing resources in Flink are defined by task slots. Each slot represents a fixed-size resource pool of a TaskManager. For example, a TaskManager with three slots divides its managed memory evenly into three parts, one per slot. Slotting resources means that tasks from different jobs do not compete for memory. Currently, slots provide only memory isolation, not CPU isolation.
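For example, the number of slots per TaskManager is set in flink-conf.yaml (the value 3 matches the example above):

```yaml
# Divide this TaskManager's managed memory evenly among three slots.
taskmanager.numberOfTaskSlots: 3
```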
When a Flink program is executed, it is first converted into a streaming dataflow. A streaming dataflow is a DAG consisting of a group of streams and transformation operators.
Flink Job Running Process
A user submits a Flink program to JobClient. JobClient processes, parses, and optimizes the program, and then submits the program to JobManager. TaskManager runs the task.
JobClient is a bridge between Flink and JobManager. It mainly receives, parses, and optimizes program execution plans, and submits the plans to JobManager. There are three types of operators in Flink.
Source Operator: data source operations, such as reading from files, sockets, or Kafka.
Transformation Operator: data transformation operations, such as map, flatMap, and reduce.
Sink Operator: data output operations, such as writing results to HDFS, MySQL, or Kafka.
A Complete Flink Program
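A minimal sketch combining the Source, Transformation, and Sink operators described above (a socket word count; host and port are placeholders):

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SocketWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Source operator: read lines from a socket (start e.g. `nc -lk 9999` first).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Transformation operators: split lines into words, then count per word.
        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                    for (String word : line.split("\\s+")) {
                        out.collect(Tuple2.of(word, 1));
                    }
                })
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(0)
                .sum(1);

        // Sink operator: print results to standard output.
        counts.print();

        env.execute("socket word count");
    }
}
```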
Flink Data Processing
Apache Flink supports both batch and stream processing and can be used to build a wide range of event-driven applications.
Flink is, first of all, a pure stream computing engine with the data stream as its basic data model. A stream can be unbounded, which is stream processing in the general sense, or it can be a finite stream with boundaries, as in the case of batch processing. Flink thus uses a single architecture to support both stream and batch processing.
Flink has the additional strength of supporting stateful computing. Processing is called stateless when the result of processing an event depends only on the content of that event itself. Conversely, if the result also depends on previously processed events, it is called stateful processing.
Bounded Stream and Unbounded Stream
Unbounded stream: Only the start of a stream is defined. Data sources generate data endlessly, so the data of an unbounded stream must be processed continuously: each record must be processed immediately after it is read. You cannot wait until all data arrives before processing, because the input is infinite and will never be complete. Processing unbounded stream data typically requires ingesting events in a particular order, such as the order in which the events occurred, so that the completeness of the results can be inferred.
Bounded stream: Both the start and the end of a stream are defined. A bounded stream can be computed after all of its data has been read, and its data can be sorted, so events need not be ingested in order. Bounded stream processing is often referred to as batch processing.
Batch Processing Example
Batch processing is a very special case of stream processing. Instead of defining a sliding or tumbling window over the data and producing results every time the window slides, a global window is defined, with all records belonging to the same window. For example, consider a simple Flink program that continuously counts a website's visitors every hour, grouped by region.
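A sketch of that continuous program, assuming a `DataStream<Tuple2<String, Integer>>` named `visits` holding (region, 1) records (names are illustrative):

```java
// Continuous stream version: emit a count per region at the end of every hour.
DataStream<Tuple2<String, Integer>> counts = visits
        .keyBy(0)                   // group by region
        .timeWindow(Time.hours(1))  // tumbling 1-hour windows
        .sum(1);                    // visits per region per hour
```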
If you know that the input data is bounded, you can implement batch processing using the following code:
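(A sketch with the same assumptions about `visits`; `EndOfStreamTrigger` is hypothetical, since Flink does not ship a built-in end-of-input trigger.)

```java
// Bounded input in the streaming model: one global window holds all records,
// and the result is emitted once when the input ends.
DataStream<Tuple2<String, Integer>> counts = visits
        .keyBy(0)
        .window(GlobalWindows.create())
        .trigger(new EndOfStreamTrigger())  // hypothetical trigger that fires at end of input
        .sum(1);
```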
If the input data is bounded, the result of the following code is the same as that of the preceding code:
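(Again a sketch, this time with `visits` as a `DataSet<Tuple2<String, Integer>>`.)

```java
// Batch version: same semantics as the global-window program on bounded input.
DataSet<Tuple2<String, Integer>> counts = visits
        .groupBy(0)  // group by region
        .sum(1);     // total visits per region
```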
Flink Batch Processing Model
Flink uses one underlying engine to support both stream and batch processing.
Stream and Batch Processing Mechanisms
These two mechanisms correspond to their respective APIs (the DataStream API and the DataSet API). When creating a Flink job, you cannot combine the two mechanisms to use all of Flink's functionality at the same time.
Flink supports two types of relational APIs: Table API and SQL. Both of these APIs are for unified batch and stream processing, which means that relational APIs execute queries with the same semantics and produce the same results on unbounded data streams and bounded data streams.
The Table API and SQL APIs are becoming the main APIs for analytical use cases.
The DataStream API is the primary API for data-driven applications and pipelines.
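As a hedged sketch of the relational APIs (the `visits` table is hypothetical and would be registered elsewhere, for example with CREATE TABLE; API names follow the Table API docs):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class UnifiedSql {
    public static void main(String[] args) {
        // The same SQL query has identical semantics in streaming or batch mode;
        // switching inStreamingMode() to inBatchMode() does not change the result.
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        Table counts = tEnv.sqlQuery(
                "SELECT region, COUNT(*) AS cnt FROM visits GROUP BY region");
    }
}
```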
That's all, thanks!