
HCIA-Big Data | Introduction to Flink

Latest reply: Jul 27, 2022 11:28:07


The above question is tied to the following activity: Join the class of Big Data 101!



Hello, everyone!

In this post, I want to introduce the principles and architecture of Flink.

What is Flink?

Apache Flink is an open-source stream processing framework for distributed, high-performance stream processing applications. Flink not only supports real-time computing with both high throughput and exactly-once semantics, but also provides batch data processing.

Compared with other data processing engines on the market, both Flink and Spark support stream processing and batch processing. In terms of technical philosophy, however, they are opposites: Spark simulates stream computing on top of batch computing, while Flink simulates batch computing on top of stream computing.

Key Concepts

Flink's key concepts are continuous processing of streaming data, event time, stateful stream processing, and state snapshots.


Core Ideas

The biggest difference between Flink and other stream computing engines is state management.

Flink provides built-in state management. You can store states in Flink instead of storing them in an external system. Doing this can:

  • Reduce the dependency of the computing engine on external systems, simplifying deployment and O&M.

  • Significantly improve performance.
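To illustrate the idea, here is a plain-Python sketch (not the Flink API) of an operator that keeps its per-key state inside the operator itself, the way Flink's built-in keyed state avoids a round trip to an external system:

```python
from collections import defaultdict

class RunningCount:
    """Toy analogue of a Flink keyed operator with built-in state.

    Flink would keep this per-key state in managed state and checkpoint it;
    here a plain dict stands in for the state backend.
    """
    def __init__(self):
        self._state = defaultdict(int)  # key -> count, held inside the operator

    def process(self, key, _event):
        # Read, update, and write state locally -- no external store involved.
        self._state[key] += 1
        return key, self._state[key]

op = RunningCount()
events = [("user-a", 1), ("user-b", 1), ("user-a", 1)]
results = [op.process(k, e) for k, e in events]
print(results)  # [('user-a', 1), ('user-b', 1), ('user-a', 2)]
```

Because the state lives with the operator, each event is processed with a local read and write, which is where the performance benefit comes from.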

Overall Architecture of Flink Runtime


DataStream and DataSet 

DataStream

DataStream: represents streaming data. You can think of a DataStream as an immutable collection of data that can contain duplicates. The number of elements in a DataStream is unlimited.


This section describes the operator operations between DataStreams.

1. Window: windowed streams aggregate data within windows using functions such as reduce, fold, sum, and max.

2. ConnectedStream: connects two streams and processes them using the map and flatMap functions.

3. JoinedStream: joins streams using the join function, similar to the join operation in a database.

4. CoGroupedStream: associates streams using the coGroup function, similar to the group operation in a relational database.

5. KeyedStream: partitions a data stream by key using the keyBy function.
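To make the keyed-stream semantics concrete, here is a plain-Python sketch (not Flink code) of how keyBy followed by reduce behaves: the stream is partitioned by key, each partition is folded incrementally, and the updated aggregate is emitted per element:

```python
# Plain-Python sketch of keyBy(...).reduce(...) semantics (not the Flink API).
def key_by_reduce(stream, key_fn, reduce_fn):
    state = {}   # per-key aggregate, one entry per logical partition
    out = []
    for record in stream:
        k = key_fn(record)
        state[k] = record if k not in state else reduce_fn(state[k], record)
        out.append(state[k])  # like Flink, emit the updated aggregate per element
    return out

clicks = [("eu", 1), ("us", 2), ("eu", 3)]  # (region, count) events
print(key_by_reduce(clicks,
                    key_fn=lambda r: r[0],
                    reduce_fn=lambda a, b: (a[0], a[1] + b[1])))
# [('eu', 1), ('us', 2), ('eu', 4)]
```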

DataSet

DataSet programs in Flink are regular programs that implement transformations on data sets (for example, filtering, mapping, joining, and grouping). The data sets are initially created by reading files or from local collections. Results are returned via sinks, which can write the data to (distributed) files, or to standard output (for example, the command line terminal).

Flink Program

Flink programs consist of three parts: Source, Transformation, and Sink. The Source reads data from data sources such as HDFS, Kafka, and text files. The Transformation performs data transformation operations. The Sink writes the final output (to HDFS, Kafka, text files, and so on). The data that flows between these parts is called a stream.


Flink Program Composition

1. Obtain the execution environment. The StreamExecutionEnvironment is the basis of all Flink programs. You can create an execution environment in any of the following ways: 

StreamExecutionEnvironment.getExecutionEnvironment 

StreamExecutionEnvironment.createLocalEnvironment 

StreamExecutionEnvironment.createRemoteEnvironment 

2. Load/create the raw data. 

3. Specify transformations on this data. 

4. Specify where to put the results of your computations. 

5. Trigger the program execution. 
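The five steps above can be mirrored with a toy pipeline in plain Python (the class and method names below are illustrative, not the Flink API). Note that, as in Flink, nothing runs until execution is triggered:

```python
# Toy pipeline mirroring the five steps above (plain Python, not Flink).
class Env:                                   # step 1: "execution environment"
    def __init__(self):
        self.pipeline = []

    def from_collection(self, data):         # step 2: load/create the raw data
        self.source = list(data)
        return self

    def map(self, fn):                       # step 3: specify transformations (lazy)
        self.pipeline.append(fn)
        return self

    def add_sink(self, sink):                # step 4: specify where results go
        self.sink = sink
        return self

    def execute(self):                       # step 5: nothing runs until now
        for record in self.source:
            for fn in self.pipeline:
                record = fn(record)
            self.sink.append(record)

out = []
env = Env()
env.from_collection([1, 2, 3]).map(lambda x: x * 2).add_sink(out)
env.execute()
print(out)  # [2, 4, 6]
```

The key design point this imitates is laziness: declaring the source, transformations, and sink only builds a plan, and the plan runs only when execute() is called.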


Flink Data Sources 

1.  Batch processing of data sources 

  • File-based data. Supported formats include Text, CSV, Avro, and Hadoop input formats; the files can be stored in HDFS.

  • JDBC-based data

  • HBase-based data

  • Collection-based data 

2. Stream processing of data sources 

  • File-based data source: readTextFile(path) reads a text file line by line based on the TextInputFormat read rule and returns the result.

  • Collection-based data source: fromCollection() creates a data stream from a collection. All elements in the collection must be of the same type.
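A plain-Python imitation of these two sources (readTextFile and fromCollection are the real Flink method names; the functions below only mimic their behavior):

```python
import os
import tempfile

def read_text_file(path):
    # Mimics readTextFile(path): read a text file line by line.
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")

def from_collection(elements):
    # Mimics fromCollection(...): stream over an in-memory collection;
    # all elements are assumed to be of the same type.
    yield from elements

# Demonstrate the file-based source on a throwaway temp file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("hello\nworld\n")
    path = f.name

lines = list(read_text_file(path))
os.unlink(path)
print(lines)                             # ['hello', 'world']
print(list(from_collection([1, 2, 3]))) # [1, 2, 3]
```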

Flink Program Running

Flink Program Running Diagram

Flink adopts a master-slave architecture. When a Flink cluster is started, a JobManager process and at least one TaskManager process are started. After a Flink program is submitted, a client is created for preprocessing. The program is converted into a DAG that represents a complete job and is submitted to the JobManager. The JobManager then assigns each task in the job to a TaskManager. Computing resources in Flink are defined by task slots. Each slot represents a fixed-size resource pool of a TaskManager. For example, a TaskManager with three slots evenly divides its managed memory into three parts and allocates one part to each slot. Slotting resources means that tasks from different jobs do not compete for memory. Currently, slots support only memory isolation, not CPU isolation.
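The slot arithmetic in the example can be written out as a quick sketch (the 3072 MB figure is an assumed value for illustration only):

```python
def slot_memory_mb(managed_memory_mb, num_slots):
    # Each slot gets an equal, fixed share of the TaskManager's managed memory.
    # This is memory isolation only -- slots do not isolate CPU.
    return managed_memory_mb / num_slots

# A hypothetical TaskManager with 3072 MB of managed memory and three slots:
print(slot_memory_mb(3072, 3))  # 1024.0 MB per slot
```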

When a Flink program is executed, it is first converted into a streaming dataflow. A streaming dataflow is a DAG consisting of a group of streams and transformation operators. 

A user submits a Flink program to JobClient. JobClient processes, parses, and optimizes the program, and then submits the program to JobManager. TaskManager runs the task. 

Flink Job Running Process 


Flink Job Running Process

JobClient is a bridge between Flink and JobManager. It mainly receives, parses, and optimizes program execution plans, and submits the plans to JobManager. There are three types of operators in Flink.

  • Source Operator: data source operations, such as file, socket, and Kafka.

  • Transformation Operator: data transformation operations, such as map, flatMap, and reduce.

  • Sink Operator: data storage operations. For example, data is stored in HDFS, MySQL, and Kafka.

A Complete Flink Program


Flink Data Processing

Apache Flink supports both batch and stream processing and can be used to build a wide range of event-based applications.

  • Flink is first of all a pure stream computing engine with data stream as its basic data model. A stream can be infinite and borderless, which describes stream processing in the general sense, or can be a finite stream with boundaries, as in the case of batch processing. Flink thus uses a single architecture to support both stream and batch processing.

  • Flink also has the strength of supporting stateful computing. Processing is called stateless when the result of processing an event depends only on the content of that event itself. Conversely, if the result also depends on previously processed events, the processing is called stateful.

Bounded Stream and Unbounded Stream

Unbounded stream: Only the start of the stream is defined. The data source generates data endlessly, so the data must be processed continuously: each event is processed immediately after being read. You cannot wait for all data to arrive before processing, because the input is infinite and will never be complete. Processing unbounded data typically requires ingesting events in a particular order, such as the order in which they occurred, so that the completeness of the results can be inferred.

Bounded stream: Both the start and end of the stream are defined. A bounded stream can be computed after all of its data has been read, and its data can be sorted, so ingesting events in order is not required. Bounded stream processing is often referred to as batch processing.
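The practical difference can be sketched in plain Python: an unbounded source can only ever be consumed incrementally, while a bounded collection can be sorted and processed after all data has been read:

```python
import itertools

# Unbounded: an endless generator -- you can never "wait for all data",
# so results must be produced incrementally as events arrive.
def unbounded_source():
    n = 0
    while True:
        yield n
        n += 1

running_sum, total = [], 0
for event in itertools.islice(unbounded_source(), 5):  # we only ever see a prefix
    total += event
    running_sum.append(total)           # one result per event
print(running_sum)                      # [0, 1, 3, 6, 10]

# Bounded: all data is available, so it can be sorted and computed in one pass.
bounded = [3, 1, 2, 0, 4]
print(sum(sorted(bounded)))             # 10 -- compute after reading everything
```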


Batch Processing Example

Batch processing is a special case of stream processing. Instead of defining a sliding or tumbling window over the data and producing results every time the window slides, a global window is defined, with all records belonging to the same window. For example, a simple Flink program that continuously counts a website's visitors every hour, grouped by region, looks like the following:

Batch Processing Example

If you know that the input data is bounded, you can implement batch processing using the following code:

Batch Processing Example

If the input data is bounded, the result of the following code is the same as that of the preceding code:

Batch Processing Example
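The equivalence described above can be imitated in plain Python (the visit data is hypothetical, and Counter stands in for Flink's windowed aggregation): counting (hour, region) pairs event by event, streaming style, gives the same result as one global pass over the bounded input:

```python
from collections import Counter

visits = [  # hypothetical (hour, region) click events
    (9, "eu"), (9, "us"), (9, "eu"), (10, "eu"), (10, "us"), (10, "us"),
]

# "Streaming" style: a tumbling one-hour window keyed by region --
# counts are updated incrementally, event by event, as data arrives.
streaming_counts = Counter()
for hour, region in visits:
    streaming_counts[(hour, region)] += 1

# "Batch" style: one global pass over the (bounded) input.
batch_counts = Counter(visits)

print(streaming_counts == batch_counts)  # True -- same result on bounded input
```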

Flink Batch Processing Model

Flink uses a bottom-layer engine to support both stream and batch processing.


Stream and Batch Processing Mechanisms

These two mechanisms correspond to their respective APIs (the DataStream API and the DataSet API). When creating a Flink job, you cannot combine the two sets of mechanisms to use all Flink functions at the same time.

Flink supports two types of relational APIs: Table API and SQL. Both of these APIs are for unified batch and stream processing, which means that relational APIs execute queries with the same semantics and produce the same results on unbounded data streams and bounded data streams.
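This "same semantics, same results" property can be imitated without Flink. Below, a SQL GROUP BY over a bounded table (using Python's built-in sqlite3 as a stand-in for Flink SQL) and an incremental, record-by-record aggregation produce identical results; the sample rows are hypothetical:

```python
import sqlite3
from collections import Counter

rows = [("eu", 1), ("us", 1), ("eu", 1)]  # hypothetical (region, cnt) records

# Relational query over the bounded data ...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (region TEXT, cnt INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
sql_result = dict(conn.execute(
    "SELECT region, SUM(cnt) FROM visits GROUP BY region"))

# ... and the same aggregation maintained incrementally, record by record,
# the way a stream processor would keep it up to date.
stream_result = Counter()
for region, cnt in rows:
    stream_result[region] += cnt

print(sql_result == dict(stream_result))  # True -- same semantics, same result
```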

  • The Table API and SQL APIs are becoming the main APIs for analytical use cases.

  • The DataStream API is the primary API for data-driven applications and pipelines.

Summary of Flink-related posts

[FI Components] Basic principle of Flink

[FI Components] HA Solution Description for Flink

[FI Components] Flink Component Log Introduction

Flink Basic Concepts

Flink - DDL statements

Flink SQL - Basic concepts

Flink SQL - Keywords

Flink SQL - Data type conversion

Flink SQL - DDL statements

Flink SQL - DML statements

Flink SQL - Query statements

Flink SQL - Create a data view

Flink SQL - Window functions

Flink SQL - Built-in functions

Flink SQL - UDX

Flink SQL - Built-in functions: Aggregate functions

Flink SQL - Built-in functions: CONCAT_AGG

Flink SQL - Built-in functions: Aggregate functions - COUNT

Flink SQL - Built-in functions: Aggregate functions - FIRST_VALUE

Flink SQL - Built-in functions: Aggregate functions - LAST_VALUE

Flink SQL - Built-in functions: Aggregate functions - MAX

Flink SQL - Built-in functions: Aggregate functions - MIN

Flink SQL - Built-in functions: Aggregate functions - SUM

Flink SQL - Built-in functions: Aggregate functions - VAR_POP

Flink SQL - Built-in functions: Aggregate functions - STDDEV_POP

That's all, thanks!

The post is synchronized to: HCIA-Big Data

wissal (MVE) | Created Apr 18, 2022 16:24:07
I really appreciate your effort, thanks a bunch

olive.zhao | Created Apr 19, 2022 01:26:49
Thanks for your support, wissal!

VinceD (Moderator) | Created Apr 21, 2022 01:54:12
interesting content.

NTan33 | Created May 10, 2022 01:47:53
Quite a comprehensive post.

olive.zhao | Created May 17, 2022 01:41:21
Thanks!

DragonVN | Created May 10, 2022 02:13:17
Great one

JohnTr | Created May 10, 2022 04:02:29
Thanks for sharing

thibay | Created May 10, 2022 05:22:56
Good share

MahMush (Moderator, Author) | Created May 15, 2022 09:41:19
Flink is a Java-friendly stream processing framework.

SANJ (MVE) | Created Jul 18, 2022 10:47:43
Q1) Answer: When creating a Flink job, you cannot combine the two sets of mechanisms to use all Flink functions at the same time.

wissal (MVE) | Created Jul 18, 2022 10:50:09
When creating a Flink job, you cannot combine the two sets of mechanisms to use all Flink functions at the same time.
