Got it

HCIA-Big Data | Flink Watermark

Latest reply: Jun 1, 2022 07:04:20 614 17 4 0 0

Hello, everyone!

When you use event time to process stream data, you may encounter out-of-order data. Stream processing takes a certain period of time when an event is generated, enters a source, and is transmitted to an operator. In most cases, data transmitted to an operator are sorted based on the time the events are generated. However, the out-of-order problem caused by network delay may also occur. Disorder indicates that the events received by Flink are not sorted strictly based on the event time. 

Out-of-Order Problem

When an out-of-order event occurs, if window running is determined only based on the event time, it cannot be determined whether all data is ready, but the system cannot wait indefinitely for data to arrive. In this case, a mechanism is required to ensure that the window is triggered for calculation after a specific time. This mechanism is called a watermark. 

A watermark is a mechanism for measuring the progress of event time. It is a hidden attribute of data, which carries watermarks. Watermarks are used to process out-of-order events and are usually implemented together with windows.

Out-of-Order Example

An app records all user clicks and sends logs back. If the network condition is poor, the logs are saved locally and sent back later. User A performs operations on the app at 11:02 and user B performs operations on the app at 11:03. However, the network of user A is unstable and log backhaul is delayed. As a result, the server receives the message from user B at 11:03 and then the message from user A at 11:02.

Why Are Watermarks Needed?

For infinite data sets, there is no effective way to determine data integrity. Therefore, the watermark is a concept based on event time to describe the integrity of data streams. If events are measured by processing time, everything is ordered and perfect. Therefore, watermarks are not required. In other words, 

event time causes disorder. Watermarks are used to solve the disorder problem. 

The so-called disorder is actually an event delay. For a delayed element, it is impossible to wait indefinitely. A mechanism must be provided to ensure that the window is triggered for calculation after a specific time. This special mechanism is a watermark, which tells operators that delayed messages should no longer be received.

Watermark Principles 

How does Flink ensure that all data has been processed when the event time-based window is destroyed? This is what a watermark is. A watermark is a monotonically increasing timestamp t. Watermark(t) indicates that all data whose timestamp is less than or equal to t has arrived, and data whose timestamp is less than or equal to twill not be received in the future. Therefore, the window can be safely triggered and 

destroyed.

stream

Then, how does Flink calculate the watermark value? 

Watermark = Maximum time an event enters Flink (maxEventTime) – Specified delay time (t) 

How does a window with watermarks trigger the window function? 

The following figure shows the watermark of ordered streams (Watermark is set to 0).

Watermark is set to 0

The following figure shows the watermark of out-of-order streams (Watermark is set to 2).

Watermark is set to  2

For each piece of data Flink receives, a watermark is generated. The watermark is equal to the value of maxEventTime minus the delay. When the watermark carried with the data is later than the stop time of the window that is not triggered, the corresponding window is triggered. Watermarks are carried with data. Therefore, if new data cannot be obtained during running, windows that are not triggered will never be triggered. 

In the preceding figure, the allowed maximum delay is 2s. Therefore, the watermark of the event whose timestamp is 7s is 5s, and the watermark of the event whose timestamp is 12s is 10s. If window 1 is 1s to 5s and window 2 is 6s to 10s, window 1 is triggered when an event whose timestamp is 7s arrives and window 2 is triggered when an event whose timestamp is 12s arrives. 

Delayed Data

Watermark is a means of coping with out-of-order data, but in the real world we cannot obtain a perfect watermark value - either it cannot be obtained, or it is too costly. Therefore, in actual use, we will use an approximate Watermark(t) value, but there is still a low probability that the data before the timestamp t will be received. In Flink, the data is defined as late elements. Similarly, we can specify the maximum latency allowed in the window (0 by default), which can be set using the following code:

Delayed Data

Delayed Data Processing Mechanism

Delayed events are special out-of-order events. Different from common out-of-order events, their out-of-order degree exceeds that watermarks can predict. As a result, the window is closed before they arrive.

In this case, you can use any of the following methods to solve the problem:

  • Reactivate the closed windows and recalculate to correct the results.

  • Collect the delayed events and process them separately.

  • Consider delay events as error messages and discard them.

By default, Flink discards the third method and uses Side Output and Allowed Lateness instead.

Side Output Mechanism

In the Side Output mechanism, a delayed event can be placed in an independent data stream, which can be used as a by-product of the window calculation result so that users can obtain and perform special processing on the delay event.

Getting Delayed Data as a Side Output

After allowedLateness is set, delayed data can also trigger the window for output. Using the side output mechanism of Flink, we can obtain this delayed data in the following way:

Getting Delayed Data as a Side Output

Allowed Lateness Mechanism

Allowed Lateness allows you to specify a maximum allowed lateness. After the window is closed, Flink keeps the state of windows until their allowed lateness expires. During this period, the delayed events are not discarded, but window recalculation is triggered by default. Keeping the window state requires extra memory. If the ProcessWindowFunction API is used for window calculation, each delayed event may 

trigger a full calculation of the window, which costs a lot. Therefore, the allowed lateness should not be too long, and the number of delayed events should not be too many.

That's all, thanks!

The post is synchronized to: Huawei Cloud Computing Case

  • x
  • convention:

wissal
MVE Created Apr 28, 2022 15:48:20

Learn to learn, increase our knowledge!
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Apr 29, 2022 00:41:46 (0) (0)
Learning together  
RNT
Created Apr 28, 2022 16:27:16

interesting information, the principle is good - implementations still have room to grow
View more
  • x
  • convention:

Irshadhussain
Irshadhussain Created Apr 28, 2022 16:59:36 (0) (0)
 
Irshadhussain
Irshadhussain Created Apr 28, 2022 16:59:43 (0) (0)
 
olive.zhao
olive.zhao Created Apr 29, 2022 00:42:04 (0) (0)
Thanks!  
gabo.lr
MVE Created Apr 28, 2022 16:45:35

Good information!
View more
  • x
  • convention:

Ayeshaali
Ayeshaali Created Apr 28, 2022 16:58:55 (0) (0)
 
olive.zhao
olive.zhao Created Apr 29, 2022 00:42:20 (0) (0)
Thanks!  
Ayeshaali
Created Apr 28, 2022 16:58:47

Good
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Apr 29, 2022 00:42:14 (0) (0)
Thanks!  
MahMush
Moderator Author Created May 14, 2022 06:46:22

A Watermark Strategy can be used in two places in Flink applications: 1) directly on sources and 2) after non-source operations.
View more
  • x
  • convention:

Saqibaz
Created May 31, 2022 03:01:59

Good information
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Jun 1, 2022 03:25:09 (0) (0)
Thanks!  
KasimAbubakr
Created May 31, 2022 10:03:46

thank
View more
  • x
  • convention:

Suhail
Created Jun 1, 2022 07:04:20

Thank you for providing such valuable information.
View more
  • x
  • convention:

olive.zhao
olive.zhao Created Jun 6, 2022 08:25:39 (0) (0)
 

Comment

You need to log in to comment to the post Login | Register
Comment

Notice: To protect the legitimate rights and interests of you, the community, and third parties, do not release content that may bring legal risks to all parties, including but are not limited to the following:
  • Politically sensitive content
  • Content concerning pornography, gambling, and drug abuse
  • Content that may disclose or infringe upon others ' commercial secrets, intellectual properties, including trade marks, copyrights, and patents, and personal privacy
Do not share your account and password with others. All operations performed using your account will be regarded as your own actions and all consequences arising therefrom will be borne by you. For details, see " User Agreement."

My Followers

Login and enjoy all the member benefits

Login

Block
Are you sure to block this user?
Users on your blacklist cannot comment on your post,cannot mention you, cannot send you private messages.
Reminder
Please bind your phone number to obtain invitation bonus.
Information Protection Guide
Thanks for using Huawei Enterprise Support Community! We will help you learn how we collect, use, store and share your personal information and the rights you have in accordance with Privacy Policy and User Agreement.