Hello, everyone!
Introduction to Flink Time
In stream processing, handling time correctly is critical. Event stream data (such as server log data, web page click data, and transaction data) is generated continuously. A typical task is to group events by key and count the events for each key over a specified interval. This is a classic "big data" application.
Time Classification in Stream Processing
During stream processing, the system time of the processing machine (processing time) is often used as the time of an event. However, processing time is a timestamp the system imposes on the event: because of network delay and other factors, it cannot reliably reflect the order in which events actually occurred.
In actual cases, the time of each event can be classified into the following types:
Event time: The time when an event occurs;
Ingestion time: The time when an event arrives at the stream processing system;
Processing time: The time when an event is processed by the system;
For example, suppose a log enters Flink at 2019-11-02 10:37.123 and reaches the window operator at 2019-11-02 10:37.234, and the log content is as follows:
2019-11-02 10:37.624 INFO Fail over to rm2
2019-11-02 10:37.624 is the event time.
2019-11-02 10:37.123 is the ingestion time.
2019-11-02 10:37.234 is the processing time.
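Of the three timestamps, only the event time is carried in the record itself; the other two are attached by the pipeline. A minimal Python sketch of pulling the event time out of the sample log line above (the `HH:MM.mmm` timestamp format, with no seconds field, is taken from the sample as written):

```python
from datetime import datetime

# Sample log record from the text; its leading timestamp is the *event time*.
log_line = "2019-11-02 10:37.624 INFO Fail over to rm2"

def extract_event_time(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM.mmm' timestamp embedded in the
    record itself (no seconds field, matching the sample log's format)."""
    stamp = " ".join(line.split(" ", 2)[:2])  # "2019-11-02 10:37.624"
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M.%f")

print(extract_event_time(log_line))  # 2019-11-02 10:37:00.624000
```

Ingestion time and processing time, by contrast, would be read from the clock of the machine at the moment the record enters the system or is processed.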
Differences Between the Three Times
In practice, the order in which events arrive differs from the order in which they occurred, because of network delay and processing latency. See the following figure.
The horizontal axis indicates the event time, and the vertical axis indicates the processing time. Ideally, the two would coincide, and each event's coordinates would fall on a line with a tilt angle of 45 degrees. In reality, the processing time lags the event time by a varying amount, so events may be processed out of their original order.
Time Semantics Supported by Flink
Table: Differences between processing time and event time
| Processing Time | Event Time (Row Time) |
| --- | --- |
| Time of the real world | Time when the data is generated |
| Local time on the node that processes the data | Timestamp carried in the record |
| Simple to handle | More complex to handle |
| Uncertain results (cannot be reproduced) | Deterministic results (reproducible) |
For most streaming applications, it is valuable to be able to reprocess historical data and deterministically produce results consistent with those of real-time processing, using the same code.
It is also critical to attend to the order in which events occurred rather than the order in which they are processed, and to be able to reason about when a set of events is (or should be) complete. For example, consider the series of events involved in an e-commerce or financial transaction. These requirements can be met by using the event timestamps recorded in the data stream instead of the clock of the machine that processes the data.
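The determinism claim in the table can be made concrete. In this hedged sketch (illustrative timestamps only, not a Flink API), counting events per 1-minute window by event time gives the same answer on every run, while counting by processing time changes when network delay shifts an event's arrival across a window boundary:

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute windows

def window_counts(events, key):
    """Count events per 1-minute window, keyed by the chosen timestamp."""
    counts = defaultdict(int)
    for e in events:
        counts[key(e) // WINDOW_MS] += 1
    return dict(counts)

# Each event: (event_time_ms, processing_time_ms). The second run sees the
# same events, but the last one is delayed past the minute boundary.
run1 = [(59_000, 59_500), (59_900, 59_950)]
run2 = [(59_000, 59_500), (59_900, 60_100)]  # same events, more delay

# Event-time results are identical across runs (reproducible)...
assert window_counts(run1, key=lambda e: e[0]) == window_counts(run2, key=lambda e: e[0])
# ...while processing-time results differ, since they depend on arrival time.
assert window_counts(run1, key=lambda e: e[1]) != window_counts(run2, key=lambda e: e[1])
```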
Introduction to Flink Windows
A streaming system is a data processing engine designed to process infinite data sets, that is, data sets that grow continuously. A window is a method for splitting an infinite data set into finite blocks for processing.
Windows are at the heart of processing infinite streams. Windows split the stream into "buckets" of finite size, over which we can apply computations.
Window Types
Windows can be classified into two types:
Count Window: Data-driven. A window is generated once a specified number of records has arrived, regardless of time.
Time Window: Time-driven. A window is generated based on time.
Apache Flink is a distributed computing framework that supports infinite stream processing. In Flink, Window can divide infinite streams into finite streams.
Time Window Types
According to the window implementation principles, Time Window can be classified into tumbling window, sliding window, and session window.
1. Tumbling Window
The stream is sliced into windows of a fixed length. Characteristics: the windows are time-aligned, the window length is fixed, and the windows do not overlap.
Application scenario: BI statistics collection (aggregation calculation in each time period)
For example, assume that you want to sum the values emitted by a sensor. A tumbling window of 1 minute collects the values of the last minute and emits their sum at the end of the minute.
See the following figure.
2. Sliding Window
The sliding window is a more general form of the tumbling (fixed) window. It is defined by a fixed window length and a sliding interval. Characteristics: the windows are time-aligned, the window length is fixed, and the windows can overlap.
Application scenario: collecting statistics on the failure rate of an interface over the past 5 minutes to determine whether to report an alarm.
A sliding window of 1 minute that slides every half minute counts the values of the last minute, emitting the result every half minute. See the following figure. In the first sliding window, the values 9, 6, 8, and 4 are summed, yielding the result 27. The window then slides by half a minute and the values 8, 4, 7, and 3 are summed, yielding the result 22, and so on.
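The difference between the two window types comes down to how a timestamp is assigned to windows. A hedged Python sketch (window boundaries in seconds; the function names are illustrative, not Flink's): a timestamp falls into exactly one tumbling window, but into `size / slide` overlapping sliding windows.

```python
def tumbling_windows(ts, size):
    """A timestamp belongs to exactly one tumbling window of length `size`."""
    start = ts - ts % size
    return [(start, start + size)]

def sliding_windows(ts, size, slide):
    """A timestamp belongs to every sliding window of length `size`,
    starting every `slide`, that still covers it (size/slide windows)."""
    wins = []
    start = ts - ts % slide         # most recent window start
    while start > ts - size:        # windows whose span still covers ts
        wins.append((start, start + size))
        start -= slide
    return wins

# 1-minute tumbling window: second 75 belongs to one window.
print(tumbling_windows(75, size=60))           # [(60, 120)]
# 1-minute window sliding every 30 s: second 75 is in two windows,
# which is why values like 8 and 4 above are counted in both sums.
print(sliding_windows(75, size=60, slide=30))  # [(60, 120), (30, 90)]
```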
3. Session Window
The session window aggregates a burst of high activity within a period into one window for calculation. The trigger condition is the session gap: if no data is received within the specified gap, the window ends and its calculation is triggered. Unlike the tumbling window and sliding window, the session window requires no fixed slide value or window size; you only set the session gap, that is, the upper limit of the inactivity duration, to trigger the window calculation.
Code Definition
A tumbling time window of 1 minute can be defined in Flink simply as:
stream.timeWindow(Time.minutes(1));
A sliding time window of 1 minute that slides every 30 seconds can be defined just as simply:
stream.timeWindow(Time.minutes(1), Time.seconds(30));
Assume that a user performs a series of operations (such as selecting, browsing, searching, purchasing, and switching tabs) after opening an app on a mobile phone. These operations and their occurrence times are sent to the server for user behavior analysis. The user's behavior falls into segments, and the behaviors within a segment are continuous: they are far more correlated with one another than with behaviors in other segments. Each segment of user behavior is called a session, and the gap between segments is called the session gap. Therefore, we should divide the user's behaviors by session window and calculate the result of each session.
In this way, Flink automatically places elements in different session windows based on their timestamps. If the timestamp interval between two elements is less than the session gap, they belong to the same session. If the interval is greater than the session gap and no other element falls within that gap, they are placed in different sessions.
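The session-gap rule just described can be sketched as follows, a hedged illustration (plain Python over sorted timestamps, not Flink's merging-window implementation): a new session starts whenever the gap to the previous event exceeds the configured session gap.

```python
def session_windows(timestamps, gap):
    """Split sorted event timestamps into sessions: a new session starts
    whenever the gap since the previous event exceeds `gap`."""
    sessions = []
    current = []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > gap:
            sessions.append(current)  # inactivity exceeded -> close session
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

# With a session gap of 5, the stream below forms two sessions.
print(session_windows([1, 2, 4, 15, 16], gap=5))  # [[1, 2, 4], [15, 16]]
```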
That's all, thanks!