This post explains streaming SQL. Please find more details below.
1. Streaming SQL principles and architecture
Streaming SQL is usually an SQL-like declarative language used mainly for continuous queries over streaming data (streams). It builds an SQL abstraction layer on top of the APIs of common stream computing platforms and frameworks (such as Storm, Spark Streaming, Flink, and Beam), lowering the barrier to real-time development with a simple and familiar SQL language.
The principle behind streaming SQL is actually quite simple: it builds a bridge between SQL and the underlying stream computing engine. The user submits streaming SQL, the SQL engine layer translates it into the underlying API, and the job is executed on the underlying stream computing engine. For Storm, for example, the SQL is automatically translated into a Storm task topology that runs on the Storm cluster.
The streaming SQL engine is the core of streaming SQL. It is responsible for parsing the user's SQL input, semantic analysis, logical plan generation and optimization, and physical execution plan generation; the actual computation is carried out by the underlying stream computing platform.
Unlike offline tasks, real-time data flows in continuously, so to abstract stream processing with SQL, streaming SQL also introduces the concept of a 'table'; here, however, the table is a dynamic table.
The overall flow of streaming SQL is: user SQL → parsing → semantic analysis → logical plan generation and optimization → physical execution plan → execution on the underlying stream computing engine.
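To make the dynamic-table idea concrete, here is a minimal sketch using Flink's Scala Table API (a Flink 1.x setup is assumed; the socket source, port, and column names are illustrative, not part of the original post). An unbounded stream is registered as a dynamic table, and the SQL engine translates the continuous query into a streaming job that runs on the Flink engine, with the result updating as new rows arrive.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api._
import org.apache.flink.table.api.bridge.scala._

object StreamingSqlSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val tableEnv = StreamTableEnvironment.create(env)

    // Stand-in source: lines of "page,visits" arriving continuously on a socket.
    val pageViews: DataStream[(String, Int)] =
      env.socketTextStream("localhost", 9999)
        .map { line =>
          val Array(page, visits) = line.split(",")
          (page, visits.trim.toInt)
        }

    // Register the unbounded stream as a dynamic table.
    tableEnv.createTemporaryView("page_views", pageViews, $"page", $"visits")

    // A continuous query: the SQL engine translates it into a streaming job
    // whose result table keeps updating as new rows arrive.
    val totals = tableEnv.sqlQuery(
      "SELECT page, SUM(visits) AS total FROM page_views GROUP BY page")

    // Emit the updating result as a retract stream and run the translated job.
    tableEnv.toRetractStream[(String, Int)](totals).print()
    env.execute("streaming-sql-sketch")
  }
}
```

The same query over a static table would return once; over a dynamic table it never terminates and its result keeps changing as new data flows in.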
2. Streaming SQL: the main real-time development technology now and in the future
Progress on and support for streaming SQL differ across computing frameworks. Storm SQL is still only an experimental feature, while Flink SQL is the core API that Flink actively promotes. Flink is a native open source stream computing engine, and no other open source stream computing engine currently offers a better streaming SQL framework, syntax, and so on, so Flink SQL is in effect defining the de facto standard for streaming SQL.
3. DataFrame and SQL Operations
You can easily use DataFrames and SQL operations on streaming data. You have to create a SparkSession using the SparkContext that the StreamingContext is using. Furthermore, this has to be done so that it can be restarted on driver failures. This is done by creating a lazily instantiated singleton instance of SparkSession. This is shown in the following example. It modifies the earlier word count example to generate word counts using DataFrames and SQL. Each RDD is converted to a DataFrame, registered as a temporary table and then queried via SQL.
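The example referred to above is not reproduced here, so below is a minimal sketch in Scala, modeled on the Spark Streaming programming guide's word count; the socket source, host, and port are illustrative assumptions.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SqlWordCount {

  // Lazily instantiated singleton SparkSession, so it can be re-obtained after a driver restart.
  object SparkSessionSingleton {
    @transient private var instance: SparkSession = _
    def getInstance(sparkConf: SparkConf): SparkSession = {
      if (instance == null) {
        instance = SparkSession.builder.config(sparkConf).getOrCreate()
      }
      instance
    }
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SqlWordCount")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Words from a socket source; host and port are illustrative.
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))

    words.foreachRDD { rdd =>
      // Get the singleton SparkSession using the SparkContext's configuration.
      val spark = SparkSessionSingleton.getInstance(rdd.sparkContext.getConf)
      import spark.implicits._

      // Convert RDD[String] to DataFrame, register it as a temporary view,
      // and run the word count as an SQL query.
      val wordsDataFrame = rdd.toDF("word")
      wordsDataFrame.createOrReplaceTempView("words")
      val wordCounts =
        spark.sql("SELECT word, COUNT(*) AS total FROM words GROUP BY word")
      wordCounts.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The lazily created singleton SparkSession matters for recovery: after a driver restart, the foreachRDD closure re-obtains a session instead of relying on one captured before the failure.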
You can also run SQL queries on tables defined on streaming data from a different thread (that is, asynchronous to the running StreamingContext). Just make sure that you set the StreamingContext to remember a sufficient amount of streaming data so that the query can run. Otherwise the StreamingContext, which is unaware of any asynchronous SQL queries, will delete old streaming data before the query can complete. For example, if you want to query the last batch, but your query can take 5 minutes to run, then call streamingContext.remember(Minutes(5)) (in Scala, or the equivalent in other languages).
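For example, assuming the StreamingContext `ssc` from the sketch above, the call looks like this:

```scala
import org.apache.spark.streaming.Minutes

// In the streaming application (before ssc.start()): keep at least the last
// 5 minutes of generated RDDs so that an SQL query started asynchronously
// from another thread can still read the batch data it needs.
ssc.remember(Minutes(5))
```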