Flink SQL: Unifying Stream and Batch Processing
2022-04-23 09:02:00 [Big Data Institute]
1 The Idea Behind Flink's Stream-Batch Unification
1.1 Bounded and Unbounded Streams
In Flink, batch processing is a special case of stream processing: a bounded stream is simply a stream with a finite number of events, while an unbounded stream has no defined end.
1.2 The Old Flink Architecture and Its Problems
1. From the perspective of Flink users (enterprise developers)
(1) During development, Flink SQL support was weak, and there were two underlying APIs to choose from (DataStream for streaming and DataSet for batch), sometimes forcing teams to maintain two sets of code.
(2) The two APIs had different semantics, different connector support, different failure-recovery strategies, and so on.
(3) The Table API was likewise affected by the differing underlying APIs, differing connectors, and so on.
2. From the perspective of Flink developers (the Flink community)
(1) Two different translation pipelines, different operator implementations, and different Task execution paths.
(2) Code was hard to reuse.
(3) Two independent technology stacks required more manpower: feature development slowed down, performance improvements became harder, and there were more bugs.
1.3 Flink's New Unified Stream-Batch Architecture
2 Flink's Layered APIs
1. API hierarchy
(1) Before Flink 1.12, Table API and SQL were still under active development and did not yet offer all features with unified stream-batch semantics, so they had to be used with care.
(2) Starting with Flink 1.12, Table API and SQL are mature and can be used in production with confidence.
2. How a Flink Table API/SQL program is executed
3 Stream-Batch Unification in Flink SQL
3.1 Relational Algebra and Stream Processing
The Table API and SQL that Flink provides at its upper layers are stream-batch unified: whether they are used for stream processing (unbounded streams) or batch processing (bounded streams), Table API and SQL queries have the same semantics.
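As a quick, hedged illustration of what "same semantics" means in practice: in the Flink SQL client, the same query text can be run over a bounded or an unbounded source just by switching the execution.runtime-mode option (a minimal sketch; the clicks table is the one defined later in this article, and its `user` column is backtick-escaped because USER is a reserved keyword in Flink SQL):
-- Run the query over bounded input (batch mode): one final result per user.
SET 'execution.runtime-mode' = 'batch';
SELECT `user`, COUNT(url) AS cnt FROM clicks GROUP BY `user`;
-- Run the identical query over unbounded input (streaming mode):
-- the result is continuously updated as new rows arrive.
SET 'execution.runtime-mode' = 'streaming';
SELECT `user`, COUNT(url) AS cnt FROM clicks GROUP BY `user`;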
As we all know, SQL was designed for the relational model and for batch processing, so SQL queries are difficult to implement and to reason about on streams. Let's start with a few concepts specific to stream processing to help understand how Flink executes SQL over streams.
Relational algebra (which here mainly refers to the tables of a relational database) and SQL were designed primarily for batch processing, so there is a natural gap between them and stream processing.
3.2 Understanding Dynamic Tables and Continuous Queries
To apply relational algebra (Table API/SQL) to stream processing, Flink introduces the concept of dynamic tables (Dynamic Tables).
The data that stream processing faces is an unbounded data stream, which is completely different from the "tables" we are familiar with in relational databases. One idea, therefore, is to convert the data stream into a Table and then execute SQL operations on it. The result of such a query is not fixed; it is continuously updated as new data arrives.
The table produced by continuously updating previous results as new data arrives is called a "dynamic table" (Dynamic Table) in Flink's Table API.
Dynamic tables are the core concept behind Flink's Table API and SQL support for streaming data. Unlike the static tables that represent batch data, dynamic tables change over time. A dynamic table can be queried just like a static batch table, but querying a dynamic table produces a continuous query (Continuous Query). A continuous query never terminates and produces another dynamic table as its result. The query (Query) continuously updates its dynamic result table to reflect the changes on its dynamic input table.
When a relational query is executed on a data stream, the conversion between data streams and dynamic tables proceeds in the following main steps (see the sketch after this list).
(1) The data stream is converted into a dynamic table.
(2) A continuous query is evaluated on the dynamic table, producing a new dynamic table.
(3) The resulting dynamic table is converted back into a new data stream.
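Expressed in Flink SQL, the three steps roughly correspond to the sketch below. The datagen source and print sink are stand-ins chosen only to keep the example self-contained; any real connector (Kafka, filesystem, ...) could take their place.
-- (1) Stream -> dynamic table: declare a table over an unbounded source.
CREATE TABLE clicks_src (
  `user` VARCHAR(8),
  url    VARCHAR(32),
  cTime  TIMESTAMP(3)
) WITH (
  'connector' = 'datagen',          -- stand-in source generating random rows
  'rows-per-second' = '5',
  'fields.user.length' = '4',
  'fields.url.length' = '16'
);
-- (3) Dynamic table -> stream: the print sink emits the changelog to stdout.
CREATE TABLE click_counts (
  `user` VARCHAR(8),
  cnt    BIGINT
) WITH (
  'connector' = 'print'
);
-- (2) The continuous query linking the two; on a stream it never terminates.
INSERT INTO click_counts
SELECT `user`, COUNT(url) AS cnt
FROM clicks_src
GROUP BY `user`;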
3.3 Dynamic Tables in Detail
1. Defining the dynamic table (a stream of click events)
CREATE TABLE clicks (
  `user` VARCHAR,      -- user name (`user` is a reserved keyword in Flink SQL, hence the backticks)
  url    VARCHAR,      -- the URL the user visited
  cTime  TIMESTAMP(3)  -- time of the visit
) WITH (...);
2. Converting the stream into a dynamic table
To execute a relational query, the data stream must first be converted into a dynamic table. In the figure below, the click stream is shown on the left and the dynamic table on the right: every new event on the stream corresponds to an insert operation on the dynamic table (besides insert there are other modes, which are covered later).
3. Continuous queries
Next, we execute a continuous query on the dynamic table, which produces a new dynamic table (the result table). A continuous query never stops: it keeps querying, computing, and updating the result table as new data arrives in the input table (there are several modes, covered later). Here we run a group-by count aggregation on the clicks dynamic table; over time, the dynamic result table on the right changes as each row arrives in the input table on the left, as shown in the query below.
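In SQL, this continuous query is the same per-user count that reappears in the query-limitations section below:
SELECT `user`, COUNT(url) AS cnt
FROM clicks
GROUP BY `user`;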
If we run the same group-by count aggregation on the clicks dynamic table but add a tumbling window, we can count each user's visits within one-hour tumbling windows. Over time, the dynamic result table on the right still changes with the data in the input table on the left, but the result of each window is independent, and the computation is triggered when each window ends.
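A sketch of this windowed variant using the TUMBLE group-window syntax; it assumes cTime has been declared as an event-time attribute (i.e. with a WATERMARK definition) in the schema of clicks:
SELECT `user`,
       TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,  -- end timestamp of each window
       COUNT(url) AS cnt
FROM clicks
GROUP BY `user`,
         TUMBLE(cTime, INTERVAL '1' HOUR);            -- one-hour tumbling windows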
3.4 Converting a Dynamic Table into a Data Stream
Like regular database tables, dynamic tables undergo continuous change through inserts (Insert), updates (Update), and deletes (Delete). When a dynamic table is converted into a stream or written to an external system, these changes need to be encoded. Flink's Table API and SQL support three ways of encoding the changes of a dynamic table.
1. Append-only stream (insert-only)
A dynamic table that is modified only by insert (INSERT) changes can be converted directly into an "append-only" stream. The records emitted in this stream are exactly the new rows appended to the dynamic table.
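For example, a stateless projection and filter over clicks only ever appends rows to its result, so it can be emitted as an append-only stream (a sketch; the URL pattern is illustrative):
SELECT `user`, url
FROM clicks
WHERE url LIKE '%/product%';  -- never updates or deletes previously emitted rows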
2. Retract stream (Retract Stream)
A dynamic table that supports inserts, updates, and deletes alike is converted into a retract stream.
A retract stream contains two types of messages: add (Add) messages and retract (Retract) messages.
A dynamic table is converted into a retract stream by encoding each INSERT as an add message, each DELETE as a retract message, and each UPDATE as a retract message for the changed (previous) row followed by an add message for the updated (new) row.
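A possible trace for the per-user count query above, written in the +I/-U/+U row-kind notation that Flink's print sink uses (the input rows are invented for illustration):
-- input event on clicks          retract stream emitted by the count query
-- (Mary, ./home, 12:00)      ->  +I (Mary, 1)   -- first Mary row: add message
-- (Bob,  ./cart, 12:01)      ->  +I (Bob, 1)    -- first Bob row: add message
-- (Mary, ./prod, 12:02)      ->  -U (Mary, 1)   -- retract the outdated count...
--                                +U (Mary, 2)   -- ...then add the updated count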
3. Upsert stream (Upsert Stream)
An upsert stream contains two types of messages: upsert messages and delete messages. A dynamic table that is converted into an upsert stream must have a unique key (key). By encoding INSERT and UPDATE changes as upsert messages and DELETE changes as delete messages, a dynamic table with a unique key (Unique Key) can be converted into a stream.
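A common way to materialize an upsert stream is a sink table with a declared primary key, for example Flink's upsert-kafka connector (a sketch; the topic name and broker address are illustrative assumptions):
CREATE TABLE click_counts_upsert (
  `user` VARCHAR,
  cnt    BIGINT,
  PRIMARY KEY (`user`) NOT ENFORCED  -- the unique key required by the upsert encoding
) WITH (
  'connector' = 'upsert-kafka',
  'topic' = 'click-counts',
  'properties.bootstrap.servers' = 'localhost:9092',
  'key.format' = 'json',
  'value.format' = 'json'
);
INSERT INTO click_counts_upsert
SELECT `user`, COUNT(url) FROM clicks GROUP BY `user`;
Compared with the retract encoding, an update here is a single message keyed by `user` rather than a retract-add pair, which is more compact but only works when a unique key exists.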
4. Query limitations
Continuous queries over unbounded data streams have limitations, mainly in the following two respects.
(1) Limits on state size
A continuous query over an unbounded stream may run for weeks, months, or even longer, so the amount of data it processes can be very large. In the earlier example of counting user visits, for instance, the query must maintain a visit count for every user. If only registered users are considered, the state will not grow too large; but if every unregistered visitor is assigned a unique user name, the state to be maintained becomes very large and may eventually cause the query to fail.
SELECT `user`, COUNT(url)
FROM clicks
GROUP BY `user`;
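One common mitigation, an aside not discussed in the original article, is to bound how long idle per-key state is kept via Flink's table.exec.state.ttl option; counts for keys that stay idle longer than the TTL are dropped, trading accuracy for bounded state:
-- Expire per-key state that has not been accessed for 24 hours.
SET 'table.exec.state.ttl' = '24 h';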
(2) Limits on the cost of computing updates
Some queries need to recompute and update a large share of the emitted result rows even when only a single input record is added or updated; such queries are not suitable to run as continuous queries. In the following example, a RANK is computed for each user based on the time of their last click. As soon as clicks receives a new row, that user's lastAction is updated and a new rank must be computed; and because two rows cannot share the same rank, all lower-ranked rows need to be updated as well.
SELECT `user`, RANK() OVER (ORDER BY lastAction)
FROM (
  SELECT `user`, MAX(cTime) AS lastAction FROM clicks GROUP BY `user`
);
Copyright notice
This article was written by [Big Data Institute]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/04/202204230723184774.html