当前位置:网站首页>HQL statement tuning
HQL statement tuning
2022-04-23 12:57:00 【Look at the data at the top of the mountain】
1、 De duplicating sentences ( use group by Instead of distinct)
Group by It also has the function of weight removal , The details are as follows
select distinct customer_id
from test_join_order;
=>
select customer_id
from test_join_order
group by customer_id;
In a huge amount of data ( And there are many duplicate values ) when , You can start with group by duplicate removal , stay count() Count , Efficiency is higher than count(distinct col)
create table if not exists test1(
id int
)
row format delimited
fields terminated by ',';
insert into test1 values(1);
insert into test1 values(2);
insert into test1 values(1);
insert into test1 values(3);
insert into test1 values(1);
insert into test1 values(2);
insert into test1 values(4);
select count(distinct(id))
from test1;
select count(t1.id)
from(
select id
from test1
group by id) t1;
2、 Aggregation skills —— Using window functions grouping sets、cube、rollup
This window function is the improvement and encapsulation of the previous one
group by quantity,sex,age
grouping sets(quantity,sex,age)group by quantity,sex,age
with cube;group by quantity,sex
with rollup;
-- Number distribution
select quantity,count(*)
from test2
group by quantity;
2 1
3 2
4 1
5 1
7 2
-- Gender distribution
select sex,count(*)
from test2
group by sex;
0 3
1 4
-- Age distribution
select age,count(*)
from test2
group by age;
16 2
19 1
21 1
23 1
34 2
Shortcomings should be written three times SQL, It needs to be performed three times , Repeat the work , And trouble
-- An optimization method ( The aggregation results are all in the same column , Classification fields are partitioned by different columns )
select quantity,sex,age,count(*)
from test2
group by quantity,sex,age
grouping sets(quantity,sex,age)
NULL NULL 16 2
NULL NULL 19 1
NULL NULL 21 1
NULL NULL 23 1
NULL NULL 34 2
NULL 0 NULL 3
NULL 1 NULL 4
2 NULL NULL 1
3 NULL NULL 2
4 NULL NULL 1
5 NULL NULL 1
7 NULL NULL 2
Is to put the results of three columns together
Here are the test data
create table if not exists test2(
quantity int,
sex int,
age int
)
row format delimited
fields terminated by ',';
insert into test2 values(5,1,16);
insert into test2 values(3,0,23);
insert into test2 values(4,1,16);
insert into test2 values(7,0,34);
insert into test2 values(2,0,19);
insert into test2 values(7,1,21);
insert into test2 values(3,1,34);
cube: according to group by Dimensional ( all ) Combine to aggregate
select quantity,sex,age,count(*)
from test2
group by quantity,sex,age
grouping sets(quantity,sex,age,(sex,age),(quantity,sex),(quantity,age),(quantity,sex,age));
(sex,age) It's the same sex and age How many of them are there
The above can be written as
select quantity,sex,age,count(*)
from test2
group by quantity,sex,age
with cube;
cube
Britain [kjuːb] beautiful [kjuːb]
n.
Cube ; Cuboid ; Something cubic ( Especially food ); cube ; third power
vt.
seek … The cube of ; hold ( food ) Cut into cubes
rollup It is mainly used for hierarchical statistics , This is equivalent to the effect of summarizing each layer
select quantity ,sex ,sum(age)
from test2
group by quantity,sex
with rollup;
2 NULL 19
2 0 19
3 NULL 57
3 0 23
3 1 34
4 NULL 16
4 1 16
5 NULL 16
5 1 16
7 NULL 55
7 0 34
7 1 21
3、 Table join optimization
1、 The watch is in front of , The big watch is at the back
hive Suppose that the last one in the query is a large table , It caches other tables , Then scan the last watch
2、 Use the same connection to build
When the 3 A table or more join When the connection , If each on The clauses all use the same join key , So there's only one MR job
3、 Filter data as early as possible
Reduce the amount of data in each stage , Partition the partition table , At the same time, only the fields that need to be used
When logic is too complex , Import intermediate table
4、 How to solve data skew
1、 Data skew performance :
The task progress is maintained for a long time 99%(100%), View the task monitoring page , Found that only a small amount (1 One or more )reduce Subtask not completed . Because of the amount of data it processes and other reduce The difference is too big
2、 Causes and solutions of data skew :
1、 Data skew caused by null value
solve : If two tables are connected , The connection conditions used have many null values , It is suggested to add filtering in the connection conditions
eg:on a.user_id = b.user_id and a.user_id is not null
2、 Size table connection ( One of the tables is big , The other table is very small )
solve : Put the small table in memory , stay map End to do join
select /*+mapjoin(a)*/ a.user_id,a.user_name
from user_info a
join join_order b
where b.customer_id = a.customer_id;
The method of putting data into memory in advance
/*+mapjoin(a)*/
3、 The field data types of the two table connection conditions are inconsistent
solve : Convert the field data type of the connection condition to a consistent
on a.user_id = cast(b.user_id as string)
版权声明
本文为[Look at the data at the top of the mountain]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204230615025280.html
边栏推荐
- C#,二维贝塞尔拟合曲线(Bézier Curve)参数点的计算代码
- Recovering data with MySQL binlog
- Trier les principales utilisations de l'Agent IP réseau
- 拥抱机器视觉新蓝海,冀为好望开启数字经济发展新“冀”遇
- 如何实现点击一下物体播放一次动画
- [Blue Bridge Cup] April 17 provincial competition brushing training (the first three questions)
- 标签与路径
- Deploying MySQL in cloud native kubesphere
- BUUCTF WEB [BUUCTF 2018]Online Tool
- leetcode:437. 路径总和 III【dfs 选还是不选?】
猜你喜欢
Deploying MySQL in cloud native kubesphere
Date time type in database
8 websites that should be known for product development to enhance work experience
解决disagrees about version of symbol device_create
STM32 project transplantation: transplantation between chip projects of different models: Ze to C8
Free and open source intelligent charging pile SaaS cloud platform of Internet of things
Kubernets Getting started tutoriel
Remote sensing image classification and recognition system based on convolutional neural network
Idea的src子文件下无法创建servlet
Record the problems encountered in using v-print
随机推荐
如何防止网站被黑客入侵篡改
航芯技术分享 | ACM32 MCU安全特性概述
Byte jump 2020 autumn recruitment programming question: quickly find your own ranking according to the job number
Unable to create servlet under SRC subfile of idea
Realize several "Postures" in which a box is horizontally and vertically centered in the parent box
标签与路径
C#,二维贝塞尔拟合曲线(Bézier Curve)参数点的计算代码
Sort out several uses of network IP agent
NPDP|产品经理如何做到不会被程序员排斥?
数据库中的日期时间类型
Buuctf Web [bjdctf2020] zjctf, but so
Resolve disagrees about version of symbol device_ create
PC starts multiple wechat at one time
Free and open source agricultural Internet of things cloud platform (version: 3.0.1)
Packet capturing and sorting -- TCP protocol [8]
MySQL supports IP access
Kubernetes 入門教程
Synchronously update the newly added and edited data to the list
云原生KubeSphere部署Mysql
使用Source Insight查看编辑源代码