当前位置：网站首页>HQL statement tuning

HQL statement tuning

2022-04-23 12:57:00 【Look at the data at the top of the mountain】

1、 De duplicating sentences ( use group by Instead of distinct)

Group by It also has the function of weight removal , The details are as follows

select distinct customer_id
from test_join_order;
=>
select customer_id
from test_join_order
group by customer_id;


 In a huge amount of data ( And there are many duplicate values ) when , You can start with group by duplicate removal , stay count() Count , Efficiency is higher than count(distinct col)



create table if not exists test1(
id int
)
row format delimited
fields terminated by ',';

insert into test1 values(1);
insert into test1 values(2);
insert into test1 values(1);
insert into test1 values(3);
insert into test1 values(1);
insert into test1 values(2);
insert into test1 values(4);

select count(distinct(id))
from test1;

select count(t1.id)
from(
select id 
from test1
group by id) t1;

2、 Aggregation skills —— Using window functions grouping sets、cube、rollup

This window function is the improvement and encapsulation of the previous one

group by quantity,sex,age
grouping sets(quantity,sex,age)

group by quantity,sex,age
with cube;

group by quantity,sex
with rollup;

--  Number distribution 
select quantity,count(*) 
from test2
group by quantity;
2       1
3       2
4       1
5       1
7       2

--  Gender distribution 
select sex,count(*) 
from test2
group by sex;
0       3
1       4

--  Age distribution 
select age,count(*) 
from test2
group by age;
16      2
19      1
21      1
23      1
34      2

 Shortcomings should be written three times SQL, It needs to be performed three times , Repeat the work , And trouble 

--  An optimization method ( The aggregation results are all in the same column , Classification fields are partitioned by different columns )

select quantity,sex,age,count(*)
from test2
group by quantity,sex,age
grouping sets(quantity,sex,age)

NULL    NULL    16      2
NULL    NULL    19      1
NULL    NULL    21      1
NULL    NULL    23      1
NULL    NULL    34      2
NULL    0       NULL    3
NULL    1       NULL    4
2       NULL    NULL    1
3       NULL    NULL    2
4       NULL    NULL    1
5       NULL    NULL    1
7       NULL    NULL    2

 Is to put the results of three columns together 

 Here are the test data 
create table if not exists test2(
quantity int,
sex int,
age int
)
row format delimited
fields terminated by ',';

insert into test2 values(5,1,16);
insert into test2 values(3,0,23);
insert into test2 values(4,1,16);
insert into test2 values(7,0,34);
insert into test2 values(2,0,19);
insert into test2 values(7,1,21);
insert into test2 values(3,1,34);

cube: according to group by  Dimensional ( all ) Combine to aggregate 

select quantity,sex,age,count(*)
from test2
group by quantity,sex,age
grouping sets(quantity,sex,age,(sex,age),(quantity,sex),(quantity,age),(quantity,sex,age));

(sex,age) It's the same sex and age How many of them are there 

 The above can be written as 
select quantity,sex,age,count(*)
from test2
group by quantity,sex,age
with cube;

cube
 Britain  [kjuːb]    beautiful  [kjuːb]  
n.
 Cube ; Cuboid ; Something cubic ( Especially food ); cube ; third power 
vt.
 seek … The cube of ; hold ( food ) Cut into cubes 

rollup It is mainly used for hierarchical statistics , This is equivalent to the effect of summarizing each layer 

select quantity ,sex ,sum(age)
from test2
group by quantity,sex
with rollup;

2       NULL    19
2       0       19
3       NULL    57
3       0       23
3       1       34
4       NULL    16
4       1       16
5       NULL    16
5       1       16
7       NULL    55
7       0       34
7       1       21

3、 Table join optimization

1、 The watch is in front of , The big watch is at the back 
	hive Suppose that the last one in the query is a large table , It caches other tables , Then scan the last watch 

2、 Use the same connection to build 
	 When the 3 A table or more join When the connection , If each on The clauses all use the same join key , So there's only one MR job
	
3、 Filter data as early as possible 
	 Reduce the amount of data in each stage , Partition the partition table , At the same time, only the fields that need to be used 
	 When logic is too complex , Import intermediate table

4、 How to solve data skew

1、 Data skew performance ：
	 The task progress is maintained for a long time 99%（100%）, View the task monitoring page , Found that only a small amount （1 One or more ）reduce Subtask not completed . Because of the amount of data it processes and other reduce The difference is too big 
	
2、 Causes and solutions of data skew ：
	1、 Data skew caused by null value 
	 solve ： If two tables are connected , The connection conditions used have many null values , It is suggested to add filtering in the connection conditions 
	eg:on a.user_id = b.user_id and a.user_id is not null
	2、 Size table connection （ One of the tables is big , The other table is very small ）
	 solve ： Put the small table in memory , stay map End to do join
	select /*+mapjoin(a)*/ a.user_id,a.user_name
	 from user_info a
	 join join_order b
	 where b.customer_id = a.customer_id;
	 
	  The method of putting data into memory in advance 
	 /*+mapjoin(a)*/
	 
	 3、 The field data types of the two table connection conditions are inconsistent 
	  solve ： Convert the field data type of the connection condition to a consistent 
	 
	 on a.user_id = cast(b.user_id as string)

版权声明
本文为[Look at the data at the top of the mountain]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/04/202204230615025280.html

当前位置：网站首页>HQL statement tuning

HQL statement tuning

边栏推荐

猜你喜欢

随机推荐