
Good code in the eyes of a compiler engineer: Loop Interchange

2022-08-05 14:56:00 【51CTO】

Abstract: Using Loop Interchange as an example, this article describes how to write code that achieves better performance.

This article is shared from the Huawei Cloud community post "Good code in the eyes of a compiler engineer (1): Loop Interchange", by Bi Sheng's assistant.

Editor's note: When C/C++ code is compiled, the compiler translates the source code into a sequence of instructions that the CPU recognizes and generates executable code, so the efficiency of the final program depends on the executable code the compiler generates. In most cases, how far the compiler can optimize a program is already determined when the source code is written. Even minor changes to the source can significantly affect how efficiently the generated code runs. Source-level optimization can therefore help the compiler generate more efficient executable code to some extent.


1. Basic concepts related to Loop Interchange

1.1 Access locality

In computer science, access locality (locality of reference) is the tendency of an application to access memory locations that are close to each other. This locality is a predictable behavior of computer systems, and it can be exploited for performance optimization. Access locality has three basic forms: temporal locality, spatial locality, and sequential locality.

The Loop Interchange discussed in this article mainly exploits spatial locality. Spatial locality means that memory locations near a recently referenced location are likely to be used soon. This is common in loops over arrays: if the 3rd element is used in one iteration, the next iteration will very likely use the 4th element; when it does, it hits the cache data that was prefetched by the previous iteration.

So for loops over arrays, spatial locality can be exploited by ensuring that the array elements accessed in two adjacent iterations are close together in memory. In other words, the smaller the stride when looping over the elements of an array, the better the performance is likely to be.
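To make the effect of stride concrete, the following is a minimal sketch that counts how many distinct cache lines a sequence of accesses touches. The function name `lines_touched` is hypothetical, and the sketch assumes the array starts on a cache-line boundary; the 64-byte line size used in the note below is an assumption as well, since real line sizes depend on the CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct cache lines touched by `n` accesses that start at byte
 * offset 0 and advance by `stride_elems` elements of `elem_size` bytes each.
 * `line_size` is the (assumed) cache-line size in bytes. */
size_t lines_touched(size_t n, size_t stride_elems, size_t elem_size,
                     size_t line_size)
{
    assert(stride_elems > 0 && elem_size > 0 && line_size > 0);
    size_t count = 0;
    size_t last_line = (size_t)-1;          /* sentinel: no line seen yet */
    for (size_t k = 0; k < n; k++) {
        size_t line = (k * stride_elems * elem_size) / line_size;
        if (line != last_line) {            /* offsets only grow, so a line never recurs */
            count++;
            last_line = line;
        }
    }
    return count;
}
```

With 4-byte elements and 64-byte lines, 1024 accesses at stride 1 touch 64 lines, while 1024 accesses at stride 1024 touch 1024 lines, one new line per access.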

So, how are arrays stored in memory?

1.2 Row-major and Column-major

Row-major and Column-major are two ways of storing multidimensional arrays in linear storage. The elements of an array are contiguous in memory; in row-major ordering the consecutive elements of a row are next to each other in memory, while in column-major ordering the consecutive elements of a column are adjacent, as shown in the figure below.

[Figure: row-major versus column-major ordering]

Although the names Row and Column look as if they refer specifically to two-dimensional arrays, row-major and column-major generalize to arrays of any dimension.
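The two orderings can be written down as index formulas. The sketch below is illustrative only; the function names are hypothetical and not taken from any library. It computes the linear element offset of `a[i][j]` under each ordering, plus the row-major offset for an arbitrary number of dimensions, where the last index varies fastest.

```c
#include <assert.h>
#include <stddef.h>

/* Linear element offset of a[i][j] in a rows x cols array, row-major. */
size_t row_major_offset(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return i * cols + j;
}

/* Linear element offset of the same element, column-major. */
size_t col_major_offset(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return j * rows + i;
}

/* Row-major offset for any number of dimensions (Horner form):
 * the last index has stride 1. */
size_t row_major_offset_nd(const size_t *idx, const size_t *dims, size_t ndims)
{
    size_t off = 0;
    for (size_t d = 0; d < ndims; d++) {
        assert(idx[d] < dims[d]);
        off = off * dims[d] + idx[d];
    }
    return off;
}
```

For a 3 x 4 array, element (1, 2) sits at offset 6 in row-major order and at offset 7 in column-major order.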

So which ordering do C/C++ arrays use?

Here is a small example that uses the cachegrind tool to compare the CPU cache miss rates of two different access patterns in C.

Row-wise access:

       
    #include <stdio.h>

    int main(){
        size_t i, j;
        const size_t dim = 1024;
        int matrix[dim][dim];

        /* the inner loop walks along a row: consecutive addresses */
        for (i = 0; i < dim; i++)
            for (j = 0; j < dim; j++)
                matrix[i][j] = i * j;
        return 0;
    }

[Figure: cachegrind output for row-wise access]

Column-wise access:

       
    #include <stdio.h>

    int main(){
        size_t i, j;
        const size_t dim = 1024;
        int matrix[dim][dim];

        /* the inner loop walks down a column: each access jumps a whole row */
        for (i = 0; i < dim; i++){
            for (j = 0; j < dim; j++){
                matrix[j][i] = i * j;
            }
        }
        return 0;
    }

[Figure: cachegrind output for column-wise access]

Comparing the cache miss rates of the two access patterns on the same array in the C code above shows that in C/C++ arrays are stored in row-major order. That is, if the previous step accessed a[i][j], then the access to a[i][j+1] is likely to hit the cache, so no access to main memory is needed. Because the cache is faster than main memory, following the storage layout of the programming language so that accesses hit the cache can improve performance.

As for other commonly used programming languages, Fortran, MATLAB, and others default to column-major.
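The layout can also be checked directly, without a profiler. The following sketch measures the distance in elements between two entries of a small two-dimensional C array; the array `a`, its dimensions, and the helper `element_distance` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 4, COLS = 8 };     /* small hypothetical dimensions */
static int a[ROWS][COLS];

/* Distance, in int elements, from a[i][j] to a[i2][j2] in the underlying
 * storage. The subtraction is done on char pointers into the whole object. */
ptrdiff_t element_distance(size_t i, size_t j, size_t i2, size_t j2)
{
    assert(i < ROWS && i2 < ROWS && j < COLS && j2 < COLS);
    ptrdiff_t bytes = (char *)&a[i2][j2] - (char *)&a[i][j];
    return bytes / (ptrdiff_t)sizeof(int);
}
```

Moving one step along a row, from `a[0][0]` to `a[0][1]`, gives a distance of 1, while moving one step down a column, from `a[0][0]` to `a[1][0]`, gives a distance of `COLS`, which is what row-major storage predicts.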

1.3 Loop Interchange

Loop Interchange takes advantage of the system's tendency to access values that are close together in memory and of the row-major layout of C/C++: by changing the execution order of two loops in a loop nest, it increases the overall spatial locality of the code. It can also enable other important code transformations; for example, Loop Reordering is the optimization that extends Loop Interchange to reordering more than two loops. In LLVM, Loop Interchange has to be enabled with the -mllvm -enable-loopinterchange option.

2. Optimization examples

2.1 Basic scenario

Consider the following matrix operation:

Original code:

       
    for(int i = 0; i < 2048; i++) {
        for(int j = 0; j < 1024; j++) {
            for(int k = 0; k < 1024; k++) {
                C[i * 1024 + j] += A[i * 1024 + k] * B[k * 1024 + j];
            }
        }
    }

After applying Loop Interchange to the j and k loops:

       
    for(int i = 0; i < 2048; i++) {
        for(int k = 0; k < 1024; k++) {
            for(int j = 0; j < 1024; j++) {
                C[i * 1024 + j] += A[i * 1024 + k] * B[k * 1024 + j];
            }
        }
    }

In the original code, on each iteration of the innermost k loop the element of C being accessed does not change, and A is accessed with a stride of 1 and is very likely to hit the cache, but B is accessed with a stride of 1024 and misses the cache almost every time.

After Loop Interchange, j is the innermost loop. On each iteration the element of A being accessed does not change, while C and B are both accessed with a stride of 1 and are very likely to hit the cache, so the cache hit rate increases greatly.
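The fragments above leave out the declarations of A, B, and C. The following is a self-contained sketch of both loop orders, together with a check that they compute the same result. The function names and the small matrix sizes in the check are assumptions made for illustration; they are not the 2048 x 1024 setup measured in this article.

```c
#include <assert.h>
#include <stddef.h>

/* C (m x n) += A (m x p) * B (p x n), all row-major, loop order i-j-k. */
void matmul_ijk(int *C, const int *A, const int *B,
                size_t m, size_t n, size_t p)
{
    assert(C && A && B);
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < p; k++)   /* B is read with stride n */
                C[i * n + j] += A[i * p + k] * B[k * n + j];
}

/* Same computation with the j and k loops interchanged (order i-k-j). */
void matmul_ikj(int *C, const int *A, const int *B,
                size_t m, size_t n, size_t p)
{
    assert(C && A && B);
    for (size_t i = 0; i < m; i++)
        for (size_t k = 0; k < p; k++)
            for (size_t j = 0; j < n; j++)   /* C and B are accessed with stride 1 */
                C[i * n + j] += A[i * p + k] * B[k * n + j];
}

/* Returns 1 when both loop orders give identical results on small test data. */
int orders_agree(void)
{
    enum { M = 4, N = 3, P = 5 };
    int A[M * P], B[P * N], C1[M * N] = {0}, C2[M * N] = {0};
    for (int x = 0; x < M * P; x++) A[x] = x % 7 - 3;
    for (int x = 0; x < P * N; x++) B[x] = x % 5 - 2;
    matmul_ijk(C1, A, B, M, N, P);
    matmul_ikj(C2, A, B, M, N, P);
    for (int x = 0; x < M * N; x++)
        if (C1[x] != C2[x]) return 0;
    return 1;
}
```

Interchanging j and k keeps the order in which terms are added into each individual element of C (k still increases from 0), so only the memory access pattern changes.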

So does the cache hit rate really improve, and how do the two versions compare in performance?

Original code:

       
    $ time -p ./a.out

[Figure: time output for the original code]

       
    $ sudo perf stat -r 3 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads ./a.out

[Figure: perf stat output for the original code]

       
Results after Loop Interchange:

    $ time -p ./a.out

[Figure: time output after Loop Interchange]

       
    $ sudo perf stat -r 3 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads ./a.out

[Figure: perf stat output after Loop Interchange]

Comparing the two:

The number of L1-dcache-loads is almost the same, because the total amount of data accessed is about the same;
The ratio of L1-dcache-load-misses to L1-dcache-loads drops by nearly a factor of 10 after the loop interchange change.

At the same time, the performance data shows an improvement of close to 9.5%.

2.2 Special scenario

Of course, in real code not every case is as clean a Loop Interchange scenario as the one shown in 2.1.

       
    for (int i = 0; i < N; ++i)
    {
        if (I[i] != 1) continue;
        for (int m = 0; m < M; ++m)
        {
            Res2 = res[m][i] * res[m][i];
            norm[m] += Res2;
        }
    }

In the scenario above, if N is very large, Loop Interchange can in theory bring a large benefit. However, because a branch sits between the two loops, a loop nest that could otherwise have been interchanged cannot be.

For this scenario, consider moving the intermediate branch out of the way so that Loop Interchange can be applied and the array res is accessed in contiguous memory. The branch logic can then be handled by rolling back its effect inside the two interchanged loops.

       
    for (int m = 0; m < M; ++m)
    {
        for (int i = 0; i < N; ++i)
        {
            Res2 = res[m][i] * res[m][i];
            norm[m] += Res2;
            if (I[i] != 1) // compensation logic that preserves the source semantics
            {
                norm[m] -= Res2;
                continue;
            }
        }
    }

Of course, such a source change also has to be weighed against its cost. If the if branch is taken very often, the cost of the rollback is also high, and it may be necessary to reconsider whether Loop Interchange is worthwhile. Conversely, if the branch is taken rarely, Loop Interchange can still bring a considerable benefit.
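The equivalence of the two forms can be checked with a small self-contained sketch. The sizes, element types, and function names below are assumptions made for illustration; the fragments above do not state them. The check uses integer data, where adding a value and then subtracting it restores the previous sum exactly. With floating-point data the add-then-subtract rollback may differ from the original in the last bits because of rounding.

```c
#include <assert.h>
#include <stddef.h>

enum { M = 3, N = 8 };           /* small hypothetical sizes */

/* Original form: the branch sits between the two loops. */
void norms_original(long norm[M], int res[M][N], const int I[N])
{
    assert(norm && res && I);
    for (size_t i = 0; i < N; ++i) {
        if (I[i] != 1) continue;
        for (size_t m = 0; m < M; ++m)
            norm[m] += (long)res[m][i] * res[m][i];   /* stride N over res */
    }
}

/* Interchanged form with the add-then-roll-back compensation. */
void norms_interchanged(long norm[M], int res[M][N], const int I[N])
{
    assert(norm && res && I);
    for (size_t m = 0; m < M; ++m) {
        for (size_t i = 0; i < N; ++i) {
            long res2 = (long)res[m][i] * res[m][i];  /* stride 1 over res */
            norm[m] += res2;
            if (I[i] != 1)
                norm[m] -= res2;                      /* undo the contribution */
        }
    }
}

/* Returns 1 when both forms produce the same norms on small test data. */
int norms_agree(void)
{
    int res[M][N], I[N];
    long n1[M] = {0}, n2[M] = {0};
    for (int i = 0; i < N; i++) I[i] = (i % 3 == 0) ? 1 : 0;
    for (int m = 0; m < M; m++)
        for (int i = 0; i < N; i++)
            res[m][i] = m * N + i - 5;
    norms_original(n1, res, I);
    norms_interchanged(n2, res, I);
    for (int m = 0; m < M; m++)
        if (n1[m] != n2[m]) return 0;
    return 1;
}
```

Note that the interchanged form does the multiply and two additions even for entries the original skipped, which is the rollback cost discussed above.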

3. Contributions to the Loop Interchange pass in the compiler community

The Bi Sheng compiler team has also made substantial contributions to the Loop Interchange pass in the LLVM community. The team has comprehensively enhanced the pass in terms of legality, profitability, and other aspects, and has greatly expanded the scenarios the pass supports. Over the past two years, team members have contributed more than 20 major patches related to Loop Interchange, covering Loop Interchange itself as well as related analysis and optimization enhancements such as dependence analysis, loop cache analysis, and delinearization. A few examples:

D114916 [LoopInterchange] Enable loop interchange with multiple outer loop indvars (https://reviews.llvm.org/D114916)
D114917 [LoopInterchange] Enable loop interchange with multiple inner loop indvars (https://reviews.llvm.org/D114917)

These two patches extend Loop Interchange to cases where the inner or outer loop contains more than one induction variable:

       
    for (c = 0, e = 1; c + e < 150; c++, e++) {
        d = 5;
        for (; d; d--)
            a |= b[d + e][c + 9];
    }

D118073 [IVDescriptor] Get the exact FP instruction that does not allow reordering (https://reviews.llvm.org/D118073)
D117450 [LoopInterchange] Support loop interchange with floating point reductions (https://reviews.llvm.org/D117450)

These two patches extend Loop Interchange to support scenarios with floating-point reduction computations:

       
    double matrix[dim][dim];

    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j++)
            matrix[i][j] += 1.0;
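Floating-point reductions need special care because floating-point addition is not associative, so changing the order in which terms are summed can change the result. The sketch below is a generic illustration of that fact, not code from the patches above; the function names are hypothetical.

```c
/* (a + b) + c. The volatile temporary forces the intermediate sum to be
 * rounded to double before the second addition. */
double sum_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c): the same three values, added in a different order. */
double sum_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

With a = 1e20, b = -1e20, and c = 1.0, `sum_left` returns 1.0, while `sum_right` returns 0.0 because -1e20 + 1.0 rounds back to -1e20. A compiler therefore has to know whether reordering a given floating-point operation is permitted before it reorders a loop that accumulates such values.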

D120386 [LoopInterchange] Try to achieve the most optimal access pattern after interchange (https://reviews.llvm.org/D120386)

This patch strengthens the interchange capability so that the compiler can permute a loop nest into the globally optimal loop order:

       
    /* Before: the innermost loop (c) indexes the first dimension,
       so consecutive accesses are far apart in memory. */
    void f(int e[100][100][100], int f[100][100][100]) {
        for (int a = 0; a < 100; a++) {
            for (int b = 0; b < 100; b++) {
                for (int c = 0; c < 100; c++) {
                    f[c][b][a] = e[c][b][a];
                }
            }
        }
    }

       
    /* After: the innermost loop (a) indexes the last dimension,
       so consecutive accesses are contiguous in memory. */
    void f(int e[100][100][100], int f[100][100][100]) {
        for (int c = 0; c < 100; c++) {
            for (int b = 0; b < 100; b++) {
                for (int a = 0; a < 100; a++) {
                    f[c][b][a] = e[c][b][a];
                }
            }
        }
    }
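One way to see why the second ordering is preferable is to compute the element stride that each subscript position contributes in a row-major array. The sketch below is a toy illustration under that assumption and is not LLVM's cost model; the function name `row_major_strides` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Element stride of each subscript position in a row-major array with the
 * given dimensions: the last subscript has stride 1, and each earlier one
 * has the product of all later dimension sizes. */
void row_major_strides(const size_t *dims, size_t ndims, size_t *strides)
{
    assert(dims && strides && ndims > 0);
    size_t s = 1;
    for (size_t d = ndims; d-- > 0; ) {   /* walk from the last dimension back */
        strides[d] = s;
        s *= dims[d];
    }
}
```

For `f[c][b][a]` in an `int[100][100][100]` array, the strides are 10000 for c, 100 for b, and 1 for a, so placing the loop over a innermost and the loop over c outermost gives the smallest stride in the innermost loop.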

D124926 [LoopInterchange] New cost model for loop interchange (https://reviews.llvm.org/D124926)

This patch gives loop interchange a new, more capable cost model that judges the profitability of loop interchange more accurately.

In addition, we have contributed many bug-fix patches to the community:

D102300 [LoopInterchange] Check lcssa phis in the inner latch in scenarios of multi-level nested loops
D101305 [LoopInterchange] Fix legality for triangular loops
D100792 [LoopInterchange] Handle lcssa PHIs with multiple predecessors
D98263 [LoopInterchange] fix tightlyNested() in LoopInterchange legality
D98475 [LoopInterchange] Fix transformation bugs in loop interchange
D102743 [LoopInterchange] Handle movement of reduction phis appropriately during transformation (pr43326 && pr48212)
D128877 [LoopCacheAnalysis] Fix a type mismatch bug in cost calculation

And other enhancements:

D115238 [LoopInterchange] Remove a limitation in legality
D118102 [LoopInterchange] Detect output dependency of a store instruction with itself
D123559 [DA] Refactor with a better API
D122776 [NFC][LoopCacheAnalysis] Add a motivating test case for improved loop cache analysis cost calculation
D124984 [NFC][LoopCacheAnalysis] Update test cases to make sure the outputs follow the right order
D124725 [NFC][LoopCacheAnalysis] Use stable_sort() to avoid non-deterministic print output
D127342 [TargetTransformInfo] Added an option for the cache line size
D124745 [Delinearization] Refactoring of fixed-size array delinearization
D122857 [LoopCacheAnalysis] Enable delinearization of fixed sized arrays

Conclusion

If you want to benefit from the Loop Interchange optimization as much as possible, then when writing C/C++ code, keep the stride between consecutive iterations' accesses to an array or sequence as small as possible. The closer the stride is to 1, the higher the spatial locality, the higher the cache hit rate, and the better the resulting performance. In addition, because C/C++ uses row-major ordering, when accessing a multidimensional array make sure the innermost loop iterates over the column index to get the smaller stride.