当前位置:网站首页>Good code in the eyes of a compiler engineer: Loop Interchange
Good code in the eyes of a compiler engineer: Loop Interchange
2022-08-05 14:56:00 【51CTO】
摘要:本文将以Loop Interchange的场景为例,Describe the way you can get better performance when writing code.
本文分享自华为云社区《 Good code in the eyes of a compiler engineer(1):Loop Interchange》,作者:Bi Sheng's assistant.
编者按:C/C++代码在编译时,The compiler translates the source code intoCPURecognized sequence of instructions and generate executable code,The efficiency of the final code depends on the executable code generated by the compiler.在大部分情况下,The degree to which the program can be optimized by the compiler is determined when the source code is written.Even minor changes to source code can have a significant impact on how efficiently the code generated by the compiler runs.因此,Source code optimization can help the compiler to generate more efficient executable code to a certain extent.
本文将以Loop Interchange的场景为例,Describe the way you can get better performance when writing code.
1、Loop Interchange 相关基本概念
1.1 访问局部性
Access locality refers to when an application accesses memory in computer science,倾向于访问内存中较为靠近的值.This locality is a predictable behavior that occurs in computer systems,We can take advantage of this strong access locality of the system for performance optimization.访问局部性分为三种基本形式,temporal locality、空间局部性、sequential locality.
This article mainly discusses theLoop InterchangeMainly using the spatial locality.空间局部性指的是,最近引用过的内存位置以及其周边的内存位置容易再次被使用.more common in loops,e.g. in an array,如果第3Elements in a loop is used,Then in this cycle it is very likely that the first4个元素;If the cycle does use first4个元素,is to hit the previous iterationprefetch到的cache数据.
So for the array loop operation,The feature of spatial locality can be exploited,Ensure that two adjacent loop access to an array element in memory is more close to,i.e. when looping through elements in an arraystride越小,Corresponding performance may be optimized.
那么,How are arrays stored in memory??
1.2 Row-major 和 Column-major
Row-major 和 Column-major are two ways to store multidimensional arrays in linear storage.Elements of an array are contiguous in memory;Row-major orderingConsecutive elements representing rows are next to each other in memory,而Column-major orderingthen the consecutive elements representing the column are adjacent to each other,如下图所示.

虽然Row和ColumnThe name looks like it refers specifically to a 2D array,但是Row-major和Column-majorCan also be generalized for arrays of any dimension.
那么在C/C++中,In which way is the array stored??
举一个小例子,用cachegrind工具来展示Cusing two different forms of accessCPU的cacheLoss rate comparison.
按行访问:

按列访问:

根据上述CWhen two different accesses to the same array in the codecacheComparison of loss rates,可以说明在C/C++代码中,数组是以Row-major形式储存的.也就是说,If the previous step visiteda[i][j],那么对a[i][j+1]的访问会命中cache.So as not to perform access to main memory,而cachefaster than main memory,So follow the storage form of the corresponding programming language to make it hitcachemay lead to optimization.
As for other commonly used programming languages,Fortran、MATLABetc is the defaultColumn-major形式.
1.3 Loop Interchange
Loop InterchangeTake advantage of the system's tendency to access values that are closer in memory andC/C++ Row-major的特点,By changing the order of execution between two loops in a loop nest,Increase overall code space locality.此外,It can also enable other important transcoding,例如,Loop Reordering就是Loop InterchangeOptimisation that extends to when more than two loops are reordered.在LLVM中,Loop Interchangeneed to be enabled-mllvm -enable-loopinterchange选项启用.
2、优化示例
2.1 基础场景
Just look at the following example of a matrix operation:
原始代码:
试着把jktwo-tier cycleLoop Interchange之后的代码:
可以发现,在原始代码中,最内层的k每次迭代,CThe data to be accessed does not change,A每次访问的stride为1,high probability of hitcache,但Bdue to each visitstride为1024,almost every timecache miss.
Loop Interchange之后,jin the innermost loop,每次迭代时AThe data to be accessed does not change each time,C和B每次访问的stride为1,will have a high probability of hittingcache,cacheThe hit rate is greatly increased.
那么cacheDoes the hit rate really increase?,And what about the performance of both?
原始代码:




两者相比:
L1-dcache-loadsalmost the number of,Because the total amount of data to be accessed is about the same;
L1-dcache-load-misses所占L1-dcache-loadsproportion is going onloop interchangeAfter the code modification, the reduction is nearly10倍.
同时,Can also bring close to the performance data9.5%的性能提升.
2.2 特殊场景
当然,在实际使用时,Not all scenarios are2.1The neat can be shown inLoop Interchange场景.
as above scenario,如果Nis a very large array,那么Loop InterchangeTheoretically, it can bring greater benefits;However, due to the addition of a branch judgment in the middle of the two-layer loop,Cause could have beenLoop Interchangescenarios cannot be realized.
针对这种场景,Consider stripping the middle branch judgment logic,can be guaranteedLoop Interchange使得数组resaccess on contiguous memory;As for the intermediate judgment branch logic,可以在Loop InterchangeRewind after two layers of loops.
当然,Such source code modifications also need to be consideredcost是否值得,如果该ifBranch entry frequency is very high,Then the fallback bringscost也会较大,may need to be reconsideredLoop Interchange是否值得;反之,If the branch entry frequency is very low,那么Loop Interchangecan still bring considerable benefits.
3、Used for compilerLoop Interchange pass社区的贡献
Bi Sheng compiler team inllvm社区中对Loop Interchange pass也做出了不小的贡献.团队从legality、profitability等方面对Loop Interchange passcomprehensive enhancement,also thepassThe supported scenarios have been greatly expanded.在Loop Interchange方面,In the past two years, the team partners have provided more than 20 majorpatch,包含Loop Interchange,以及相关的dependence analysis、loop cache analysis、delinearizationOther analysis and optimization enhancements.简单举几个例子:
- D114916 [LoopInterchange] Enable loop interchange with multiple outer loop indvars ( https://reviews.llvm.org/D114916)
- D114917 [LoopInterchange] Enable loop interchange with multiple inner loop indvars
( https://reviews.llvm.org/D1149167)
这两个patch将Loop InterchangeThe application scenario is extended to include more than one in the inner or outer loopinduction variable的情况:
- D118073 [IVDescriptor] Get the exact FP instruction that does not allow reordering
( https://reviews.llvm.org/D118073) - D117450 [LoopInterchange] Support loop interchange with floating point reductions( https://reviews.llvm.org/D117450)
这两个patch将Loop InterchangeThe application scenario is extended to support floating-point typereduction计算的场景:
D120386 [LoopInterchange] Try to achieve the most optimal access pattern after interchange
( https://reviews.llvm.org/D120386)
这个patch增强了InterchangeThe ability to enable the compiler to convert the loop bodypermutebecome the globally optimal round-robin order:
=>
D124926 [LoopInterchange] New cost model for loop interchange
( https://reviews.llvm.org/D124926)
这个patch为loop interchange提供了一个全新的,功能更强的cost model,more accuratelyloop interchange的profitability做出判断.
此外,We also provide the community with a lot ofbugfix的patch:
- D102300 [LoopInterchange] Check lcssa phis in the inner latch in scenarios of multi-level nested loops
- D101305 [LoopInterchange] Fix legality for triangular loops
- D100792 [LoopInterchange] Handle lcssa PHIs with multiple predecessors
- D98263 [LoopInterchange] fix tightlyNested() in LoopInterchange legality
- D98475 [LoopInterchange] Fix transformation bugs in loop interchange
- D102743 [LoopInterchange] Handle movement of reduction phis appropriately during transformation (pr43326 && pr48212)
- D128877 [LoopCacheAnalysis] Fix a type mismatch bug in cost calculation
and other enhancements:
- D115238 [LoopInterchange] Remove a limitation in legality
- D118102 [LoopInterchange] Detect output dependency of a store instruction with itself
- D123559 [DA] Refactor with a better API
- D122776 [NFC][LoopCacheAnalysis] Add a motivating test case for improved loop cache analysis cost calculation
- D124984 [NFC][LoopCacheAnalysis] Update test cases to make sure the outputs follow the right order
- D124725 [NFC][LoopCacheAnalysis] Use stable_sort() to avoid non-deterministic print output
- D127342 [TargetTransformInfo] Added an option for the cache line size
- D124745 [Delinearization] Refactoring of fixed-size array delinearization
- D122857 [LoopCacheAnalysis] Enable delinearization of fixed sized arrays
结语
If you want to use as much as possibleLoop Interchange优化,that's writingC/C++代码时,Please ensure access to the array or sequence between each iteration as much as possiblestride越小越好;stride越接近1,The higher the spatial locality,自然cacheThe hit rate will also be higher,You can also get more ideal benefits in terms of performance data.另外,由于C/C++的存储方式为Row-major ordering,So when accessing a multidimensional array,Note that the inner loop should beColumnto get smallerstride.
参考
[1] https://zhuanlan.zhihu.com/p/455263968[2] https://valgrind.org/info/tools.html#cachegrind
[3] https://blog.csdn.net/gengshenghong/article/details/7225775
[4] https://en.wikipedia.org/wiki/Loop_interchange
[5] https://johnysswlab.com/loop-optimizations-how-does-the-compiler-do-it/#footnote_6_1738
[6] https://blog.csdn.net/PCb4jR/article/details/85241114
[7] https://blog.csdn.net/Darlingqiang/article/details/118913291
[8] https://en.wikipedia.org/wiki/Row-_and_column-major_order
[9] https://en.wikipedia.org/wiki/Locality_of_reference
边栏推荐
猜你喜欢

LeetCode Question of the Day (1706. Where Will the Ball Fall)

Today's sleep quality record 78 points

如何用一条命令将网页转成电脑 App

Postgresql源码(67)LWLock锁的内存结构与初始化

Jmeter接口测试响应数据中文显示为Unicode码的解决方法

The Hyper - V virtualization vmware data recovery 】 【 file is missing, virtualization server unavailable data recovery case

抖音自媒体运营的5个技巧,让你的账号快速涨粉

抖音自媒体账号被限流?这3种方法教你如何鉴别

PC端浏览器兼容

Advanced roadmap for technical salary increase of full-stack software test engineers (with information)
随机推荐
2022最新综述 | 面向大规模场景的小目标检测:综述和 benchmark
アィシャ / 艾夏
addSrouce(sourceFunction),但是我MysqlSource.build之后没有
自媒体爆文如何写作?学会这4点,能让你快速提升阅读量
LeetCode Question of the Day (1706. Where Will the Ball Fall)
7 RESTful
兆骑科创高层次人才创业赛事活动,创新创业人才引进平台
概率论基础 - 6 - 切比雪夫不等式
Orthogonal-Uncorrelated-Independent
Treasure Project(藏宝计划)冲刺百倍!
Taurus.MVC WebAPI 入门开发教程3:路由类型和路由映射。
概率论基础 - 1 - 基础概念
d rebinding unchanged
-ST table template
广发期货手机开户是安全的吗?
Burp Suite的代理Brup Proxy的使用详解
CRM giant loses China, Salesforce China will be disbanded?
Design of Fingerprint Time Attendance Machine + Host Computer Management Based on STM32 MCU
马氏距离 (马哈拉诺比斯距离) (Mahalanobis distance)
Flexible and easy-to-use sql monitoring script part2