当前位置:网站首页>A can make large data clustering method of 2000 times faster, don't poke
A can make large data clustering method of 2000 times faster, don't poke
2022-08-10 13:20:00 【Huawei cloud】
一、问题描述
National astronomical observatory has a clustering task:共11份数据,Each data extracted from a photo,包含500多万条记录,Each record is a celestial coordinates and attributes.11张“照片”Some of the celestial coordinates are repeated in,But these repeated coordinates are not identical,They will have some differences but not too far distance.Task is to take one“照片”作为基础,Find duplicate objects from other photos,The repetition of the celestial coordinates and average as the final coordinates, and the object attribute,Is very close to the distance of objects of a class to do aggregation operation,So you can get a clear and coordinate information more accurate object“照片”.
在这里插入图片描述 二、问题分析
This task is not complicated,As long as the loop basis each celestial coordinates in the photo,The distance to each celestial coordinates calculation of the other photos,No more than a certain threshold is considered to be the same object,As a kind of,Finally all celestial coordinates calculating mean in each category are the coordinates of the object.

But when I found the task of calculation by computer in computation is amazing,Base photos need cycle500多万次,Each of these celestial coordinates and with other in the photo5000More than a coordinate calculation distance,计算复杂度是500多万*5000多万,This will be an astronomical.
事实也确实如此,在实验阶段,Reduce the amount of data per photo10倍,Namely each photo of celestial coordinates for50万,用PythonWrite code to realize the method to calculate the11Photo of the clustering results need time is6.5天.According to the computational complexity to calculate,500多万的数据量,计算量是50万数据量的100倍,That takes time650天,This must be an unacceptable number.
在这里插入图片描述 同样的50万数据量,After being loaded a distributed database withSQL实现,动用了100颗CPU后,跑了3.8Hours to complete the calculation.看起来比Python快了很多倍,但Python的6.5Day is a single thread,细算下来SQLSingle-core performance as wellPython(3.8小时*100>6.5天).Huge resources consumption has been difficult to tolerate,而且计算500When more than the size of, also want to be380小时.
三、解决方案
Let's consider where can be optimized to reduce the amount of calculation.Based the celestial coordinates in the photo is must cycle,In this way can we ensure that each object is used to cluster the,Other pictures of the celestial coordinates need not traverse every time,As long as find a base near the coordinates of the celestial coordinates is ok.This type of search task is very suitable for dichotomy,It can be much less amount of calculation.
具体过程是这样的:Order of the celestial coordinates of each photo first,Using dichotomy found within the scope of a certain threshold of celestial coordinates,Thus exclude the most heavenly bodies,This is coarse screening process;With basic objects and objects in coarse sieve results calculated distance,To find a qualified result,This is a fine screen process.
在这里插入图片描述 To see the screen and fine screen method of computation,10Photo of each sort a,计算量是500万log(500万)10;Dichotomy screen,计算量是500万log(500万)10;Fine screen process,Calculation not sure,但根据经验,Screen after the result is usually not more than1万个,Coarse sieve in the amount of calculationlog(500万)还要再加1万;这样算下来,The total amount of calculation is about500万log(500万)10+500万(log(500万)+1万)10,Compared with the original method,Only one over five hundred of the original amount of calculation.

四、技术选型
方法有了,To select the program tool,Implemented with beforePython,不可否认Python很强大,A framework of the existing system of astronomy calculation,Such as the calculation method of the distance,As long as the call ready-made class library can easily calculate.
但PythonAlso has a very serious disadvantages:
PythonThere are no native dichotomy method,Third-party libraries should combinePandas来完成,Need to do some data transfer during,All of these will bring some unnecessary overhead.
Python的多线程是假多线程,In fact don't support multithreaded parallel,这也是PythonCan't become an important reason this task tools.
关系数据库的SQLAlso unable to efficiently complete.The clustering arithmetic is essentially a non equal join,Database for contour connection can also adoptHASH JOINSuch as optimization scheme to reduce the amount of calculation,But for the equivalent connecting traverse scheme can only be adopted;SQLAlso cannot be achieved in the statement above design process of complex,The monotonicity of the distance can't identify the initiative order and using dichotomy;Plus was supposed to do this kind of mathematical calculation ability is insufficient(Distance calculation involving trigonometric function);In the experimental stage in front of the soSQLSingle-core performance also run butPython的现象.
JavaA high-level language such as though can realize dichotomy,Can also be good parallel,But the code to write lengthy,开发效率过低,Will seriously affect the application of maintainability.
那么,What tools can also be used to complete the task?集算器SPL是个很好的选择,It built in many high performance algorithm(如二分法),也支持多线程并行,Code to write also simple and clear,Also provides a friendly visual debugging mechanism,能有效提高开发效率,And lower maintenance cost.
在这里插入图片描述 五、实际效果
相较于Python来说,SPLFor this task to speed up2000倍,Dichotomy to speed up500倍,The multi-thread parallel speed up again4倍(The laptopCPU只有4核),A total of speed2000倍,使用SPL完成500More than the target size of clustering task only need a few hours.
SPLThe code performance not only,And it also is not complicated,The key calculation code as long as23行.
| A | B | C | D | E | |
| 1 | =RThd | /距离阈值 | |||
| 2 | =NJob=4 | /并行线程数 | |||
| 3 | =file("BasePhoto.csv")[email protected]() | ||||
| 4 | [email protected](OtherPhotos) | /Other photos path | |||
| 5 | for A4 | =file(A4)[email protected]() | /Other photos | ||
| 6 | [email protected](OnOrbitDec) | /排序 | |||
| 7 | =B6.min(DEC) | ||||
| 8 | =delta_ra=F(B7,RThd) | /根据DEC算RA阈值 | |||
| 9 | =FK(B5,NJob) | /Data index fragmentation | |||
| 10 | fork B9 | =B5(B10) | /Photo clips | ||
| 11 | for A3 | =C11.OnOrbitDec | /DEC | ||
| 12 | =D11-delta_rad | /DEC下限 | |||
| 13 | =D11+delta_rad | /DEC上限 | |||
| 14 | =C11.RA | /RA | |||
| 15 | =D14-delta_ra | /RA下限 | |||
| 16 | =D14+delta_ra | /RA上限 | |||
| 17 | [email protected]([email protected](OnOrbitDec,D12:D13)) | /二分查找DEC | |||
| 18 | =D17.select(RA>=D10&&RA<=D11) | /查找RA | |||
| 19 | =D36.select(Dis(~,C11)<=A7) | /细筛 | |||
| 20 | if D19!=[] | /合并结果 | |||
| 21 | =FC(C11,D37) | ||||
| 22 | [email protected]|B10 | /汇总结果 | |||
| 23 | =file(OFile)[email protected](B22) | /写出结果 | |||
B10格的forkIs a multithreaded parallel function,Allows section to perform the above algorithm.B6格的[email protected]()Function is parallel sorting method,Large amount of data can improve the efficiency of,The premise of orderly data dichotomy is used.C17格的[email protected](…)Function is a binary search method,Is the key to this task to speed up.
在这里插入图片描述 六、后记
Performance optimization problem relies on the high performance algorithm,Only cut down the amount of calculation can effectively improve the efficiency,And the high performance algorithm need to accumulate slowly in the work,Will anyone interested in can come here to study the performance of the commonly used optimization algorithm:Performance optimization algorithm.
High performance algorithm need efficient programming tools to realize,之前已经说过,Python、SQL、javaHas its disadvantages such as language,Or can't parallel,Either implementation difficulties、维护困难.SPLHave enough algorithm underlying support and allows high concurrency,The code can do it very simple,Also provides a friendly visual debugging mechanism,能有效提高开发效率,And lower maintenance cost.
## 七、SPL资料
High performance algorithm need efficient programming tools to realize,之前已经说过,Python、SQL、javaHas its disadvantages such as language,Or can't parallel,Either implementation difficulties、维护困难.SPLHave enough algorithm underlying support and allows high concurrency,The code can do it very simple,Also provides a friendly visual debugging mechanism,能有效提高开发效率,And lower maintenance cost.
边栏推荐
- iTextSharp操作PDF
- Detailed explanation of es6-promise object
- Codeforces Round #276 (Div. 1) D. Kindergarten
- Open Office XML 格式里如何描述多段具有不同字体设置的段落
- 【学习笔记】Redis的持久化
- Proprietary cloud ABC Stack, the real strength!
- Requirements for the construction of Loudi stem cell preparation laboratory
- ABAP 里文件操作涉及到中文字符集的问题和解决方案试读版
- Codeforces Round #276 (Div. 1) B. Maximum Value
- 九宫格抽奖动效
猜你喜欢

交换机的基础知识

实践为主,理论为辅!腾讯大佬MySQL高阶宝典震撼来袭!

Nanodlp v2.2/v3.0光固化电路板,机械开关/光电开关/接近开关的接法和系统状态电平设置

Digicert EV证书签名后出现“证书对于请求用法无效”的解决方案

DNS欺骗-教程详解

Ethernet channel Ethernet channel

jenkins数据迁移和备份

Keithley DMM7510 accurate measurement of ultra-low power consumption equipment all kinds of operation mode power consumption

代码随想录笔记_动态规划_70爬楼梯

BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection 论文笔记
随机推荐
iTextSharp操作PDF
wirshark 常用操作及 tcp 三次握手过程实例分析
【jstack、jps命令使用】排查死锁
11 + chrome advanced debugging skills, learn to direct efficiency increases by 666%
【目标检测】小脚本:提取训练集图片与标签并更新索引
没有接班人,格力只剩“明珠精选”
C#报错 The ‘xmins‘ attribute is not supported in this context
22!Beijing Changping District notified catering service enterprises with food safety problems
Jiugongge lottery animation
MySQL面试题——MySQL常见查询
娄底污水处理厂实验室建设管理
代码随想录笔记_动态规划_70爬楼梯
表中存在多个索引问题? - 聚集索引,回表,覆盖索引
百度用户产品流批一体的实时数仓实践
A detailed explanation of implementation api embed
关于flask中static_folder 和 static_url_path参数理解
山水的高度
多线程下自旋锁设计基本思想
Educational Codeforces Round 41 (Rated for Div. 2) E. Tufurama
Nanodlp v2.2/v3.0 light curing circuit board, connection method of mechanical switch/photoelectric switch/proximity switch and system state level setting