当前位置:网站首页>Handwritten Google's first generation distributed computing framework MapReduce
Handwritten Google's first generation distributed computing framework MapReduce
2022-04-23 20:15:00 【nobody234】
6.824 lab1 Handwriting Google First generation distributed computing framework MapReduce
lab1 For almost a week , From the first look at a loss , I have no idea what to do , I learned a little go After reading the knowledge of the paper again, sort out the ideas and design the code , Now I gradually have my own understanding of this framework , It's a really disgusting process (java Contestant's initial writing go It made me sick when I was , But now I feel go It's really a very good language , Xi xi xi ), thank XiyouLinux The partners of the interest group , It's really hard for them to stay alone .
Here is the text
----------------------------------- Split line -------------------------------------------
1. MapReduce What is it , Will it be better than my yogurt bun ?
MapReduce yes Google In succession 2003 Google distributed file system was published in (GFS) After the paper , On 2004 The second heavyweight article on distributed released in , Described in the paper MapReduce Is a distributed parallel computing framework , Popular speaking From the perspective of users ( If there is no special indication in the following content, the user refers to the framework user ),MapReduce The frame is such a thing , He gives the following two interfaces for users to use
map(key,value)
reduce(key,list[value])// this key The 1 pet. key
The user gives the program a task , The framework enlarges and reduces the tasks formally , Then through the user-defined Map Function to split the task in connotation , Thus, a series of key value pairs are generated as return values , The return values of these key value pairs will eventually be aggregated by the framework produce key,list[value] As Reduce The input of the function is processed , After processing, it will be output to the hard disk for users to view , Of course, how to aggregate , How to call map and reduce Distributed processing , Users don't need to know how to output , Users only need to complete these two interface processing tasks to get the results .
The above words are very abstract , Of course, it is abstract to discuss any problem without a specific problem scenario , So if you don't understand it, it doesn't matter. Let's first see what problem he has solved . Look back at this part .
2. Why do we need MapReduce( Distributed computing framework ) Or what problems he solved
2.1 Let's think about these questions first
- What is distributed
My definition of him is : By disassembling and distributing a huge computing or storage task , Cooperate with multiple machines through the network , Process the distributed tasks .( It's simple, but at least you can understand it ) - Why do we do distributed
First of all, we can reach such a consensus , That is, when the amount of data is relatively small , What kind of distributed stuff is bullshit , Single machine performance must be better than multi machine cooperation , Only when the single machine cannot meet the demand, we need to find another way to solve the problem through distribution , It can be said that distribution is actually a helpless move ( That is to say, if those who engage in hardware can help us create machines with unlimited memory and unlimited computing power , There is no need for us who engage in software to optimize , So it's all their fault !!! dog's head ) - Why Google First MapRduce Not me
It's simple because I don't have this need , Google as a universal search engine in the world , He not only crawls a huge amount of Web information on the Internet every day , And save this information , After saving, you have to build an index and analyze the content , To be sure, a single machine can't meet his needs , He's definitely going to do it , But you know , We programmers usually sell very expensive , And we always like fishing, which leads to a long project development cycle ,Google Those bosses are very angry , Some of these bosses know technology, such as Larry - page ( Um. ? page ) When he understood his situation, he quickly summed up the following needs- Need a kind of software with horizontal expansion ability , This lateral expansion capability can be understood as , By buying machines and equipment, we can improve the performance of the service , People don't have to rely on software to optimize
- This machine had better not be too expensive, just use it
Page is a technology savvy. He found that it is a coincidence that distributed systems have such capabilities , But you know , Every programmer has his own expertise , Not everyone will be distributed , So is it possible that those students who can do distributed can do something so that other students who can't do distributed can also do distributed ( A little around, ha ha ),MapReduce emerge as the times require .
Let's talk about it as a framework, which shields the details of its own distributed implementation , Distributed storage of data 、 data communication 、 Fault tolerant processing and other parallel computing involves many complex details at the bottom of the system, which are handled by the system , It greatly reduces the burden of software developers . How does he abstract to separate distributed systems from specific tasks
2.2 Our goal is to abstract map reduce
My mother likes to buy this kind of eight treasure rice and cook porridge , This kind of rice is stirred evenly by people in the supermarket and then sold , Out of curiosity, I'd like to see how much it costs , If it's expensive, buy your own ingredients and mix them , So I divided the rice my mother bought into three piles , A pile for my dad , Give me a pile , A pile for my mother , The three of them , The rice in front of them is divided into different types ( millet , Red bean , Mung bean , Chinese wolfberry , Lily …) Divided into 8 Little heap (partition), Three people are 24 Pile up , I think they'll count this 24 Small piles are synthesized according to the type 8 Little heap (shuffle), My sister calculated the price of red bean pile by multiplying the price of ingredients by the weight, and then told me the price of red bean pile ( return ) I wrote it down in my book ( Discs ), My mother is Xiaomi , My father is a mung bean .... I finally calculated how much these eight treasure rice should be sold if they were not mixed according to the records (reduce). Finally, I found that the price difference was not too much
I can't work out this result alone , Fortunately, I will. I can call my family to help , They also don't need to know what the working process of the whole system is , they (worker) as long as
- Perform sorting and splitting (map) Eight little piles
- Execute and calculate the price according to each pile (reduce) And tell me
I (master) It only needs
- Formally split a large pile into three small piles (nMap)
- take 24 Pile mixing 8 Pile up (nReduce)
- Write the calculation results of each pile in the book ( Discs )
In this way, we completed the calculation task happily .
I mean, all concrete things can be abstracted , If necessary , Abstract things can also be embodied , As long as the model is well built , Any concrete problem can be set on the abstract template , A large task can certainly be split in form or in task volume , But if the problem requires the aggregation of certain features to operate together , Splitting is not a mindless thing , We need to break it up formally first , Let the task get smaller first , Then aggregate from the characteristics (Map), Aggregate and then split , In this way, we can meet our calculation requirements (reduce) This is our abstraction .
3. MapReduce Frame design ideas
Now let's change our perspective , Not from the user's point of view , Instead, think about how the framework should be designed , Let's consider the following questions
- form
- Need a central control Coordinator Scheduling tasks
- It takes a lot of worker Perform tasks
- Execute the process
- Coordinator
- After startup, initialize the generation according to the provided file mapjob list
- take DistributeJob Method registration for external use rpc call
- When the outside world has a request , distribution map
- When mapJob Initialize after distribution reducejob list
- distribution reduceJob
- work
- Ask for a task
- Determine the task category (reduce/map)
- inform coorfinator The task that oneself completes , And reapply for the task
- Coordinator
- Possible problems
- coordinator What to do if I hang up , Can't , single master,master Hang up and the system will collapse
- worker What to do if I hang up , Reassign what originally belonged to this worker The task of
4. Code presentation
The details are here https://github.com/nobodyw-cell/labs/tree/main/MIT_6.824/src
5. Optimized details and optimized details
- Concurrent processing is not particularly good , Used a lot of locks , Not based on go Of CSP Concurrency model , Use pipes to do
- worker inform coordinator I didn't tell you the type of task I completed when I completed the task , Easy to cause confusion
版权声明
本文为[nobody234]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204210555176566.html
边栏推荐
- Possible root causes include a too low setting for -Xss and illegal cyclic inheritance dependencies
- 论文写作 19: 会议论文与期刊论文的区别
- R语言使用econocharts包创建微观经济或宏观经济图、indifference函数可视化无差异曲线、自定义计算交叉点、自定义配置indifference函数的参数丰富可视化效果
- Fundamentals of programming language (2)
- R language uses the preprocess function of caret package for data preprocessing: BoxCox transform all data columns (convert non normal distribution data columns to normal distribution data and can not
- How does onlyoffice solve no route to host
- nc基础用法2
- [problem solving] 'ASCII' codec can't encode characters in position XX XX: ordinal not in range (128)
- 【数值预测案例】(3) LSTM 时间序列电量预测,附Tensorflow完整代码
- 如何在BNB链上创建BEP-20通证
猜你喜欢
SIGIR'22 "Microsoft" CTR estimation: using context information to promote feature representation learning
Grafana shares links with variable parameters
基于pytorch搭建GoogleNet神经网络用于花类识别
【文本分类案例】(4) RNN、LSTM 电影评价倾向分类,附TensorFlow完整代码
CVPR 2022 | querydet: use cascaded sparse query to accelerate small target detection under high resolution
考研英语唐叔的语法课笔记
PCL点云处理之计算两平面交线(五十一)
How to create bep-20 pass on BNB chain
网络通信基础(局域网、广域网、IP地址、端口号、协议、封装、分用)
Azkaban recompile, solve: could not connect to SMTP host: SMTP 163.com, port: 465 [January 10, 2022]
随机推荐
Record: call mapper to report null pointer Foreach > the usage of not removing repetition;
程序设计语言基础(2)
IIS data conversion problem: 16bit to 24bit
Shanda Wangan shooting range experimental platform project - personal record (IV)
NC basic usage 4
R语言使用timeROC包计算无竞争风险情况下的生存资料多时间AUC值、使用confint函数计算无竞争风险情况下的生存资料多时间AUC指标的置信区间值
[numerical prediction case] (3) LSTM time series electricity quantity prediction, with tensorflow complete code attached
php参考手册String(7.2千字)
SIGIR'22「微软」CTR估计:利用上下文信息促进特征表征学习
波场DAO新物种下场,USDD如何破局稳定币市场?
考研英语唐叔的语法课笔记
R语言ggplot2可视化:ggplot2可视化散点图并使用geom_mark_ellipse函数在数据簇或数据分组的数据点周围添加椭圆进行注释
微信中金财富高端专区安全吗,证券如何开户呢
图书管理数据库系统设计
山东大学软件学院项目实训-创新实训-网络安全靶场实验平台(六)
DTMF dual tone multi frequency signal simulation demonstration system
The R language uses the timeroc package to calculate the multi time AUC value of survival data without competitive risk, and uses the confint function to calculate the confidence interval value of mul
还在用 ListView?使用 AnimatedList 让列表元素动起来
STM32 Basics
[target tracking] pedestrian attitude recognition based on frame difference method combined with Kalman filter, with matlab code