当前位置:网站首页>Handwritten Google's first generation distributed computing framework MapReduce

Handwritten Google's first generation distributed computing framework MapReduce

2022-04-23 20:15:00 nobody234

6.824 lab1 Handwriting Google First generation distributed computing framework MapReduce

lab1 For almost a week , From the first look at a loss , I have no idea what to do , I learned a little go After reading the knowledge of the paper again, sort out the ideas and design the code , Now I gradually have my own understanding of this framework , It's a really disgusting process (java Contestant's initial writing go It made me sick when I was , But now I feel go It's really a very good language , Xi xi xi ), thank XiyouLinux The partners of the interest group , It's really hard for them to stay alone .

Here is the text

----------------------------------- Split line -------------------------------------------

1. MapReduce What is it , Will it be better than my yogurt bun ?

MapReduce yes Google In succession 2003 Google distributed file system was published in (GFS) After the paper , On 2004 The second heavyweight article on distributed released in , Described in the paper MapReduce Is a distributed parallel computing framework , Popular speaking From the perspective of users ( If there is no special indication in the following content, the user refers to the framework user ),MapReduce The frame is such a thing , He gives the following two interfaces for users to use

map(key,value)
reduce(key,list[value])// this key The 1 pet. key 

The user gives the program a task , The framework enlarges and reduces the tasks formally , Then through the user-defined Map Function to split the task in connotation , Thus, a series of key value pairs are generated as return values , The return values of these key value pairs will eventually be aggregated by the framework produce key,list[value] As Reduce The input of the function is processed , After processing, it will be output to the hard disk for users to view , Of course, how to aggregate , How to call map and reduce Distributed processing , Users don't need to know how to output , Users only need to complete these two interface processing tasks to get the results .

The above words are very abstract , Of course, it is abstract to discuss any problem without a specific problem scenario , So if you don't understand it, it doesn't matter. Let's first see what problem he has solved . Look back at this part .

2. Why do we need MapReduce( Distributed computing framework ) Or what problems he solved

2.1 Let's think about these questions first

  1. What is distributed
    My definition of him is : By disassembling and distributing a huge computing or storage task , Cooperate with multiple machines through the network , Process the distributed tasks .( It's simple, but at least you can understand it )
  2. Why do we do distributed
    First of all, we can reach such a consensus , That is, when the amount of data is relatively small , What kind of distributed stuff is bullshit , Single machine performance must be better than multi machine cooperation , Only when the single machine cannot meet the demand, we need to find another way to solve the problem through distribution , It can be said that distribution is actually a helpless move ( That is to say, if those who engage in hardware can help us create machines with unlimited memory and unlimited computing power , There is no need for us who engage in software to optimize , So it's all their fault !!! dog's head )
  3. Why Google First MapRduce Not me
    It's simple because I don't have this need , Google as a universal search engine in the world , He not only crawls a huge amount of Web information on the Internet every day , And save this information , After saving, you have to build an index and analyze the content , To be sure, a single machine can't meet his needs , He's definitely going to do it , But you know , We programmers usually sell very expensive , And we always like fishing, which leads to a long project development cycle ,Google Those bosses are very angry , Some of these bosses know technology, such as Larry - page ( Um. ? page ) When he understood his situation, he quickly summed up the following needs
    1. Need a kind of software with horizontal expansion ability , This lateral expansion capability can be understood as , By buying machines and equipment, we can improve the performance of the service , People don't have to rely on software to optimize
    2. This machine had better not be too expensive, just use it

Page is a technology savvy. He found that it is a coincidence that distributed systems have such capabilities , But you know , Every programmer has his own expertise , Not everyone will be distributed , So is it possible that those students who can do distributed can do something so that other students who can't do distributed can also do distributed ( A little around, ha ha ),MapReduce emerge as the times require .

Let's talk about it as a framework, which shields the details of its own distributed implementation , Distributed storage of data 、 data communication 、 Fault tolerant processing and other parallel computing involves many complex details at the bottom of the system, which are handled by the system , It greatly reduces the burden of software developers . How does he abstract to separate distributed systems from specific tasks

2.2 Our goal is to abstract map reduce

My mother likes to buy this kind of eight treasure rice and cook porridge , This kind of rice is stirred evenly by people in the supermarket and then sold , Out of curiosity, I'd like to see how much it costs , If it's expensive, buy your own ingredients and mix them , So I divided the rice my mother bought into three piles , A pile for my dad , Give me a pile , A pile for my mother , The three of them , The rice in front of them is divided into different types ( millet , Red bean , Mung bean , Chinese wolfberry , Lily …) Divided into 8 Little heap (partition), Three people are 24 Pile up , I think they'll count this 24 Small piles are synthesized according to the type 8 Little heap (shuffle), My sister calculated the price of red bean pile by multiplying the price of ingredients by the weight, and then told me the price of red bean pile ( return ) I wrote it down in my book ( Discs ), My mother is Xiaomi , My father is a mung bean .... I finally calculated how much these eight treasure rice should be sold if they were not mixed according to the records (reduce). Finally, I found that the price difference was not too much

I can't work out this result alone , Fortunately, I will. I can call my family to help , They also don't need to know what the working process of the whole system is , they (worker) as long as

  1. Perform sorting and splitting (map) Eight little piles
  2. Execute and calculate the price according to each pile (reduce) And tell me

I (master) It only needs

  1. Formally split a large pile into three small piles (nMap)
  2. take 24 Pile mixing 8 Pile up (nReduce)
  3. Write the calculation results of each pile in the book ( Discs )

In this way, we completed the calculation task happily .

I mean, all concrete things can be abstracted , If necessary , Abstract things can also be embodied , As long as the model is well built , Any concrete problem can be set on the abstract template , A large task can certainly be split in form or in task volume , But if the problem requires the aggregation of certain features to operate together , Splitting is not a mindless thing , We need to break it up formally first , Let the task get smaller first , Then aggregate from the characteristics (Map), Aggregate and then split , In this way, we can meet our calculation requirements (reduce) This is our abstraction .

3. MapReduce Frame design ideas

Now let's change our perspective , Not from the user's point of view , Instead, think about how the framework should be designed , Let's consider the following questions

  1. form
    1. Need a central control Coordinator Scheduling tasks
    2. It takes a lot of worker Perform tasks
  2. Execute the process
    1. Coordinator
      1. After startup, initialize the generation according to the provided file mapjob list
      2. take DistributeJob Method registration for external use rpc call
      3. When the outside world has a request , distribution map
      4. When mapJob Initialize after distribution reducejob list
      5. distribution reduceJob
    2. work
      1. Ask for a task
      2. Determine the task category (reduce/map)
      3. inform coorfinator The task that oneself completes , And reapply for the task
  3. Possible problems
    1. coordinator What to do if I hang up , Can't , single master,master Hang up and the system will collapse
    2. worker What to do if I hang up , Reassign what originally belonged to this worker The task of

4. Code presentation

The details are here https://github.com/nobodyw-cell/labs/tree/main/MIT_6.824/src

5. Optimized details and optimized details

  1. Concurrent processing is not particularly good , Used a lot of locks , Not based on go Of CSP Concurrency model , Use pipes to do
  2. worker inform coordinator I didn't tell you the type of task I completed when I completed the task , Easy to cause confusion

版权声明
本文为[nobody234]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/04/202204210555176566.html

随机推荐