Connecting a separate machine to the Spark cluster of CDH and submitting tasks remotely (confirmed working, I have tested it n times)

2022-08-09 03:12:00 【I'm going to use code to confess to the girl I like】

I have 4 machines: hadoop1 through hadoop3 form a CDH cluster, and hadoop4 is an ordinary machine.

I use a machine that does not belong to the CDH cluster to make the CDH cluster do the computation remotely, while my local machine takes no part in the computation.

The operation process is as follows:

To understand remote submission, we approach it from 2 angles:
1. Understand the principles and ideas
2. Carry out the operation

Understand the rationale

First, some basic background on Spark.
Spark has 4 submission modes:
local, standalone, yarn, mesos
All of them except local can be submitted to remotely.

local executes the job with the Spark installed on the local machine, and is basically not used except for testing. If the job is meant to run in yarn mode or another mode but the code hard-codes a local master, then even when it is submitted with spark-submit, Spark does not know it should run in yarn mode and still executes in local mode, which results in an error (a master set in the code takes precedence over the one passed to spark-submit). Therefore we use local only when testing code in IDEA. If we want to build a jar package that will be submitted with spark-submit to yarn mode or another mode, we delete the master line from the code in IDEA and then package it.

standalone means that it does not rely on any external resource manager and runs tasks purely on the Spark cluster itself. We submit remotely with master=spark://<spark master node>:7077. Multiple master addresses are separated by commas.
In this case the Spark cluster must be started with master and worker processes, otherwise you cannot connect to the node. By default, CDH is in yarn mode, so you need to start the master and the workers yourself; this is the master-slave architecture of the Spark cluster.
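As a rough illustration of a standalone submission, the sketch below wraps spark-submit in a small shell function. The function name submit_standalone, the SPARK_SUBMIT override, the use of hadoop1 as the master host, the class com.example.MyApp and the jar path are all assumptions made for this example, not part of the setup described here.

```shell
# Hypothetical helper: submit an application to a standalone Spark master.
# $1 = master address as host:port (comma-separated host:port pairs if several)
# All remaining arguments are passed to spark-submit unchanged.
# SPARK_SUBMIT may be overridden (for example with `echo`) for a dry run.
submit_standalone() {
    masters="$1"
    shift
    "${SPARK_SUBMIT:-spark-submit}" --master "spark://${masters}" "$@"
}

# Example call (assumed host, class and jar names):
# submit_standalone hadoop1:7077 --class com.example.MyApp /path/to/my-app.jar
```

Setting SPARK_SUBMIT=echo before calling the function prints the arguments instead of submitting anything, which is a cheap way to review the command first.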

yarn mode is the default mode of CDH and also the most commonly used mode in China. The advantage is that yarn allocates resources and memory automatically. Of course, you can also allocate resources yourself.
yarn mode is divided into two types, client mode and cluster mode.
In both types the computation is actually distributed to the cluster. yarn defaults to client mode, which can be changed with the --deploy-mode option of spark-submit.
In client mode, the driver is started on the current node and submits the tasks to the cluster for execution. You can see the log on the current node.

In cluster mode, the current node submits the application to the cluster, yarn picks a machine in the cluster to run the driver, and that driver then submits the tasks to the cluster for execution. The current machine cannot see the log; it can only be viewed in the yarn service.

Both of these modes can actually be used in a production environment.
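The difference between the two yarn types comes down to one flag on the command line. A minimal sketch, assuming a hypothetical wrapper function submit_yarn, a hypothetical class com.example.MyApp and jar path, and the same SPARK_SUBMIT dry-run override idea (none of these names come from the original setup):

```shell
# Hypothetical helper: submit to YARN in either client or cluster deploy mode.
# $1 = deploy mode (client or cluster); remaining arguments go to spark-submit.
submit_yarn() {
    mode="$1"
    shift
    case "$mode" in
        client|cluster) ;;
        *) echo "deploy mode must be client or cluster" >&2; return 1 ;;
    esac
    "${SPARK_SUBMIT:-spark-submit}" --master yarn --deploy-mode "$mode" "$@"
}

# Driver runs on this machine, so the log appears in this terminal:
# submit_yarn client --class com.example.MyApp /path/to/my-app.jar
#
# Driver runs inside the cluster; look at the log through YARN afterwards,
# for example with `yarn logs -applicationId <application id>`:
# submit_yarn cluster --class com.example.MyApp /path/to/my-app.jar
```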

Mesos mode: I have heard that it is also very efficient and friendlier (more convenient) for remote operation, but unfortunately there is very little information about it in China. I have not found any study materials for it, and I have not experimented with it, so I will not go into details here.

Take action

Here we use one machine. I installed Spark and Hadoop on it and tried to match the versions used by CDH, so I downloaded spark2.4 and hadoop3.0, because cdh6.3.2 also uses these versions.

Of course, you have to install the JDK first, because Hadoop depends on Java.

The process is as follows:

When spark-submit submits a task in yarn mode, it reads yarn-site.xml and the other configuration files under the HADOOP_HOME directory, connects through the driver to the worker nodes (the nodes of the cluster) named in that configuration, executes the task there, and synchronizes the logs back to this node.

The idea of remote processing is as follows:

Copy the Hadoop configuration files of CDH into the local hadoop directory, replacing the local ones. Since the local machine is not configured as a node in those files, none of the computation is allocated to the local machine after the job is submitted to the cluster. However, since the driver runs locally, the processing logs of the cluster are still available locally.
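One way to confirm that the copied configuration really points at the cluster is to read the ResourceManager address out of yarn-site.xml. The following is only a sketch: the function name get_hadoop_prop is hypothetical, it assumes the usual Hadoop layout of `<name>` and `<value>` elements inside each `<property>`, and it assumes the property yarn.resourcemanager.address is set, which may not be the case in every deployment.

```shell
# Hypothetical helper: print the value of one property from a Hadoop XML file.
# $1 = path to the XML file, $2 = property name
get_hadoop_prop() {
    awk -v name="$2" '
        # Remember that the wanted <name> element has been seen.
        index($0, "<name>" name "</name>") { found = 1 }
        # Print the first <value> element that follows it, then stop.
        found && match($0, /<value>[^<]*<\/value>/) {
            print substr($0, RSTART + 7, RLENGTH - 15)
            exit
        }
    ' "$1"
}

# Example call (path follows the environment variables used in this article):
# get_hadoop_prop "$HADOOP_CONF_DIR/yarn-site.xml" yarn.resourcemanager.address
```

If the printed host is one of the cluster nodes rather than the local machine, spark-submit will send the job to the cluster.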

Formal operation:

1. First upload the two packages (Spark and Hadoop) and decompress them into the /hadoop directory

2. On any machine in the CDH cluster, go to /etc/hadoop and copy that directory (this is the hadoop configuration directory of CDH)

3. Create the etc directory locally and put the cluster's /etc/hadoop directory into it, so that it sits at the same path as on the cluster

4. Copy /etc/hadoop (which now holds the CDH configuration files) into the etc directory of the local hadoop installation, overwriting its hadoop directory. Back up the original one first. In the command below, the destination path is the locally installed hadoop. You can only use cp, not mv, because the default configuration of CDH looks for other configuration files under /etc/hadoop, so that directory has to stay in place.

cp -r /etc/hadoop/ /software/hadoop-3.0.0/etc/
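Steps 3 and 4, including the backup, could be scripted roughly as follows. This is a sketch under assumptions: the function name install_cluster_conf and the backup directory name hadoop.orig are made up for this example, and the source directory is expected to be named hadoop so that it lands on top of the local etc/hadoop.

```shell
# Hypothetical helper covering the backup and the copy.
# $1 = cluster configuration directory already present on this machine
#      (for example /etc/hadoop)
# $2 = local Hadoop installation (for example /software/hadoop-3.0.0)
install_cluster_conf() {
    src="${1%/}"                      # drop a trailing slash, if any
    hadoop_home="$2"
    backup="$hadoop_home/etc/hadoop.orig"

    # Back up the stock configuration once; later runs keep the first backup.
    if [ -d "$hadoop_home/etc/hadoop" ] && [ ! -e "$backup" ]; then
        cp -r "$hadoop_home/etc/hadoop" "$backup" || return 1
    fi

    # cp, not mv: the source directory stays where it is.
    # Files with the same name in the destination are overwritten.
    cp -r "$src" "$hadoop_home/etc/"
}

# Example call with the paths used in this article:
# install_cluster_conf /etc/hadoop /software/hadoop-3.0.0
```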

5. Environment variable configuration

export JAVA_HOME=/software/jdk1.8.0_251
export PATH=$PATH:${JAVA_HOME}/bin
export HADOOP_HOME=/software/hadoop-3.0.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/conf
export SPARK_HOME=/software/spark-2.4.0-bin-hadoop2.7
export PATH=$PATH:${SPARK_HOME}/bin
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin

source /etc/profile

Remember to configure hosts (with all nodes) and turn off the firewall.
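Before the first real submission, a quick check that the configuration directory actually contains the cluster files can save some debugging. A minimal sketch, assuming a hypothetical function name check_hadoop_conf and assuming that core-site.xml, hdfs-site.xml and yarn-site.xml are the files that matter for this setup:

```shell
# Hypothetical pre-flight check: confirm the client configuration is in place.
# $1 = configuration directory (defaults to $HADOOP_CONF_DIR)
check_hadoop_conf() {
    conf_dir="${1:-$HADOOP_CONF_DIR}"
    status=0
    for f in core-site.xml hdfs-site.xml yarn-site.xml; do
        if [ ! -f "$conf_dir/$f" ]; then
            echo "missing: $conf_dir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

# Possible smoke test once the check passes. SparkPi ships with the Spark
# distribution; the exact jar file name depends on the Spark build.
# check_hadoop_conf && spark-submit --master yarn --deploy-mode client \
#     --class org.apache.spark.examples.SparkPi \
#     "$SPARK_HOME"/examples/jars/spark-examples_*.jar 10
```

The function returns a non-zero status and lists every missing file on stderr, so it can be chained with && in front of a submission.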