How PySpark Works
2022-08-08 22:01:00 【Code_LT】
Python has gained momentum in recent years, occupying the first or second position in many programming language rankings. It is friendly to beginners, its programming style is elegant, and its development efficiency is high; these traits have made Python the choice of many practitioners in the Internet industry. In particular, Python's rich ecosystem for data science has led many software architects to adopt it so they can use a single language in scenarios that require both system architecture and data algorithms. Among the development languages supported by Spark, Python therefore has a relatively high share of usage.
When I first came into contact with Spark and saw that it supported development in Python, I took it for granted that Spark would convert the PySpark code we wrote into Java bytecode or the underlying machine language and then run it on each machine node. Of course, that was just a guess without further research. So how does Spark actually run code developed with PySpark?
Logical architecture diagram of the Spark core framework
When Spark applications are developed in Scala or Java, as shown in the figure above, both the driver and the executors run and execute tasks on the JVM. When an application is developed in Python, Spark keeps the core architecture unchanged and wraps a Python layer around it. The core functions of Spark, including applying for compute resources, managing and scheduling tasks, the communication between the driver and the executors, the communication among executors, and the RDDs themselves, are all still carried by the JVM.
On the driver side, the PySpark application written by the user communicates with the JVM driver through py4j. The SparkContext created in the PySpark application is mapped to a SparkContext object in the JVM driver, and when an action is executed on an RDD created in the program, the RDD and the action are likewise mapped into the JVM driver and executed there. On the executor side, PySpark workers are started through the pyspark daemon, and the UDFs and lambda functions implemented in Python are executed in these workers. Based on socket communication, the executor sends data to the worker and the worker returns results to the executor. The RDDs themselves still live in the executor JVM; only when there is UDF or lambda logic does the executor need to exchange data with the Python worker.
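The following is a minimal sketch, not from the original article, that illustrates this split on a local Spark installation: the Python SparkContext is only a wrapper over a JVM object reached through py4j, while the lambda passed to map is pickled and executed in Python worker processes launched by the executors. The application name and master URL are illustrative assumptions.

```python
# Minimal sketch, assuming a local pyspark installation.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pyspark-arch-demo").setMaster("local[2]")
sc = SparkContext(conf=conf)

# The Python SparkContext is a thin wrapper: sc._jsc is the JavaSparkContext
# living inside the driver JVM, reached over a py4j gateway socket.
print(type(sc._jsc))  # a py4j JavaObject proxy for the JVM-side context

rdd = sc.parallelize(range(10))

# The lambda below is pickled on the driver and shipped to the executors;
# each executor starts Python workers (via the pyspark daemon) to run it,
# exchanging data with the executor JVM over a local socket.
result = rdd.map(lambda x: x * x).sum()
print(result)  # 285

sc.stop()
```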
Spark's design makes it very convenient to extend support to multiple development languages. However, it is also clear that, compared with a UDF running inside the JVM, a UDF executed in a Python worker pays the extra cost of data serialization, deserialization, and communication IO between the executor JVM and the Python worker; in addition, Python itself has a certain performance disadvantage compared with Java. In Spark jobs where computation logic makes up a large share of the work, a PySpark program that uses custom Python UDFs will therefore show noticeably more performance loss. Of course, using the built-in functions of Spark SQL reduces or removes the performance gap described above.
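As a minimal sketch of this trade-off (not from the original article; the session name and sample data are illustrative assumptions), the two queries below produce the same result, but the Python UDF routes every row through a Python worker, while the built-in function is evaluated entirely inside the JVM.

```python
# Minimal sketch contrasting a Python UDF with a built-in Spark SQL function.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, upper, col
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .master("local[2]")
         .appName("udf-vs-builtin")
         .getOrCreate())

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# Python UDF: each row is serialized, sent to a Python worker, transformed
# there, then sent back to the executor JVM -- extra IO per row.
py_upper = udf(lambda s: s.upper(), StringType())
df.select(py_upper(col("name")).alias("upper_name")).show()

# Built-in function: the expression is evaluated inside the JVM by the
# SQL engine, so no Python worker or serialization is involved.
df.select(upper(col("name")).alias("upper_name")).show()

spark.stop()
```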
The final choice of a Spark development language should weigh development efficiency, runtime efficiency, and the team's technology stack, in order to pick the language that best suits your team.
For resource configuration, see:
https://blog.csdn.net/Code_LT/article/details/123737940
For architecture details, refer to: