pyspark 🍒🥭 is delicious, just eat it! 😋😋

Overview

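How do you eat pyspark in 10 days? 🔥 🔥

"Eat PySpark in 10 Days" (《10天吃掉那只pyspark》)

"Eat PyTorch in 20 Days" (《20天吃掉那只Pytorch》)

"Eat TensorFlow2 in 30 Days" (《30天吃掉那只TensorFlow2》)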

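I. pyspark 🍎 or spark-scala 🔥 ?

pyspark is stronger for analysis; spark-scala is stronger for engineering.

If your application has very demanding performance requirements, choose spark-scala.

If your application relies heavily on visualization and machine learning algorithms, pyspark is recommended, because it works better with the relevant Python libraries.

In addition, spark-scala supports the Spark GraphX graph computation module, whereas pyspark does not.


pyspark has a gentle learning curve; spark-scala has a steep one.

The spark-scala learning curve is steep not only because Scala is a difficult language, but even more because endless environment configuration pain awaits the reader further down the road.

pyspark is comparatively cheap to learn, and its environment is comparatively easy to configure. If the learning cost of pyspark is 3, the learning cost of spark-scala is roughly 9.

If you learn quickly and have plenty of study time, spark-scala is recommended: it unlocks all of Spark's capabilities and gives the best performance, and it is also the most common way Spark is used in industry.

If your study time is limited and you are fond of Python, pyspark is recommended. pyspark is also becoming increasingly common in industry.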


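II. Intended readers of this book 📚 🤗

This book assumes that readers have basic Python coding skills and are familiar with the basic usage of the numpy and pandas libraries.

It also assumes some experience with SQL and familiarity with SQL syntax such as select, join, and group by.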
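As a quick self-check of the SQL prerequisite, the sketch below runs one query that combines select, join, and group by. It uses Python's built-in sqlite3 module rather than Spark, and the `students` and `classes` tables and their rows are made up purely for illustration.

```python
import sqlite3

# In-memory database; the tables and rows are hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE classes (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, class_id INTEGER);
INSERT INTO classes VALUES (1, 'spark'), (2, 'python');
INSERT INTO students VALUES (1, 'ann', 1), (2, 'bob', 1), (3, 'cat', 2);
""")

# select + join + group by: count the students in each class
rows = conn.execute("""
SELECT c.title, COUNT(*) AS n_students
FROM students AS s
JOIN classes AS c ON s.class_id = c.id
GROUP BY c.title
ORDER BY c.title
""").fetchall()
conn.close()

print(rows)  # [('python', 1), ('spark', 2)]
```

If this query reads naturally to you, the SQL used in the SparkSQL chapters should feel familiar.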

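Readers whose Python fundamentals are not very solid can refer to the article "Python Basics in 3 Hours" (《3小时Python入门》).

"Python Basics in 3 Hours" (《3小时Python入门》)

Readers who are not familiar with numpy and pandas can refer to the article "Getting Started with numpy, pandas, matplotlib in 3 Hours" (《3小时入门numpy,pandas,matplotlib》).

"Getting Started with numpy, pandas, matplotlib in 3 Hours" (《3小时入门numpy,pandas,matplotlib》)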


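III. Writing style of this book 🍉

This book is a pyspark primer and reference that is extremely friendly to human users. "Don't let me think" is its highest aim.

It was compiled mainly from the official Spark documentation, combined with the author's own experience of learning and using Spark.

Unlike the official Spark documentation, which is verbose and has fragmented code, this book has been heavily optimized in its chapter structure and choice of examples, and is more user-friendly.

The content is organized by difficulty, by how readers tend to look things up, and by Spark's own layered structure. It proceeds step by step with a clear hierarchy, so that examples are easy to find by function.

The examples are kept as minimal and structured as possible to improve readability and generality; most code snippets can be used as-is in practice.

If the difficulty of mastering pyspark from the official Spark documentation is about 5, the difficulty of mastering it from this book should be about 2.

The figure below compares the official Spark documentation with this book, "Eat PySpark in 10 Days".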


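IV. Study plan for this book

1. Study schedule

The author wrote this book in about one month of spare time after work; most readers should be able to learn it fully in 10 days.

The expected study time is between 30 minutes and 2 hours per day.

Of course, the book also works well as a pyspark handbook, serving as an example library during engineering work.

Click a blue title in the Content column to go to that chapter.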

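| Date | Content | Difficulty | Estimated time | Update status |
|---|---|---|---|---|
| | Part 1: Basics | | | |
| day1 | 1-1, Quickly set up your Spark development environment | ⭐️ ⭐️ | 1 hour | |
| day2 | 1-2, Understand the basic principles of Spark in 1 hour | ⭐️ ⭐️ ⭐️ | 1 hour | |
| | Part 2: Core | | | |
| day3 | 2-1, Get started with Spark RDD programming in 2 hours | ⭐️ ⭐️ ⭐️ | 2 hours | |
| day4 | 2-2, 7 RDD programming exercises | ⭐️ ⭐️ ⭐️ | 1 hour | |
| day5 | 2-3, Get started with SparkSQL programming in 2 hours | ⭐️ ⭐️ ⭐️ | 2 hours | |
| day6 | 2-4, 7 SparkSQL programming exercises | ⭐️ ⭐️ ⭐️ | 1 hour | |
| | Part 3: Advanced | | | |
| day7 | 3-1, Spark performance tuning methods | ⭐️ ⭐️ ⭐️ ⭐️ ⭐️ | 2 hours | |
| day8 | 3-2, Combined applications of RDD and SparkSQL | ⭐️ ⭐️ ⭐️ ⭐️ ⭐️ | 2 hours | |
| | Part 4: Extensions | | | |
| day9 | 4-1, Exploring MLlib machine learning | ⭐️ ⭐️ ⭐️ ⭐️ | 2 hours | |
| day10 | 4-2, A first look at StructuredStreaming | ⭐️ ⭐️ ⭐️ ⭐️ | 2 hours | |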
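To see what the schedule adds up to, the short sketch below sums the estimated hours from the table. The numbers are copied from the table above; the variable names are only illustrative.

```python
# (day, section, estimated hours), copied from the study schedule table
plan = [
    ("day1", "1-1", 1),
    ("day2", "1-2", 1),
    ("day3", "2-1", 2),
    ("day4", "2-2", 1),
    ("day5", "2-3", 2),
    ("day6", "2-4", 1),
    ("day7", "3-1", 2),
    ("day8", "3-2", 2),
    ("day9", "4-1", 2),
    ("day10", "4-2", 2),
]

total_hours = sum(hours for _, _, hours in plan)
average_hours = total_hours / len(plan)

print(total_hours)    # 16
print(average_hours)  # 1.6
```

The plan therefore comes to roughly 16 hours in total, or about 1.6 hours per day on average.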

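2. Study environment

All source code in this book was written and tested in jupyter. It is recommended to clone the repository locally with git and run it interactively in jupyter.

To open markdown files directly in jupyter, it is recommended to install jupytext, which converts markdown files into ipynb files.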
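The conversion can also be scripted. The sketch below is one possible way to do it with jupytext's Python API (`jupytext.read` and `jupytext.write`), assuming jupytext has been installed with pip. The helper names and the file name `chapter.md` are placeholders, not files or functions from this repository.

```python
from pathlib import Path


def notebook_path_for(markdown_path):
    """Return the .ipynb path that sits next to the given markdown file."""
    return Path(markdown_path).with_suffix(".ipynb")


def markdown_to_notebook(markdown_path):
    """Convert one markdown file to a notebook with jupytext and return the new path."""
    import jupytext  # third-party; imported lazily so the helper above works without it

    notebook = jupytext.read(str(markdown_path))
    target = notebook_path_for(markdown_path)
    jupytext.write(notebook, str(target))
    return target


# "chapter.md" is a placeholder name used only for illustration
print(notebook_path_for("chapter.md"))  # chapter.ipynb
```

Calling `markdown_to_notebook("chapter.md")` on an existing file writes `chapter.ipynb` next to it. On the command line, `jupytext --to ipynb chapter.md` should do the same job.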

For simplicity, this book sets up a single-machine spark3.0.1 environment for practice in the following two steps.

step1: Install java8

JDK download: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Java installation tutorial: https://www.runoob.com/java/java-environment-setup.html

step2: Install pyspark and findspark

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pyspark

pip install findspark
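After the two steps, a quick check like the sketch below can confirm that everything is visible from Python. It uses only the standard library; `check_environment` is an illustrative helper name, and the check only tells you whether a `java` executable is on the PATH and whether the two packages can be found, not which versions are installed.

```python
import importlib.util
import shutil


def check_environment():
    """Report whether java, pyspark, and findspark can be found from this interpreter."""
    return {
        "java": shutil.which("java") is not None,
        "pyspark": importlib.util.find_spec("pyspark") is not None,
        "findspark": importlib.util.find_spec("findspark") is not None,
    }


status = check_environment()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Anything reported as missing points back to the corresponding step above.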

Alternatively, you can run pyspark directly in a kesci cloud notebook:

https://www.kesci.com/home/project

import findspark

# Specify spark_home and the Python interpreter path (adjust both to your own machine)
spark_home = "/Users/liangyun/anaconda3/lib/python3.7/site-packages/pyspark"
python_path = "/Users/liangyun/anaconda3/bin/python"
findspark.init(spark_home, python_path)

import pyspark
from pyspark import SparkContext, SparkConf

# Run Spark locally with 4 worker threads
conf = SparkConf().setAppName("test").setMaster("local[4]")
sc = SparkContext(conf=conf)

print("spark version:", pyspark.__version__)
rdd = sc.parallelize(["hello", "spark"])
print(rdd.reduce(lambda x, y: x + ' ' + y))

Output:

spark version: 3.0.1
hello spark
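The two paths above are specific to the author's machine. As a possible alternative to hard-coding them, the sketch below derives both values from the running interpreter, assuming pyspark was installed with pip into that same interpreter. `locate_spark_paths` is an illustrative helper, not part of findspark or pyspark.

```python
import importlib.util
import sys


def locate_spark_paths():
    """Return (spark_home, python_path) for the current interpreter.

    spark_home is the directory of the pip-installed pyspark package,
    or None when pyspark cannot be found.
    """
    python_path = sys.executable
    spec = importlib.util.find_spec("pyspark")
    if spec is None or not spec.submodule_search_locations:
        return None, python_path
    spark_home = list(spec.submodule_search_locations)[0]
    return spark_home, python_path


spark_home, python_path = locate_spark_paths()
print("spark_home:", spark_home)
print("python_path:", python_path)
```

If `spark_home` is not None, the two values can be passed to `findspark.init(spark_home, python_path)` in place of the literal strings.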

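Besides the method above, you can also refer to the other methods described in section 1-1.

1-1, Quickly set up your Spark development environment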


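V. Encouraging and contacting the author

If this book has helped you and you would like to encourage the author, remember to give this project a star ⭐️ and share it with your friends 😊!

If you would like to discuss the book's content further with the author, feel free to leave a message on the WeChat official account "算法美食屋". The author's time and energy are limited, so replies will be given as circumstances allow.

You can also reply with the keyword "spark加群" in the official account's chat to join the Spark and big data readers' group and discuss with others.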



Owner
lyhue1991