Crawler example: scraping Taobao product information
2022-04-21 13:52:00 【ddy-ddy】
Steps:
1. Fetch the page
2. Parse the information
3. Print the information
4. Add sorting (by price)
import requests
import re
import numpy as np
def getHtml(url):  # fetch the HTML text of a page
    try:
        headers = {
            'authority': 's.taobao.com',
            'cache-control': 'max-age=0',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36',
            'sec-fetch-dest': 'document',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'sec-fetch-site': 'none',
            'sec-fetch-mode': 'navigate',
            'sec-fetch-user': '?1',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cookie': '_m_h5_tk=020ac4fcce054eead14bacf76bfbc623_1582381154052; _m_h5_tk_enc=2b1a7635af7e28b6820d4f178c0e484f; _samesite_flag_=true; cookie2=124a2afafe5316349cadcd3116b7efc9; t=77151394e926eb32ee8d4fee86c7da65; _tb_token_=e541eb335eee1; enc=dnu27IvZtbdvTKaJNEFu%2BHkhasRLM4%2FLorzH3oM%2BxBePPMnViDGMNZvfzoP4moTKaRdkdB27Ue6dRhKNQT985oA1ljB3hDG1PtMl7SpGt8U%3D; hng=CN%7Czh-CN%7CCNY%7C156; thw=cn; cna=dHTNFad14hUCAXOWDj8Lg4aD; uc3=nk2=F5RDKXAkqEy0Lrxn&lg2=U%2BGCWk%2F75gdr5Q%3D%3D&id2=UUphy%2FZ8VX%2Fq%2BHpq%2Fg%3D%3D&vt3=F8dBxd3xJ3ijobg3UpU%3D; csg=7b09f583; lgc=tb6550145519; dnk=tb6550145519; skt=7e7e674db83634bc; existShop=MTU4MjQyNDI4MA%3D%3D; uc4=nk4=0%40FY4I6gSsveSoZYl8XlKn9DMLv%2Bh14BU%3D&id4=0%40U2grEJGD2RFYxg7Cc6xCIkfKE51bNacH; tracknick=tb6550145519; _cc_=WqG3DMC9EA%3D%3D; tg=0; mt=ci=41_1; v=0; uc1=cookie14=UoTUOLcbM3M%2BUA%3D%3D&lng=zh_CN&cookie16=UIHiLt3xCS3yM2h4eKHS9lpEOw%3D%3D&existShop=false&cookie21=URm48syIYn73&tag=8&cookie15=W5iHLLyFOGW7aA%3D%3D&pas=0; JSESSIONID=A611CB2AF5454EE149ED367B4E1B11E5; isg=BJWVwEpoZ4ayb0N5nphhQ2NJpJdPkkmkU66Lyxc6UYxbbrVg3-JZdKMsOHJY9WFc; l=dBQYwHWnQsJIzwBbBOCanurza77OSIRYYuPzaNbMi_5K96T6a8_Oo-rmeF96VjW5TJ8B429W17p9-etXZrfTg2--g3fz4evIpfLH4',
        }
        r = requests.get(url, timeout=30, headers=headers)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""
def parseData(items, html):  # extract prices and titles from the page
    # Capture groups return the quoted values directly, so eval() is not needed
    prices = re.findall(r'"view_price":"([\d.]+)"', html)  # product prices
    titles = re.findall(r'"raw_title":"(.*?)"', html)  # product titles
    if not prices:
        print("Failed to extract data")
    for price, title in zip(prices, titles):
        items.append([price, title])
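def printData(items):  # print the data, sorted by price
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    # Sort on a numeric array: an array mixing floats and strings is cast to strings
    prices = np.array([float(item[0]) for item in items])
    order = np.argsort(prices, kind="stable")
    for count, idx in enumerate(order, start=1):
        price, title = items[idx]
        print(tplt.format(count, float(price), title))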
def main():
    goods = 'apple watch 5'  # product to search for
    depth = 2  # number of result pages to fetch
    start_url = 'https://s.taobao.com/search?q=' + goods
    infolist = []
    for i in range(depth):
        url = start_url + '&s=' + str(44 * i)  # s is the item offset, 44 per page
        html = getHtml(url)
        parseData(infolist, html)
    printData(infolist)


if __name__ == "__main__":
    main()
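To see what the two regular expressions in parseData pick up without hitting the network, here is a minimal sketch run on a made-up fragment. SAMPLE_HTML and extract_items are hypothetical names used only for illustration, and the patterns are modelled on the ones above, with capture groups in place of eval.

```python
import re

# Hypothetical fragment imitating the JSON embedded in a search result page
SAMPLE_HTML = (
    '{"raw_title":"Apple Watch Series 5 GPS","view_price":"2999.00"},'
    '{"raw_title":"Watch band: sport loop","view_price":"35.50"}'
)


def extract_items(html):
    """Return [price, title] pairs, using capture groups instead of eval()."""
    prices = re.findall(r'"view_price":"([\d.]+)"', html)
    titles = re.findall(r'"raw_title":"(.*?)"', html)
    return [[float(p), t] for p, t in zip(prices, titles)]


items = extract_items(SAMPLE_HTML)
for price, title in items:
    print(price, title)
```

The second title contains a colon, which a split(":") would cut short; the capture group keeps it whole.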
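Problems encountered
1. Taobao has anti-scraping measures
Solution: replace the request headers with the ones the browser actually sends. Copy the search request from the browser's developer tools as a cURL command and convert it to Python requests code.
Tool: https://curl.trillworks.com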
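The cookie is the longest part of the copied headers and typically the part that has to be refreshed, so it can help to keep it apart from the rest. The sketch below splits a raw cookie string into a dict; parse_cookie_string is a hypothetical helper and the cookie values are made up.

```python
def parse_cookie_string(raw):
    """Split a 'name=value; name2=value2' cookie header into a dict."""
    cookies = {}
    for part in raw.split(";"):
        # Split on the first '=' only, since values may contain '=' themselves
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty or malformed pieces
            cookies[name] = value
    return cookies


# Made-up cookie values for illustration only
raw_cookie = "thw=cn; hng=CN%7Czh-CN%7CCNY%7C156; uc3=nk2=abc&lg2=xyz%3D%3D"
cookies = parse_cookie_string(raw_cookie)
print(cookies)
```

The resulting dict can be passed to requests.get through its cookies parameter instead of embedding the string in headers.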
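2. Converting between int and float
A scraped price is a string such as "1.23". Calling int() on it directly raises ValueError, so convert it to float first:
a = "1.23"
b = int(a)         # ValueError: invalid literal for int()
c = int(float(a))  # 1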
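When the value is a string, as scraped prices are, the difference between the two conversions shows up immediately. A small sketch, with price_to_int as a hypothetical helper name:

```python
def price_to_int(text):
    """Convert a price string such as '299.00' to an int, dropping the fraction."""
    return int(float(text))


try:
    int("299.00")  # int() does not accept a string with a decimal point
except ValueError as exc:
    print("int() failed:", exc)

print(price_to_int("299.00"))
```

Note that int() truncates rather than rounds, so "35.99" becomes 35.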
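3. Sorting
Note: if the prices are still strings, sort compares them as text, so "1000" comes before "299". Convert them to float before sorting.
Method 1: convert the prices to a NumPy array and sort by it (items is the list of [price, title] pairs)
order = np.argsort(np.array([float(item[0]) for item in items]))
newlist = [items[i] for i in order]
An np.array built from rows that mix floats and strings is cast to strings, so np.lexsort(new_list[:,::-1].T) on such an array still orders the prices as text.
Method 2: convert to a dictionary and iterate over the sorted keys (products with the same price overwrite each other)
price_to_title = dict(items)
for price in sorted(price_to_title):
    print(price, price_to_title[price])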
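The difference between text order and numeric order is easy to check on a few made-up rows. This sketch uses the built-in sorted with a key function, which is a third option rather than one of the two methods above; the sample data is invented.

```python
# Made-up [price, title] rows with prices still stored as strings
items = [["1000.00", "watch A"], ["299.00", "watch B"], ["35.50", "strap C"]]

# Comparing the raw strings orders them character by character
as_text = sorted(items, key=lambda item: item[0])
# Converting to float in the key gives the numeric order
as_number = sorted(items, key=lambda item: float(item[0]))

print([price for price, _ in as_text])    # ['1000.00', '299.00', '35.50']
print([price for price, _ in as_number])  # ['35.50', '299.00', '1000.00']
```

Unlike the dictionary approach, this keeps products that share the same price.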
Copyright notice
This article was written by [ddy-ddy]. Please include a link to the original when reposting. Thanks.
https://blog.csdn.net/weixin_45314989/article/details/104465706