Scraping book information from Dangdang with Scrapy
2022-08-08 21:05:00 【大脸猿】
This run only scrapes the books returned by a Dangdang search for "Python".
scrapy startproject ddbook
(cd ddbook/ddbook)
scrapy genspider -t basic book dangdang.com
Then open book.py
# http://search.dangdang.com/?key=Python&act=input&page_index=1#J_tab
# http://search.dangdang.com/?key=Python&act=input&page_index=2#J_tab
# 100 pages in total
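The two URLs above differ only in page_index, so the full list of result pages can be generated rather than typed out. A minimal standalone sketch, assuming the query parameters stay exactly as in the URLs above; the function name build_search_urls and the BASE_URL constant are made up for illustration and are not part of the project:

```python
from urllib.parse import urlencode

# Search endpoint taken from the URLs noted above.
BASE_URL = "http://search.dangdang.com/"


def build_search_urls(keyword, first_page=1, last_page=100):
    """Return one search URL per result page, first_page..last_page inclusive."""
    urls = []
    for page in range(first_page, last_page + 1):
        # urlencode keeps the parameters in insertion order and escapes the keyword.
        query = urlencode({"key": keyword, "act": "input", "page_index": page})
        urls.append(f"{BASE_URL}?{query}#J_tab")
    return urls


if __name__ == "__main__":
    for url in build_search_urls("Python", 1, 2):
        print(url)
```

The trailing #J_tab is a URL fragment, which the browser keeps to itself and does not send to the server, so it is kept here only to match the URLs as copied from the address bar.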
# -*- coding: utf-8 -*-
import re

import scrapy
from scrapy.http import Request

from ..items import DdbookItem


class BookSpider(scrapy.Spider):
    name = 'book'
    allowed_domains = ['dangdang.com']
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
    # start_urls = ['http://dangdang.com/']

    def start_requests(self):
        # for i in range(1, 101):  # all 100 pages
        for i in range(1, 2):  # first page only while testing
            url = "http://search.dangdang.com/?key=Python&act=input&page_index=" + str(i) + "#J_tab"
            print(url)
            # Send the browser-like User-Agent with every request
            yield Request(url, headers=self.header, callback=self.parse)

    def parse(self, response):
        for j in range(1, 61):  # 60 books per page
            try:
                item = DdbookItem()
                # XPath prefix selecting the j-th book entry on the page
                base = "//ul[@class='bigimg']/li[@class='line" + str(j) + "']"
                print(base)
                # Each field falls back to '' when the node is missing
                item["title"] = response.xpath(base + "//a/@title").get(default='')
                item["author"] = response.xpath(base + "//a[@name='itemlist-author']/text()").get(default='')
                item["price"] = response.xpath(base + "//span[@class='search_now_price']/text()").get(default='')
                item["press"] = response.xpath(base + "//a[@name='P_cbs']/text()").get(default='')

                # Publication date, stored under the item key "data"
                item["data"] = ''
                alldata = response.xpath(base + "//span/text()").extract()
                if len(alldata) >= 2:
                    # The date is not always the seventh span text counting from the
                    # front, so index from the end instead (second-to-last entry)
                    data = alldata[-2]
                    # The raw text has a slash before the date, so keep only YYYY-MM-DD
                    match = re.search(r"\d{4}-\d{2}-\d{2}", data)
                    if match:
                        item["data"] = match.group(0)
                print(dict(item))
                yield item
            except Exception as e:
                print(e)
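The date step above depends on the publication date sitting in the second-to-last span text. A slightly more defensive variant, sketched here as a standalone function, scans the span texts from the end and returns the first string that contains a YYYY-MM-DD date. The name extract_date and the sample strings are hypothetical, chosen only to illustrate the idea; they are not taken from the real page.

```python
import re

# Same date pattern as in the spider, written as a raw string.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_date(span_texts):
    """Return the last YYYY-MM-DD date found in span_texts, or '' if none."""
    # Walk backwards because the date tends to sit near the end of the list.
    for text in reversed(span_texts):
        match = DATE_PATTERN.search(text)
        if match:
            return match.group(0)
    return ""


if __name__ == "__main__":
    # Made-up sample resembling a price, a slash-prefixed date and a trailing label.
    sample = ["¥89.00", " /2020-05-01", "label"]
    print(extract_date(sample))
```

Inside the spider this could replace the fixed index, for example item["data"] = extract_date(alldata), at the cost of picking up a different date if an entry ever contains more than one.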