Scraping a City's Weather Forecast with BeautifulSoup and with Scrapy
2022-08-08 21:05:00 【大脸猿】
Target site: China Weather (中国天气网) http://www.weather.com.cn
We will use Beijing as the example.
1. First, search for Beijing and open its city page:
http://www.weather.com.cn/weather/101010100.shtml?from=cityListCmp
Then analyze the structure of the page source.
BeautifulSoup
from urllib import request
from bs4 import BeautifulSoup
from bs4 import UnicodeDammit

url = "http://www.weather.com.cn/weather/101010100.shtml"
try:
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}
    req = request.Request(url, headers=headers)
    data = request.urlopen(req)
    data = data.read()  # fetch the raw bytes of the whole page
    # print(data)
    dammit = UnicodeDammit(data, ["utf-8", "gbk"])  # detect the page encoding
    data = dammit.unicode_markup
    soup = BeautifulSoup(data, "lxml")
    lis = soup.select("ul[class='t clearfix'] li")  # [tagName][attName[=value]]
    # print(lis)  # all matched li elements
    for li in lis:
        try:
            data1 = li.select("h1")[0].text                 # date
            weather = li.select("p[class='wea']")[0].text   # weather description
            tem = li.select("p[class='tem']")[0].text       # temperature
            print(data1 + " " + weather + " " + tem + "\n")
        except Exception as e1:
            print(e1)
except Exception as e2:
    print(e2)
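To see what those select() calls actually match, here is a minimal stand-in for the forecast markup. The element and class names are taken from the selectors above; the date, weather text, and temperatures are invented for illustration, and the stdlib html.parser is used instead of lxml so the snippet has no extra dependency:

```python
from bs4 import BeautifulSoup

# Invented sample mimicking one <li> of the weather.com.cn 7-day list.
html = """
<ul class="t clearfix">
  <li>
    <h1>8日（今天）</h1>
    <p class="wea">多云</p>
    <p class="tem"><span>30</span>/<i>22℃</i></p>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
for li in soup.select("ul[class='t clearfix'] li"):
    day = li.select("h1")[0].text                # "8日（今天）"
    weather = li.select("p[class='wea']")[0].text  # "多云"
    tem = li.select("p[class='tem']")[0].text      # "30/22℃"
    print(day, weather, tem)
```

The same three selectors used against the live page yield one line per forecast day.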
scrapy
(other project-setup steps omitted)
# -*- coding: utf-8 -*-
import scrapy
from ..items import TqpcItem
from scrapy.http import Request

class TqSpider(scrapy.Spider):
    name = 'tq'
    allowed_domains = ['weather.com.cn']
    # start_urls = ['http://www.weather.com.cn/']
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"}

    def start_requests(self):
        url = "http://www.weather.com.cn/weather/101010100.shtml"
        yield Request(url, headers=self.header, callback=self.parse)

    def parse(self, response):
        # one <li> per forecast day in the 7-day list
        for day in response.xpath("//ul[@class='t clearfix']/li"):
            item = TqpcItem()
            item["day"] = day.xpath("./h1/text()").extract_first()               # date
            item["w1"] = day.xpath("./p[@class='wea']/text()").extract_first()   # weather
            item["w2"] = day.xpath("./p[@class='tem']/span/text()").extract_first()  # high temperature
            print(item["day"], item["w1"], item["w2"])
            yield item
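With the project in place (the project name tqpc is inferred from the TqpcItem import; this command is part of the omitted steps), the spider is run from the project root, and `-o` can optionally dump the yielded items to a file:

```shell
scrapy crawl tq -o weather.json
```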