pyppeteer crawler
2022-04-23 05:46:00 【圆滚滚的程序员】
import asyncio
from collections import namedtuple

import pyppeteer

# UA is a user-agent string; presumably it comes from a local user_agents module
from user_agents import UA

Response = namedtuple("rs", "title url html cookies headers history status")


async def get_html(url, timeout=30):
    timeout_ms = int(timeout * 1000)
    browser = await pyppeteer.launch(headless=True, args=['--no-sandbox'])
    try:
        page = await browser.newPage()
        await page.setUserAgent(UA)
        res = await page.goto(url, options={'timeout': timeout_ms})
        # Wait for the target element to appear, with a timeout,
        # instead of querying it in an unbounded while loop
        await page.waitForSelector('.share-box', {'timeout': timeout_ms})
        # Scroll down by one viewport height
        await page.evaluate('window.scrollBy(0, window.innerHeight)')
        data = await page.content()
        title = await page.title()
        resp_cookies = await page.cookies()
        return Response(
            title=title,
            url=url,
            html=data,
            cookies=resp_cookies,
            headers=res.headers,
            history=None,
            status=res.status,
        )
    finally:
        # Always shut down the Chromium process, even if a step above fails
        await browser.close()


async def main(url_list):
    # Fetch all pages concurrently; results keep the order of url_list
    return await asyncio.gather(*(get_html(url) for url in url_list))


if __name__ == '__main__':
    url_list = [
        "http://gxt.hunan.gov.cn//gxt/xxgk_71033/czxx/201005/t20100528_2069234.html",
        "http://gxt.hunan.gov.cn//gxt/xxgk_71033/czxx/201005/t20100528_2069221.html",
        "http://gxt.hunan.gov.cn//gxt/xxgk_71033/czxx/200811/t20081111_2069210.html"
    ]
    results = asyncio.run(main(url_list))
    for res in results:
        print(res.title)
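
The script above starts one headless browser per URL and runs all of them at once, which can get heavy for a long URL list. One possible way to cap the number of pages fetched at the same time is an `asyncio.Semaphore`. The sketch below is only an illustration under assumptions: `gather_limited` and `fake_fetch` are hypothetical names that are not part of the original post, and `fake_fetch` stands in for `get_html` so the example runs without a browser.

```python
import asyncio


async def gather_limited(coro_funcs, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(func):
        # Only `limit` callers can hold the semaphore at the same time
        async with semaphore:
            return await func()

    # gather returns results in the same order as coro_funcs
    return await asyncio.gather(*(run_one(f) for f in coro_funcs))


# Hypothetical stand-in for get_html; it also tracks peak concurrency
in_flight = 0
peak = 0


async def fake_fetch(url):
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)  # simulate page load time
    in_flight -= 1
    return url.upper()


urls = ["a", "b", "c", "d", "e"]
results = asyncio.run(
    gather_limited([lambda u=u: fake_fetch(u) for u in urls], limit=2)
)
print(results, "peak concurrency:", peak)
```

To apply the same idea to the crawler, the list passed to `gather_limited` would be built from `get_html` calls instead of `fake_fetch`; a suitable limit depends on the machine and the target site.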
Copyright notice
This article was written by [圆滚滚的程序员]. Please include a link to the original when reposting. Thank you.
https://blog.csdn.net/qq_39483957/article/details/107998043