Scraping actor names and links
2022-04-22 04:37:00 【tonyaqiqi】
import json
import requests
from bs4 import BeautifulSoup
# Fetch a page and return its text, or None if the request fails
def getHTMLText(url, kv):
    try:
        r = requests.get(url, headers=kv)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)
        return None
# Parse the actor names and links out of the page and save them to a file
def parserData(text):
    soup = BeautifulSoup(text, 'lxml')
    # The script looks for <dt> tags inside <li class="pages"> elements
    review_list = soup.find_all('li', {'class': 'pages'})
    stars = []
    for li in review_list:
        for dt in li.find_all('dt'):
            a = dt.find('a')
            # Skip entries that have no link
            if a is None or not a.get('href'):
                continue
            print(a.text)
            stars.append({
                "name": a.text,
                "link": 'https://baike.baidu.com' + a.get('href'),
            })
    print(len(stars))
    # Dump the list directly; rewriting str(stars) into JSON breaks on names that contain quotes
    with open('zhifou.json', 'w', encoding='UTF-8') as f:
        json.dump(stars, f, ensure_ascii=False)
if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'
    }
    url = 'https://baike.baidu.com/item/%E7%9F%A5%E5%90%A6%E7%9F%A5%E5%90%A6%E5%BA%94%E6%98%AF%E7%BB%BF%E8%82%A5%E7%BA%A2%E7%98%A6/20485668?fr=aladdin'
    text = getHTMLText(url, headers)
    # getHTMLText returns None when the request fails, so only parse on success
    if text is not None:
        parserData(text)
        print("All data scraped!")
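A note on the save step: when a list of dicts is written out as JSON, handing the list straight to the json module is safer than converting Python's repr into JSON by swapping quote characters. The sketch below illustrates the difference under stated assumptions: the sample names and links are hypothetical placeholders (not data from the Baidu Baike page), and the two helper functions are introduced here only for illustration.

```python
import json

# Hypothetical sample data; the second name contains an apostrophe on purpose.
stars = [
    {"name": "Actor A", "link": "https://baike.baidu.com/item/placeholder-a"},
    {"name": "Actor O'Brien", "link": "https://baike.baidu.com/item/placeholder-b"},
]


def to_json_via_repr(items):
    """Fragile: rewrite Python's repr into JSON by swapping quote characters."""
    return json.loads(str(items).replace("'", '"'))


def to_json_direct(items):
    """Robust: let the json module handle quoting and escaping."""
    return json.dumps(items, ensure_ascii=False)


try:
    to_json_via_repr(stars)
    repr_ok = True
except json.JSONDecodeError:
    # The apostrophe in the name becomes a stray double quote and breaks parsing.
    repr_ok = False

round_trip = json.loads(to_json_direct(stars))

print(repr_ok)              # False
print(round_trip == stars)  # True
```

With `ensure_ascii=False`, Chinese names are written as readable characters rather than `\uXXXX` escapes, which matches how the script opens its output file with UTF-8 encoding.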
Copyright notice
This article was written by [tonyaqiqi]. Please include a link to the original when reposting. Thanks.
https://blog.csdn.net/wtyttdy/article/details/116208142