- Python Web Crawler Technology and Practice
- By Zhao Guosheng and Wang Jian
3.6.3 Scraping the Douban Books TOP500
The following example uses lxml together with requests to scrape the Douban Books TOP500 list, giving hands-on practice with lxml's overall workflow:
1) Use XPath to locate the data to be scraped;
2) Extract the data and save it to a CSV file.
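Step 1 can be tried in isolation before running the full crawler. The sketch below parses a small static HTML snippet shaped like one row of the Douban list (the snippet itself is invented for illustration) and locates the title and link with the same XPath expressions used later:

```python
# Minimal sketch of step 1: locating data with XPath on a static
# HTML snippet. The snippet mimics one <tr class="item"> row of the
# Douban page; its content is made up for illustration.
from lxml import etree

html = """
<table>
  <tr class="item">
    <td>
      <div><a href="https://book.douban.com/subject/1/" title="Demo Book"></a></div>
      <p class="pl">Author / Publisher / 2020-1 / 29.00</p>
      <div><span class="allstar45"></span><span class="rating_nums">9.0</span></div>
    </td>
  </tr>
</table>
"""

selector = etree.HTML(html)
# Find every row, then query relative to each row
rows = selector.xpath('//tr[@class="item"]')
for row in rows:
    title = row.xpath('td/div/a/@title')[0]
    link = row.xpath('td/div/a/@href')[0]
    print(title, link)
```

Querying relative to each row (note the path has no leading `//`) keeps each book's fields together even when the page contains many rows.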
[Example 3-43] Scraping the Douban Books TOP500
import csv
import requests
from lxml import etree

# Create the CSV file and write the header row.
# utf_8_sig adds a BOM so Excel displays Chinese text correctly;
# newline='' prevents blank lines between rows on Windows.
fp = open('D:/h.csv', 'w', newline='', encoding='utf_8_sig')
writer = csv.writer(fp)
writer.writerow(('Title', 'URL', 'Author', 'Publisher', 'Date', 'Price', 'Rating', 'Comment'))

# Build the URL of every page of the list (25 books per page, 10 pages)
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]

# Request headers, so the request looks like it comes from a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

# Loop over the pages
for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # Select the enclosing element of each book, then loop over the books
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        name = info.xpath('td/div/a/@title')[0]
        link = info.xpath('td/div/a/@href')[0]  # 'link', not 'url', to avoid shadowing the loop variable
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]
        publisher = book_infos.split('/')[-3]
        date = book_infos.split('/')[-2]
        price = book_infos.split('/')[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/p/span/text()')
        comment = comments[0] if len(comments) != 0 else 'None'
        # Write one row of data
        writer.writerow((name, link, author, publisher, date, price, rate, comment))
fp.close()
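The field-splitting logic above deserves a closer look: the author is taken from the front of the info line, while publisher, date, and price are taken with negative indices from the end, so extra '/'-separated fields in the middle (translators, original authors) do not shift them. The sample string below is invented for illustration:

```python
# Sketch of the field-splitting logic used in the example.
# The sample string mimics the "author / publisher / date / price"
# info line on the page; the text itself is made up.
book_infos = "Cao Xueqin / Gao E / People's Literature Publishing House / 1996-12 / 59.70"

parts = [p.strip() for p in book_infos.split('/')]
author = ' / '.join(parts[:-3])  # everything before the last three fields
publisher = parts[-3]            # negative indices count from the end,
date = parts[-2]                 # so extra middle fields don't break them
price = parts[-1]
```

Note that the example in the listing uses `split('/')[0]` for the author, which keeps only the first name when a book has several contributors; joining everything before the last three fields, as sketched here, is a more tolerant variant.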
The output is shown in Figure 3-6.
Figure 3-6 Scraping result
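Step 2 (writing the CSV file) can also be exercised without touching the network or the disk. The sketch below writes a header and one invented row to an in-memory buffer, then reads it back to verify the round trip; on disk, the same `csv.writer` calls work against the file opened in the example:

```python
# Minimal sketch of step 2: writing rows with csv.writer and reading
# them back to verify. io.StringIO stands in for the on-disk file;
# the row content is invented for illustration.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(('Title', 'URL', 'Author', 'Publisher', 'Date', 'Price', 'Rating', 'Comment'))
writer.writerow(('Demo Book', 'https://book.douban.com/subject/1/',
                 'Author', 'Publisher', '2020-1', '29.00', '9.0', 'A fine book'))

# Read the buffer back with csv.reader to confirm the structure
buf.seek(0)
rows = list(csv.reader(buf))
print(len(rows), rows[1][0])
```

When writing to a real file, remember the two arguments used in the example: `encoding='utf_8_sig'` (a BOM so Excel renders Chinese text correctly) and `newline=''` (so the csv module, not the file object, controls line endings).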