3.6.3 Crawling the Douban Books Top 250

The following example uses lxml together with requests to crawl the Douban Books Top 250 list, giving hands-on practice with lxml in a complete workflow:

1) Use XPath to locate the data to be extracted;

2) Extract the data and save it to a CSV file.
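As a minimal sketch of step 1, an XPath query with lxml looks like this. The HTML snippet below is a made-up fragment that mirrors the structure of one book entry on the Top 250 page:

```python
from lxml import etree

# Made-up fragment shaped like one book row on the Top 250 page
snippet = ('<tr class="item"><td><div>'
           '<a href="https://book.douban.com/subject/1/" title="活着"></a>'
           '</div></td></tr>')
selector = etree.HTML(snippet)
# //tr[@class="item"] selects every book row; @title pulls the attribute value
titles = selector.xpath('//tr[@class="item"]/td/div/a/@title')
print(titles)
```

The same pattern scales to the real page: first select the enclosing `tr` elements, then run relative XPath expressions inside each one.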

[Example 3-43] Crawling the Douban Books Top 250


import csv
import requests
from lxml import etree

# Create the CSV file and write the header row.
# utf-8-sig adds a BOM so Excel on Windows displays Chinese correctly;
# newline='' prevents the csv module from writing blank rows on Windows.
fp = open(r'D:\h.csv', 'w', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)
writer.writerow(('书名', '地址', '作者', '出版社', '出版日期', '价格', '评分', '评价'))
# Build the URL of each page (10 pages of 25 books each)
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
# Loop over the pages
for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # Select the enclosing element of each book entry, then loop over them
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        name = info.xpath('td/div/a/@title')[0]
        link = info.xpath('td/div/a/@href')[0]
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]
        publisher = book_infos.split('/')[-3]
        date = book_infos.split('/')[-2]
        price = book_infos.split('/')[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/p/span/text()')
        comment = comments[0] if len(comments) != 0 else '空'
        # Write one row of data
        writer.writerow((name, link, author, publisher, date, price, rate, comment))
fp.close()
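The field extraction above relies on the publisher-info string being '/'-separated. A quick sketch with a made-up sample string shows why the author is taken from index 0 while publisher, date, and price are counted from the end, so any extra middle fields (such as a translator) are simply skipped; unlike the listing above, this sketch also calls `.strip()` to trim the surrounding spaces:

```python
# Made-up sample of the td/p/text() string for one book
book_infos = '余华 / 作家出版社 / 2012-8 / 20.00元'
fields = book_infos.split('/')
author = fields[0].strip()      # first field: author
publisher = fields[-3].strip()  # third from the end: publisher
date = fields[-2].strip()       # second from the end: publication date
price = fields[-1].strip()      # last field: price
print(author, publisher, date, price)
```

Counting from the end keeps the parsing stable whether or not a translator field appears between author and publisher.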

The result is shown in Figure 3-6.

Figure 3-6 Crawl result