3.6.3 Crawling the Douban Books Top 250

The following example uses lxml together with requests to crawl the Douban Books Top 250 list, giving hands-on practice with lxml in a complete workflow:

1) Use XPath to locate the data to be extracted;

2) Extract the data and save it to a CSV file.
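As a minimal sketch of step 1, an XPath query with lxml looks like this. The HTML snippet below is a made-up fragment that mirrors the structure of one book entry on the Top 250 page:

```python
from lxml import etree

# Made-up fragment shaped like one book row on the Top 250 page
snippet = ('<tr class="item"><td><div>'
           '<a href="https://book.douban.com/subject/1/" title="活着"></a>'
           '</div></td></tr>')
selector = etree.HTML(snippet)
# //tr[@class="item"] selects every book row; @title pulls the attribute value
titles = selector.xpath('//tr[@class="item"]/td/div/a/@title')
print(titles)
```

The same pattern scales to the real page: first select the enclosing `tr` elements, then run relative XPath expressions inside each one.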

[Example 3-43] Crawling the Douban Books Top 250


import csv
import requests
from lxml import etree

# Create the CSV file and write the header row.
# utf-8-sig adds a BOM so Excel on Windows displays Chinese correctly;
# newline='' prevents the csv module from writing blank rows on Windows.
fp = open(r'D:\h.csv', 'w', newline='', encoding='utf-8-sig')
writer = csv.writer(fp)
writer.writerow(('书名', '地址', '作者', '出版社', '出版日期', '价格', '评分', '评价'))
# Build the URL of each page (10 pages of 25 books each)
urls = ['https://book.douban.com/top250?start={}'.format(i) for i in range(0, 250, 25)]
# Request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}
# Loop over the pages
for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # Select the enclosing element of each book entry, then loop over them
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        name = info.xpath('td/div/a/@title')[0]
        link = info.xpath('td/div/a/@href')[0]
        book_infos = info.xpath('td/p/text()')[0]
        author = book_infos.split('/')[0]
        publisher = book_infos.split('/')[-3]
        date = book_infos.split('/')[-2]
        price = book_infos.split('/')[-1]
        rate = info.xpath('td/div/span[2]/text()')[0]
        comments = info.xpath('td/p/span/text()')
        comment = comments[0] if len(comments) != 0 else '空'
        # Write one row of data
        writer.writerow((name, link, author, publisher, date, price, rate, comment))
fp.close()
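The field extraction above relies on the publisher-info string being '/'-separated. A quick sketch with a made-up sample string shows why the author is taken from index 0 while publisher, date, and price are counted from the end, so any extra middle fields (such as a translator) are simply skipped; unlike the listing above, this sketch also calls `.strip()` to trim the surrounding spaces:

```python
# Made-up sample of the td/p/text() string for one book
book_infos = '余华 / 作家出版社 / 2012-8 / 20.00元'
fields = book_infos.split('/')
author = fields[0].strip()      # first field: author
publisher = fields[-3].strip()  # third from the end: publisher
date = fields[-2].strip()       # second from the end: publication date
price = fields[-1].strip()      # last field: price
print(author, publisher, date, price)
```

Counting from the end keeps the parsing stable whether or not a translator field appears between author and publisher.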

The result is shown in Figure 3-6.

Figure 3-6 Crawl result