Web Scraping: Page-Parsing Tools, re and XPath

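The re regular-expression module

Workflow for the re module

Method 1: call re.findall directly

    r_list = re.findall('regex', html, re.S)

Method 2: create a compiled pattern object

    pattern = re.compile('regex', re.S)
    r_list = pattern.findall(html)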

Regular-expression metacharacters: https://www.cnblogs.com/LXP-Never/p/9522475.html

Category            Metacharacters
Match characters    . [...] [^...] \d \D \w \W \s \S
Match repetition    * + ? {n} {m,n}
Match position      ^ $ \A \Z \b \B
Other               | () \
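Regular expressions that match any single character

    import re

    pattern = re.compile('.', re.S)    # Method 1: with re.S, '.' also matches a newline
    pattern = re.compile(r'[\s\S]')    # Method 2: whitespace or non-whitespace, i.e. any character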
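The difference between the two methods, and what happens without re.S, is easiest to see on a string that contains a newline. The following is a minimal sketch; the sample string and variable names are made up for illustration.

```python
import re

text = "a\nb"

# Without re.S, '.' does not match the newline character.
default_dot = re.findall('.', text)
# With re.S (DOTALL), '.' matches the newline as well.
dotall_dot = re.findall('.', text, re.S)
# [\s\S] matches any character, with or without flags.
char_class = re.findall(r'[\s\S]', text)

print(default_dot)  # ['a', 'b']
print(dotall_dot)   # ['a', '\n', 'b']
print(char_class)   # ['a', '\n', 'b']
```

Since HTML source usually spans many lines, one of these two forms is generally needed for a pattern to match across line breaks.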
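Greedy and non-greedy matching

Greedy matching (the default)

* Provided the whole expression still matches, the quantifiers * + ? match as much as possible
* Written as: .* .+ .?
Non-greedy matching

* Provided the whole expression still matches, the quantifiers * + ? match as little as possible
* Written as: .*? .+? .??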
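A short sketch of the two behaviours on the same input may help; the HTML fragment here is invented for the example.

```python
import re

html = '<p>first</p><p>second</p>'

# Greedy: .* runs to the last </p> that still lets the pattern match.
greedy = re.findall('<p>(.*)</p>', html)
# Non-greedy: .*? stops at the first </p> it can.
lazy = re.findall('<p>(.*?)</p>', html)

print(greedy)  # ['first</p><p>second']
print(lazy)    # ['first', 'second']
```

When extracting repeated items from a page, the non-greedy form is usually the one that returns one result per item.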
Regular-expression groups

Purpose: define sub-patterns inside the full pattern; findall then returns what each parenthesised sub-pattern matched.

    import re

    s = 'A B C D'
    p1 = re.compile(r'\w+\s+\w+')
    print(p1.findall(s))    # ['A B', 'C D']
    p2 = re.compile(r'(\w+)\s+\w+')
    print(p2.findall(s))    # ['A', 'C']
    p3 = re.compile(r'(\w+)\s+(\w+)')
    print(p3.findall(s))    # [('A', 'B'), ('C', 'D')]

    import re

    html = '''<div class="animal">
        <p class="name">
            <a title="Tiger"></a>
        </p>
        <p class="content">
            Two tigers two tigers run fast
        </p>
    </div>
    <div class="animal">
        <p class="name">
            <a title="Rabbit"></a>
        </p>
        <p class="content">
            Small white rabbit white and white
        </p>
    </div>'''

    # Two groups: the title attribute and the text of the content paragraph
    pattern = re.compile(
        '<div class="animal">.*?title="(.*?)".*?'
        'class="content">(.*?)</p>',
        re.S)
    r_list = pattern.findall(html)
    print(r_list)
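Summary of groups

* Wrap whatever content you want from the page in ( )
* The whole pattern is matched first; the content of each ( ) group is then extracted
* With two or more ( ) groups, each result is a tuple: [(), (), ()]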
XPath parsing

XPath is the XML Path Language, a language for locating parts of an XML document; it works equally well for querying HTML documents. Let's try querying some HTML with XPath, using the sample below.

    <ul class="book_list">
        <li>
            <title class="book_001">Harry Potter</title>
            <author>J K. Rowling</author>
            <year>2005</year>
            <price>69.99</price>
        </li>
        <li>
            <title class="book_002">Spider</title>
            <author>Forever</author>
            <year>2019</year>
            <price>49.99</price>
        </li>
    </ul>
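Matching demos

1. Find all li nodes
    //li
2. Among the title children of li nodes, find the one whose class attribute is 'book_001'
    //li/title[@class="book_001"]
3. Get the class attribute value of every title node under the li nodes
    //li//title/@class
Whenever a condition is involved, add []
Whenever an attribute value is wanted, add @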
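The three demo expressions can be tried without installing anything by using xml.etree.ElementTree from the standard library. This is only a sketch under one important assumption: ElementTree implements a subset of XPath, so paths are written relative to the root with a leading . and attribute values are read with .get() rather than /@class. The lxml library used elsewhere in this article accepts the expressions exactly as written above.

```python
import xml.etree.ElementTree as ET

html = """<ul class="book_list">
  <li>
    <title class="book_001">Harry Potter</title>
    <author>J K. Rowling</author>
    <year>2005</year>
    <price>69.99</price>
  </li>
  <li>
    <title class="book_002">Spider</title>
    <author>Forever</author>
    <year>2019</year>
    <price>49.99</price>
  </li>
</ul>"""

root = ET.fromstring(html)

# 1. //li : every li node
li_nodes = root.findall('.//li')

# 2. //li/title[@class="book_001"] : condition inside []
matched = root.findall('.//li/title[@class="book_001"]')

# 3. //li//title/@class : ElementTree cannot select attributes in a path,
#    so the attribute is read from each matched node instead.
classes = [title.get('class') for title in root.findall('.//li//title')]

print(len(li_nodes))    # 2
print(matched[0].text)  # Harry Potter
print(classes)          # ['book_001', 'book_002']
```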

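Selecting nodes

// : search among all nodes (children and deeper descendants alike)

@ : get an attribute value
    # Use case 1 (attribute value as a condition)
    //div[@class="movie"]
    # Use case 2 (get the attribute value directly)
    //div/a/@src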
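Matching multiple paths (or)

xpath expression 1 | xpath expression 2 | xpath expression 3

contains() : match nodes whose attribute value contains a given string
    # Find title nodes whose class attribute value contains "book_"
    //title[contains(@class,"book_")]
    # Match div nodes whose id contains the string qiushi_tag_
    //div[contains(@id,"qiushi_tag_")]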
text() : get a node's text content
    # Find the names of all books
    //ul[@class="book_list"]/li/title           # result: <element title at xxxx>
    //ul[@class="book_list"]/li/title/text()    # result: 'Harry Potter'
Exercise: XPath queries on Maoyan Movies: https://maoyan.com/board/4?offset=1

1. Get the dd nodes that hold each movie's information
    //dl[@class="board-wrapper"]/dd
2. Get the movie titles
    //dl[@class="board-wrapper"]/dd//p[@class="name"]/a/text()
3. Get the lead actors
    //dl[@class="board-wrapper"]/dd//p[@class="star"]/text()
4. Get the release times
    //dl[@class="board-wrapper"]/dd//p[@class="releasetime"]/text()
The lxml XPath parsing library

* Import the module: from lxml import etree
* Create the parse object: parse_html = etree.HTML(html)
* Call xpath on the parse object; for node, attribute, and text() queries like the ones in this article the result is always a list: r_list = parse_html.xpath('xpath expression')

    from lxml import etree

    html = """
    <div class="wrapper">
        <i class="iconfont icon-back" id="back"></i>
        <a href="/" id="channel">新浪社会</a>
        <ul id="nav">
            <li><a href="http://domestic.firefox.sina.com/" title="国内">国内</a></li>
            <li><a href="http://world.firefox.sina.com/" title="国际">国际</a></li>
            <li><a href="http://mil.firefox.sina.com/" title="军事">军事</a></li>
            <li><a href="http://photo.firefox.sina.com/" title="图片">图片</a></li>
            <li><a href="http://society.firefox.sina.com/" title="社会">社会</a></li>
            <li><a href="http://ent.firefox.sina.com/" title="娱乐">娱乐</a></li>
            <li><a href="http://tech.firefox.sina.com/" title="科技">科技</a></li>
            <li><a href="http://sports.firefox.sina.com/" title="体育">体育</a></li>
            <li><a href="http://finance.firefox.sina.com/" title="财经">财经</a></li>
            <li><a href="http://auto.firefox.sina.com/" title="汽车">汽车</a></li>
        </ul>
        <i class="iconfont icon-liebiao" id="menu"></i>
    </div>"""

    # Create the parse object once and reuse it for every query
    parse_html = etree.HTML(html)

    # Question 1: get the text of every a node
    r_list = parse_html.xpath('//a/text()')
    print(r_list)    # ['新浪社会', '国内', '国际', ...]

    # Question 2: get the href attribute value of every a node
    r_list = parse_html.xpath('//a/@href')
    print(r_list)    # ['/', 'http://domestic.firefox.sina.com/', 'http://world.firefox.sina.com/', ...]

    # Question 3: get the href attribute value of every a node, excluding /
    r_list = parse_html.xpath('//ul[@id="nav"]/li/a/@href')
    print(r_list)    # ['http://domestic.firefox.sina.com/', 'http://world.firefox.sina.com/', ...]

    # Question 4: get the link texts 国内, 国际, ..., excluding 新浪社会
    r_list = parse_html.xpath('//ul[@id="nav"]/li/a/text()')
    print(r_list)    # ['国内', '国际', ...]
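Maoyan Movies (xpath)

URL: Maoyan Movies - Rankings - Top 100 board: https://maoyan.com/board/4

Goal: movie title, lead actors, release time

Steps:

* Confirm that the page is static (right-click, view page source, then search for a keyword)
* Write the xpath expressions
* Write the program skeleton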
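xpath expressions

1. Base xpath: matches the list of node objects, one per movie

//dl[@class="board-wrapper"]/dd

2. Loop over the list and extract each movie's fields in turn
    for dd in dd_list:

    Any xpath run inside the loop must start with . , which stands for the current node

Movie title:
    dd.xpath('./a/@title')[0].strip()
Lead actors:
    dd.xpath('.//p[@class="star"]/text()')[0].strip()
Release time:
    dd.xpath('.//p[@class="releasetime"]/text()')[0].strip()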
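Indexing an xpath result with [0] raises IndexError whenever the query matches nothing. One way to guard against that is a small helper; first_or_none below is a hypothetical name introduced for this sketch, not something from lxml.

```python
def first_or_none(values):
    """Return the first item with surrounding whitespace removed,
    or None when the list is empty."""
    return values[0].strip() if values else None


# An xpath text() or attribute query returns a list of strings.
print(first_or_none(['  Tiger \n']))  # Tiger
print(first_or_none([]))              # None
```

Inside the loop this would be used as, for example, first_or_none(dd.xpath('./a/@title')).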
Full code:

    import random
    import time

    import requests
    from lxml import etree


    class MaoyanSpider(object):
        def __init__(self):
            self.page = 1  # page counter
            self.url = 'https://maoyan.com/board/4?offset={}'
            self.ua_list = [
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 '
                '(KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
                'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
                'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; '
                'Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; '
                '.NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
            ]

        # Fetch a page
        def get_page(self, url, retries=3):
            # Use a random User-Agent for every request
            headers = {'User-Agent': random.choice(self.ua_list)}
            try:
                res = requests.get(url, headers=headers, timeout=5)
                res.encoding = 'utf-8'
                self.parse_page(res.text)
            except Exception as e:
                print('Error:', e)
                # Retry a limited number of times so that a persistent failure
                # cannot recurse forever (the limit of 3 is an assumption)
                if retries > 1:
                    self.get_page(url, retries - 1)

        # Parse a page
        def parse_page(self, html):
            parse_html = etree.HTML(html)  # create the parse object
            # Base xpath: list of node objects, one per movie
            dd_list = parse_html.xpath('//dl[@class="board-wrapper"]/dd')
            if not dd_list:
                print('No Data')
                return
            # Loop over the nodes and extract the data
            for dd in dd_list:
                item = {}
                # xpath returns a list such as ['喜剧之王'], so take element 0 to get the string
                name_list = dd.xpath('.//p/a/@title')  # movie title
                item['name'] = name_list[0].strip() if name_list else None
                star_list = dd.xpath('.//p[@class="star"]/text()')  # lead actors
                item['star'] = star_list[0].strip() if star_list else None
                time_list = dd.xpath('.//p[@class="releasetime"]/text()')  # release time
                item['time'] = time_list[0].strip() if time_list else None
                print(item)

        # Entry point
        def main(self):
            for offset in range(0, 31, 10):
                url = self.url.format(offset)
                self.get_page(url)
                print('Page %d done' % self.page)
                time.sleep(random.randint(1, 3))
                self.page += 1


    if __name__ == '__main__':
        start = time.time()
        spider = MaoyanSpider()
        spider.main()
        end = time.time()
        print('Elapsed time: %.2f' % (end - start))
Lianjia second-hand housing example (xpath)

Confirm that the page is static

    Open the second-hand housing page -> view page source -> search for a keyword. If the keyword is found, the page is static.

xpath expressions

1. Base xpath expression (matches the list of nodes, one per listing)

//ul[@class="sellListContent"]/li[@class="clear LOGCLICKDATA"] |
//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]

2. xpath expressions applied to each listing while looping

* Name: './/a[@data-el="region"]/text()'
* Layout + area + orientation + renovation: info_list = li.xpath('.//div[@class="houseInfo"]/text()')[0].strip().split('|')
* Layout: info_list[1]
* Area: info_list[2]
* Orientation: info_list[3]
* Renovation: info_list[4]
* Floor: './/div[@class="positionInfo"]/text()'
* District: './/div[@class="positionInfo"]/a/text()'
* Total price: './/div[@class="totalPrice"]/span/text()'
* Unit price: './/div[@class="unitPrice"]/span/text()'
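The houseInfo text holds several fields separated by | and has to be split before indexing. A minimal sketch of that step follows; the sample string and the helper name parse_house_info are hypothetical, and the real page text may be laid out differently.

```python
def parse_house_info(raw):
    """Split a '|'-separated houseInfo string into named fields.

    Returns None for every field when the string does not have
    exactly five parts, so a malformed row cannot raise IndexError.
    """
    keys = ('model', 'area', 'direction', 'perfect')
    parts = [part.strip() for part in raw.strip().split('|')]
    if len(parts) != 5:
        return dict.fromkeys(keys)
    # parts[0] is the estate name; fields 1 to 4 follow the list above.
    return dict(zip(keys, parts[1:5]))


# Hypothetical sample text, invented for this example.
sample = ' Sample Estate | 2 bed 1 living | 88.5 sqm | South North | Renovated '
info = parse_house_info(sample)
print(info['model'])      # 2 bed 1 living
print(info['direction'])  # South North
print(parse_house_info('too | short'))
```

Checking the number of parts first keeps one oddly formatted listing from stopping the whole run.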
Implementation

    import random
    import time

    import requests
    from lxml import etree


    class LianjiaSpider(object):
        def __init__(self):
            self.url = 'https://bj.lianjia.com/ershoufang/pg{}/'
            self.blog = 1  # attempt counter for the current page
            self.ua_list = [
                'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 '
                '(KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1',
                'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0',
                'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; '
                'Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; '
                '.NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; InfoPath.3)',
            ]

        def get_html(self, url):
            headers = {'User-Agent': random.choice(self.ua_list)}
            # Try up to 3 times, then move on to the next page
            if self.blog <= 3:
                try:
                    # A timeout raises an exception, which the except branch
                    # catches before calling this method again
                    res = requests.get(url=url, headers=headers, timeout=5)
                    res.encoding = 'utf-8'
                    html = res.text
                    self.parse_page(html)  # call the parser directly
                except Exception as e:
                    print('Retrying:', e)
                    self.blog += 1
                    self.get_html(url)

        def parse_page(self, html):
            parse_html = etree.HTML(html)
            # li_list: [<element li at xxx>, <element li at xxx>]
            li_list = parse_html.xpath(
                '//ul[@class="sellListContent"]/li[@class="clear LOGVIEWDATA LOGCLICKDATA"]')
            for li in li_list:
                item = {}
                name_list = li.xpath('.//a[@data-el="region"]/text()')  # name
                item['name'] = name_list[0].strip() if name_list else None

                # Layout + area + orientation + renovation
                info_list = li.xpath('.//div[@class="houseInfo"]/text()')
                if info_list:
                    info_list = info_list[0].strip().split('|')
                if info_list and len(info_list) == 5:
                    item['model'] = info_list[1].strip()
                    item['area'] = info_list[2].strip()
                    item['direction'] = info_list[3].strip()
                    item['perfect'] = info_list[4].strip()
                else:
                    item['model'] = item['area'] = item['direction'] = item['perfect'] = None

                floor_list = li.xpath('.//div[@class="positionInfo"]/text()')  # floor
                item['floor'] = floor_list[0].strip().split()[0] if floor_list else None
                address_list = li.xpath('.//div[@class="positionInfo"]/a/text()')  # district
                item['address'] = address_list[0].strip() if address_list else None
                total_list = li.xpath('.//div[@class="totalPrice"]/span/text()')  # total price
                item['total_price'] = total_list[0].strip() if total_list else None
                unit_list = li.xpath('.//div[@class="unitPrice"]/span/text()')  # unit price
                item['unit_price'] = unit_list[0].strip() if unit_list else None
                print(item)

        def main(self):
            for pg in range(1, 11):
                url = self.url.format(pg)
                self.get_html(url)
                time.sleep(random.randint(1, 3))
                # Reset the attempt counter for the next page
                self.blog = 1


    if __name__ == '__main__':
        start = time.time()
        spider = LianjiaSpider()
        spider.main()
        end = time.time()
        print('Elapsed time: %.2f' % (end - start))
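The retry-up-to-three-times idea in get_html can also be written as a plain loop that keeps no counter on the object. The sketch below is one possible shape, not the article's method: fetch_with_retry and flaky_fetch are hypothetical names, and the fetch function is passed in so the example runs without any network access.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=0.0):
    """Call fetch(url) up to `attempts` times.

    Returns the first successful result, or None if every attempt raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            print('attempt %d failed: %s' % (attempt, exc))
            time.sleep(delay)
    return None


# Stand-in for a real request: fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError('simulated timeout')
    return '<html>ok</html>'


result = fetch_with_retry(flaky_fetch, 'https://example.com/')
print(result)      # <html>ok</html>
print(len(calls))  # 3
```

In a spider, fetch would wrap the requests.get call, and the caller would move on to the next page whenever None comes back.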
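Installing Chrome browser extensions

How to install

* Change the file extension of the downloaded extension (the build for your operating system and browser) to .zip
* Open Chrome -> settings menu at the top right -> More tools -> Extensions -> turn on Developer mode
* Drag the extension into the browser and release the mouse to install it
* Restart the browser
Extensions to install

* Xpath Helper: quickly get the XPath of an HTML element; toggle on/off: Ctrl+Shift+X
* Proxy SwitchyOmega: a proxy-management extension for Chrome
* JsonView: pretty-prints JSON data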
