In the previous article we used Python Scrapy to crawl all the text on a static web page:
https://blog.csdn.net/sinat_40431164/article/details/81102476
But there is a problem: if we change the URL we want to visit to http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2, we find that the crawled content does not contain the "车型论坛" (model forums) and "主题论坛" (topic forums) sections.
Sometimes, when we naively download an HTML page with urllib or Scrapy, we discover that the elements we want to extract are not in the HTML we downloaded, even though they look perfectly accessible in the browser.
This means the elements we want are generated dynamically by JavaScript in response to certain actions. For example, when you keep scrolling through QQ Zone or Weibo comments, the page grows longer and more content keeps appearing; that is the love-it-and-hate-it dynamic loading. There are currently two ways to crawl dynamic pages:
* Analyze the page's requests
* Use Selenium to simulate browser behavior
Below we will look at how to use Selenium to simulate browser behavior.
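Before diving into the Scrapy project, here is a minimal sketch of the idea (it assumes Firefox and a geckodriver binary on your PATH; adjust the driver setup to your environment): Selenium drives a real browser, so the page source it returns already contains the JavaScript-rendered elements.

```python
# Minimal sketch: fetch a JS-rendered page with Selenium
# (assumes Firefox and geckodriver are installed and on PATH)
from selenium import webdriver

browser = webdriver.Firefox()
browser.set_page_load_timeout(30)
browser.get('http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2')
# Scroll to the bottom to trigger any lazy/dynamic loading
browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
html = browser.page_source  # rendered HTML, including JS-generated content
browser.quit()
```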
Creating a project
Before you start scraping, you will have to set up a new Scrapy project. Enter
a directory where you’d like to store your code and run:
```
scrapy startproject URLCrawler
```
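This creates a project skeleton along these lines (the standard layout generated by the Scrapy template):

```
URLCrawler/
    scrapy.cfg            # deploy configuration file
    URLCrawler/           # project's Python module
        __init__.py
        items.py
        middlewares.py    # we will edit this file below
        pipelines.py
        settings.py       # we will edit this file below
        spiders/          # directory where we put our spider
            __init__.py
```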
Our first Spider
This is the code for our first Spider. Save it in a file named my_spider.py
under the URLCrawler/spiders directory in your project:
```python
# -*- coding: utf-8 -*-
"""
Created on Wed Jul 18 17:55:45 2018
@author: Administrator
"""
from scrapy import Spider, Request
from selenium import webdriver

class MySpider(Spider):
    name = "my_spider"

    def __init__(self):
        # Raw string so the backslashes in the Windows path are not treated as escape sequences
        self.browser = webdriver.Firefox(
            executable_path=r'E:\software\python\geckodriver-v0.21.0-win64\geckodriver.exe')
        self.browser.set_page_load_timeout(30)

    def closed(self, spider):
        # Called when the spider finishes; shut the browser down
        print("spider closed")
        self.browser.close()

    def start_requests(self):
        # Only one page here; the format()/range() pattern makes it easy to add more pages later
        start_urls = ['http://club.haval.com.cn/forum.php?mod=toutiao&mobile=2'.format(str(i))
                      for i in range(1, 2, 2)]
        for url in start_urls:
            yield Request(url=url, callback=self.parse)

    def parse(self, response):
        # Save the rendered HTML to a file named after the domain
        domain = response.url.split("/")[-2]
        filename = '%s.html' % domain
        with open(filename, 'wb') as f:
            f.write(response.body)
        print('---------------------------------------------------')
```
middlewares.py
Add the following to it:
```python
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException
import time

class SeleniumMiddleware(object):
    def process_request(self, request, spider):
        if spider.name == 'my_spider':
            try:
                # Let the spider's browser load the page, then scroll to the bottom
                # so that JavaScript-generated content is rendered
                spider.browser.get(request.url)
                spider.browser.execute_script('window.scrollTo(0, document.body.scrollHeight)')
            except TimeoutException as e:
                print('timed out')
                spider.browser.execute_script('window.stop()')
            time.sleep(2)
            # Return the rendered page source instead of letting Scrapy download the URL itself
            return HtmlResponse(url=spider.browser.current_url,
                                body=spider.browser.page_source,
                                encoding="utf-8", request=request)
```
settings.py
Add the following:
```python
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'URLCrawler.middlewares.SeleniumMiddleware': 543,
}
```
How to run our spider
To put our spider to work, go to the project's top-level directory and run:
```
scrapy crawl my_spider
```
You will find that the downloaded page contains the same content as what you see when visiting the page in a browser!
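As a quick sanity check, you can search the saved file for the sections that were missing before (the filename below assumes the naming used in parse() above, i.e. the file is named after the domain, and that the request was not redirected):

```python
# Quick check: the dynamically loaded sections should now be in the saved HTML
with open('club.haval.com.cn.html', encoding='utf-8') as f:
    html = f.read()
print('车型论坛' in html, '主题论坛' in html)  # expect: True True
```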
If you only need the text content, change the parse method in the spider to:
```python
def parse(self, response):
    '''domain = response.url.split("/")[-2]
    filename = '%s.html' % domain
    with open(filename, 'wb') as f:
        f.write(response.body)'''
    # Text nodes excluding the contents of <script> and <style> tags:
    #textlist_no_scripts = response.selector.xpath(
    #    '//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract()
    # All non-empty text nodes, including <script>/<style> contents:
    textlist_with_scripts = response.selector.xpath(
        '//text()[normalize-space(.)]').extract()
    #with open('filename_no_scripts', 'w', encoding='utf-8') as f:
    with open('filename_with_scripts', 'w', encoding='utf-8') as f:
        for i in range(0, len(textlist_with_scripts)):
            text = textlist_with_scripts[i].strip()
            f.write(text + '\n')
    print('---------------------------------------------------')
```
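The difference between the two XPath expressions is only whether text inside script and style tags is kept. A small self-contained sketch with Scrapy's Selector on an inline HTML snippet illustrates this (the sample HTML is made up for illustration):

```python
from scrapy import Selector

sample = '<html><body><h1>车型论坛</h1><script>var x = 1;</script></body></html>'
sel = Selector(text=sample)
# All non-empty text nodes, including script contents
print(sel.xpath('//text()[normalize-space(.)]').extract())
# -> includes both '车型论坛' and 'var x = 1;'
# Text nodes whose parent is not <script> or <style>
print(sel.xpath('//*[not(self::script or self::style)]/text()[normalize-space(.)]').extract())
# -> only '车型论坛'
```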
The End.