Scraping All Tencent Video Movie Data in 50 Lines of Code

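I have been learning the Scrapy framework lately, and for practice I spent a little while today scraping movie information from Tencent Video. A while ago I scraped Tencent Video in Java using Jsoup, and the speed was nothing to write home about. Scrapy, by contrast, keeps the code concise and crawls efficiently; it really is a great tool for scraping.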

Getting to the point

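I parse the pages with XPath, and I recommend a handy tool for it called XPath-Helper. I only discovered it today, and it works very well.
Here is what it looks like:

It shows the matches for an XPath expression live on the page, which makes debugging easy.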
Download: <https://download.csdn.net/download/momentyol/10385631>

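If you have no download points or find that too much trouble:
Link: https://pan.baidu.com/s/1iGahcAWklYG9UFCUYARjKg Password: n7c1
Once the extension is installed, open a page and press Ctrl+Shift+X to bring up the tool.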

Start scraping

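* The site's URL is simple. Start from the page that lists all movies: <https://v.qq.com/x/list/movie>
* Next, work out how the URL changes when you turn the page. Each page holds 30 entries and offset increases by 30 per page, so once you know the total page count you can derive the URL of every page:
  https://v.qq.com/x/list/movie?offset=30*(page-1)
* Then analyze the page structure. It is simple enough that I will not go into it here.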

Here is the XPath code that extracts the basic information for each movie:

    # Runs inside the spider's callback; VideoItem is the project's Item class.
    video_list = response.xpath("//li[@class='list_item']")
    print(len(video_list))
    for node in video_list:
        item = VideoItem()
        name = node.xpath(".//strong[@class='figure_title']/a/text()").extract()[0]
        # Join all <em> text nodes into one score string.
        score_text = node.xpath(".//div[@class='figure_score']//em/text()").extract()
        score = ''.join(score_text)
        # Not every movie has a short description.
        info = node.xpath(".//span[@class='figure_info']/text()").extract()
        short_desc = info[0] if info else ''
        starts_text = node.xpath(".//div[@class='figure_desc']/a/text()").extract()
        starts = ','.join(starts_text)
        hot = node.xpath(".//span[@class='num']/text()").extract()[0]
        play_url = node.xpath(".//a/@href").extract()[0]
        item['name'] = name
        item['short_desc'] = short_desc
        item['score'] = score
        item['starts'] = starts
        item['hot'] = hot
        item['play_url'] = play_url
        yield item

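
To try expressions like these without running a crawl, the sketch below applies the same relative-path idea to a small hand-written snippet, using Python's standard-library ElementTree rather than Scrapy. The sample markup, the `parse_movies` name, and the example.com URLs are assumptions made for illustration; the real page is larger and may differ. ElementTree has no `text()` step, so the text is read from the matched element instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup shaped after the XPath expressions above (not the real page).
SAMPLE = """
<ul>
  <li class="list_item">
    <a href="https://example.com/play/1">poster</a>
    <strong class="figure_title"><a href="https://example.com/play/1">Movie A</a></strong>
    <div class="figure_score"><em>8</em><em>.5</em></div>
    <span class="figure_info">A short description</span>
    <span class="num">1234</span>
  </li>
  <li class="list_item">
    <a href="https://example.com/play/2">poster</a>
    <strong class="figure_title"><a href="https://example.com/play/2">Movie B</a></strong>
    <div class="figure_score"><em>7</em><em>.0</em></div>
    <span class="num">56</span>
  </li>
</ul>
"""


def parse_movies(markup):
    """Return one dict per list item, using paths relative to that item."""
    root = ET.fromstring(markup)
    movies = []
    for node in root.findall(".//li[@class='list_item']"):
        title = node.find(".//strong[@class='figure_title']/a")
        score_parts = node.findall(".//div[@class='figure_score']//em")
        info = node.find(".//span[@class='figure_info']")
        movies.append({
            "name": title.text,
            "score": "".join(em.text or "" for em in score_parts),
            # Fall back to an empty string when the description is missing.
            "short_desc": info.text if info is not None else "",
            "hot": node.find(".//span[@class='num']").text,
            # The first <a> inside the item carries the play link.
            "play_url": node.find(".//a").get("href"),
        })
    return movies


if __name__ == "__main__":
    for movie in parse_movies(SAMPLE):
        print(movie)
```

The leading `.//` is what keeps each lookup inside the current list item; without the dot, the expression would search the whole document again on every iteration.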
Then save each item in the pipeline. To keep things simple, the items are written to a local file.
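
The pipeline code itself is not reproduced here, so the following is only a minimal sketch of what saving items to a local file might look like. It follows Scrapy's item pipeline interface (`open_spider`, `process_item`, `close_spider`); the class name `JsonLinesPipeline` and the output file name `movies.jl` are assumptions, not the author's actual code.

```python
import json


class JsonLinesPipeline:
    """Write every scraped item to a local file, one JSON object per line."""

    def __init__(self, path="movies.jl"):  # hypothetical output file name
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # dict() works for both scrapy.Item objects and plain dicts.
        line = json.dumps(dict(item), ensure_ascii=False)
        self.file.write(line + "\n")
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

To enable a pipeline like this, its class path is registered in the `ITEM_PIPELINES` setting of the project's settings.py.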

Then loop over the pages and save the data from each one:

Why subtract 1 from total_page here? Because the first page has offset=0, so the last page's offset is 30 * (total_page - 1).
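
The paging logic can be sketched as a small helper that turns a page count into the list of page URLs. This illustrates the offset rule described above and is not the author's spider code; the `build_page_urls` name and its parameters are assumptions. In a spider, each URL would typically be wrapped in a `scrapy.Request` and yielded.

```python
BASE_URL = "https://v.qq.com/x/list/movie"
PAGE_SIZE = 30  # entries per page, as observed on the site


def build_page_urls(total_page, base_url=BASE_URL, page_size=PAGE_SIZE):
    """Return every list-page URL, from offset 0 up to page_size * (total_page - 1)."""
    return [
        f"{base_url}?offset={page_size * page_index}"
        for page_index in range(total_page)
    ]


if __name__ == "__main__":
    urls = build_page_urls(167)
    print(urls[0])
    print(urls[-1])
    print(len(urls))
```

With 167 pages the page index runs from 0 to 166, so the last offset is 30 * 166 = 4980.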

Then the data comes out:

About 50 lines of code, and all 5,010 entries across 167 pages are scraped in roughly 40 seconds.

The code is on GitHub; help yourself if you want a look:

GitHub link <https://github.com/Grois/Python-Video>

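I will keep updating this to scrape the other major Chinese video sites such as iQiyi, Youku, and Tudou, covering TV series, movies, variety shows, anime, and more. If you are interested, give the repo a star and follow along.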
