Why a Python Crawler Fails to Scrape Taobao: An Analysis

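The regular expression
data = re.findall('g_page_config = (.*?)g_srp_loadCss', html, re.S)[0]
raises IndexError: list index out of range.
After removing the [0] and printing the result, the output is just an empty list. Nothing in the page matched, and taking the first element of an empty list is what triggers the out-of-range error.
Printing html shows that the fetched source differs from the page source seen in the browser: the expected product information is missing.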
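The IndexError can be avoided by checking for a match before using it. The following is a minimal sketch, assuming the captured text is a JSON object followed by a semicolon; the helper name extract_page_config and both sample strings are hypothetical and are not real Taobao responses.

```python
import json
import re

# Same pattern as in the post: capture everything between the two markers.
PAGE_CONFIG_RE = re.compile(r"g_page_config = (.*?)g_srp_loadCss", re.S)


def extract_page_config(html):
    """Return the parsed g_page_config object, or None if the marker is absent."""
    match = PAGE_CONFIG_RE.search(html)
    if match is None:
        # Nothing matched, e.g. because a login page came back instead.
        return None
    # Drop surrounding whitespace and the trailing semicolon before parsing.
    raw = match.group(1).strip().rstrip(";")
    return json.loads(raw)


# Hypothetical, hand-written inputs for illustration only.
search_page = 'var x; g_page_config = {"mods": {"itemlist": []}};\n    g_srp_loadCss();'
login_page = '<div class="login-newbg"></div>'

print(extract_page_config(search_page))  # {'mods': {'itemlist': []}}
print(extract_page_config(login_page))   # None
```

Returning None instead of indexing with [0] lets the caller tell "no data in this response" apart from a crash.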

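There are roughly two possible explanations:

* The Taobao page loads its content asynchronously, so the product data only appears once the user scrolls to that point, and a plain HTTP request never receives it.

* The fetched HTML asks the visitor to log in, which is most likely Taobao's anti-crawler mechanism.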

Code used (Python 3):
import urllib.request
import re
import json

# Page to crawl
url = 'https://s.taobao.com/search?q=python'

# Pretend to be a browser
headers = ('user-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)

# Fetch the page source
file = urllib.request.urlopen(url)
# Decode the bytes; str(file.read()) would produce the "b'...'" repr instead of the text
html = file.read().decode('utf-8', errors='replace')

# Inspect the fetched html
# print(html)
data = re.compile('g_page_config = (.*?)g_srp_loadCss', re.S).findall(html)
print(data)
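Printing the fetched page (here requested again with requests) shows a login form rather than search results, which points to the second explanation:
>>> import requests
>>> r = requests.get("https://s.taobao.com/search?q=python")
>>> r.text[:10000]
# partial output
<i class="iconfont"></i> "登录页面"改进建议\r\n </a>\r\n</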
div>\r\n\t\t\t\r\n\t\t</div>\r\n\t\t<div id="content">\r\n\t\t
\r\n\r\n\r\n\r\n\r\n\r\n<div class="login-newbg" style="background-image:
url(https://gtms01.alicdn.com/tps/i1/TB1GTCYLXXXXXcHXpXXcoeQ2VXX-2500-600.jpg);"
>\r\n\t<input type="hidden" id="J_adUrl" name="adUrl" value="">\r\n\t<input type
="hidden" id="J_adImage" name="adImage" value="">\r\n\t<input type="hidden" id=
"J_adText" name="adText" value="">\r\n\t<input type="hidden" id="J_viewFd4PC" </
i>扫码登录更安全\r\n </div>\r\n </div>\r\n </div>\r\n\t\r\n\t</div>\r\n <div class="bd"
>\r\n\t\t\r\n\t\t\r\n\t\t <div id="J_QuickLogin" class=
"ww-login hidden">\r\n\t\t\t<form action="" class="ww-form">\r\n\t\t\t\t<div
class="login-title">\r\n\t\t\t\t\t选择其中一个已登录的账户\r\n\t\t\t\t</div>\r\n\r\n\t\t\t\t
<div class="ww-userlist">\r\n\r\n\t\t\t\t</div>\r\n\t\t\t\t <div class="trigger"
>\r\n\r\n\t\t\t\t</div>\r\n\t\t\t\t <div class="submit">\r\n\t\t\t\t\t<button
type="submit" class="J_Submit" id="J_SubmitQuick">登录</button>\r\n\t\t\t\t</div>
\r\n\t\t\t\t<div class="other-login">\r\n\t\t\t\t\t<a href="" class="light-link"
id="J_Sso2Static">使用其他账户登录</a>\r\n\t\t\t\t</div>\r\n\t\t\t</form>\r\n </div>\r\n
<div id="J_MiserLogin" class="ww-login hidden">\r\n <form action="" class=
"ww-form">\r\n <input type="hidden" id="x_token" value="">\r\n <div class=
"login-title">\r\n 选择其中一个已登录的账户\r\n </div>\r\n\r\n
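A crawler can detect this situation automatically instead of failing later in the regex step. The sketch below is only a heuristic: the marker strings are copied from the output above, but treating them as reliable indicators of the login page is an assumption, and the function name looks_like_login_page is hypothetical.

```python
# Class names and ids that appear in the login-page HTML shown above.
# Assumption: they do not occur on a normal search-result page.
LOGIN_MARKERS = ("login-newbg", "J_QuickLogin", "J_MiserLogin")


def looks_like_login_page(html):
    """Heuristically report whether the fetched HTML is the login form."""
    return any(marker in html for marker in LOGIN_MARKERS)


# Hand-written fragments for illustration.
login_fragment = '<div class="login-newbg" style="background-image: url(x.jpg);">'
other_fragment = 'g_page_config = {"mods": {}}; g_srp_loadCss();'

print(looks_like_login_page(login_fragment))  # True
print(looks_like_login_page(other_fragment))  # False
```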
Next, I looked at Taobao's robots.txt file: https://www.taobao.com/robots.txt
(The details of the robots.txt format can be found with any search engine.)

Its contents are as follows:
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Disallow: /product/
Disallow: /

User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Disallow: /

User-agent: Bingbot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Disallow: /

User-Agent: 360Spider
Allow: /article
Allow: /oshtml
Allow: /ershou
Disallow: /

User-Agent: Yisouspider
Allow: /article
Allow: /oshtml
Allow: /ershou
Disallow: /

User-Agent: Sogouspider
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /ershou
Disallow: /

User-Agent: Yahoo! Slurp
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /oversea
Allow: /list
Allow: /ershou
Disallow: /

User-Agent: *
Disallow: /

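The final group (User-Agent: * followed by Disallow: /) shows that every crawler not listed by name is denied access to the entire site.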
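The effect of these groups can be checked offline by feeding the rules to urllib.robotparser directly. This is a sketch using an abridged excerpt of the file above (only part of the Googlebot group plus the catch-all group); the agent name my-crawler is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Abridged excerpt of the robots.txt listed above.
ROBOTS_EXCERPT = """\
User-Agent: Googlebot
Allow: /article
Allow: /product
Disallow: /

User-Agent: *
Disallow: /
"""

rp = RobotFileParser()
# parse() accepts the file as a list of lines, so no network access is needed.
rp.parse(ROBOTS_EXCERPT.splitlines())

# Googlebot may fetch the explicitly allowed prefixes, but nothing else.
print(rp.can_fetch("Googlebot", "https://www.taobao.com/article"))          # True
print(rp.can_fetch("Googlebot", "https://www.taobao.com/search?q=python"))  # False
# Any agent without its own group falls through to "User-Agent: *".
print(rp.can_fetch("my-crawler", "https://www.taobao.com/article"))         # False
```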

So how can a crawler avoid fetching pages it is not allowed to crawl?

This is where Python's built-in robotparser module (urllib.robotparser) comes in:
import urllib.robotparser

# Download and parse the site's robots.txt
rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://s.taobao.com/robots.txt')
rp.read()

url = 'https://s.taobao.com'
user_agent = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
# Ask whether this user agent may fetch the URL
print(rp.can_fetch(user_agent, url))
The result here is False, which means the page must not be crawled.
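In a crawler, this check can sit in front of every request. Below is a minimal sketch of such a gate; the function fetch_if_allowed, the stub fake_fetch, the example.com URLs, and the rules passed to parse() are all hypothetical and exist only so that the example runs without network access.

```python
from urllib.robotparser import RobotFileParser


def fetch_if_allowed(rp, user_agent, url, fetch):
    """Call fetch(url) only when robots.txt permits it; otherwise return None."""
    if not rp.can_fetch(user_agent, url):
        return None
    return fetch(url)


# Made-up rules: only paths under /public may be crawled.
rp = RobotFileParser()
rp.parse(["User-agent: *", "Allow: /public", "Disallow: /"])


def fake_fetch(url):
    # Stand-in for a real HTTP request such as urllib.request.urlopen(url).
    return "<html>fetched " + url + "</html>"


allowed = fetch_if_allowed(rp, "my-crawler", "https://example.com/public/page", fake_fetch)
blocked = fetch_if_allowed(rp, "my-crawler", "https://example.com/search", fake_fetch)

print(allowed)  # <html>fetched https://example.com/public/page</html>
print(blocked)  # None
```

Passing the fetch function as a parameter keeps the permission check separate from the HTTP library, so the same gate works with urllib.request or requests.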
