A small Scrapy pitfall: the payload parameter when sending a POST request with Scrapy for the first time

A small pitfall I hit the first time I sent a POST request with the Scrapy framework

Preface: I was crawling Yuecai <http://www.yuecai.com/purchase/?SiteID=21>, a data development platform, to scrape the bidding and procurement information listed on it.

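If you only want the solution, skip straight to the end
First, the approach: analyze the site and look for patterns. I think this is the important step.
1. Look at the page structure first and find the pattern.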

It turns out that once we find this URL and request it, we land on the detail page, which has the data we want.

Finding the pattern and scraping the data

We found the site's API endpoint. It returns information about each company, and we can append the id to build the URL for each company.

But this URL takes a POST request, and it needs a Payload. I didn't notice that at first and treated it as an ordinary POST request, so it kept failing with a 400.

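When I scraped job listings from Lagou, I simulated a POST request the ordinary way and it worked, so I puzzled over this one for a long time before realizing that this POST request needs a Payload (a request payload).
A Baidu search told me that the data has to be converted to JSON before it is sent: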
    import json

    import requests


    class YuecaiSpider(object):
        def __init__(self):
            self.headers = {
                'Content-Type': 'application/json',
                'Host': 'iris.yuecai.com',
                'Origin': 'http://www.yuecai.com',
                # 'Referer': 'http://www.yuecai.com/purchase/?SiteID=21',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
            }
            self.data = {
                "page": 0,
                "size": 20,
                "sort": None,
                "teseData": 2,
                "word": None,
                "zone": None,
            }

        def start_requests(self):
            url = "http://iris.yuecai.com/iris/v1/purchase/search"
            # The body has to be serialized to JSON here
            res = requests.post(url, headers=self.headers, data=json.dumps(self.data))
            print(res.status_code)
            print(res.text)


    if __name__ == '__main__':
        yuecai = YuecaiSpider()
        yuecai.start_requests()
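The 400 described earlier is what one would expect if the server parses the body as JSON but receives form-encoded text instead. Here is a minimal standard-library sketch of the two encodings, with no network call. The `params` dict is only an illustrative subset of the search parameters, and the variable names are my own:

```python
import json
from urllib.parse import urlencode

# Illustrative subset of the search parameters (an assumption, not the full set).
params = {"page": 0, "size": 20, "teseData": 2}

# Body of an ordinary form POST (Content-Type: application/x-www-form-urlencoded).
form_body = urlencode(params)

# Body of a "Request Payload" POST (Content-Type: application/json).
json_body = json.dumps(params)

print(form_body)  # page=0&size=20&teseData=2
print(json_body)  # {"page": 0, "size": 20, "teseData": 2}
```

The same dict produces two very different bodies, so the Content-Type header and the serialization have to agree with what the endpoint expects.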
The test script written with requests worked without any problem, but after I ported it to Scrapy I kept getting no data, and I couldn't figure out why.

    import json

    import scrapy


    class YuecaiSpider(scrapy.Spider):
        name = 'yuecai'
        allowed_domains = ['yuecai.com']
        start_urls = ['http://iris.yuecai.com/iris/v1/purchase/search']
        site_name = '悦采网数据平台'
        version = '1.0'

        def __init__(self, *args, **kwargs):
            super(YuecaiSpider, self).__init__(*args, **kwargs)
            self.headers = {
                # 'Accept': 'application/json, text/javascript, */*; q=0.01',
                # 'Accept-Encoding': 'gzip, deflate',
                # 'Accept-Language': 'zh-CN,zh;q=0.9',
                # 'Connection': 'keep-alive',
                # 'Content-Length': '69',
                'Content-Type': 'application/json',
                'Host': 'iris.yuecai.com',
                'Origin': 'http://www.yuecai.com',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
            }
            self.data = {
                'word': None,  # 'NULL'
                'zone': None,  # Watch out, this is a real trap: some sites need the literal 'null' here
                'page': '1',  # send a string where the API expects a string
                'size': '20',
                # 'sort': None,
                'teseData': '2',
            }

        def start_requests(self):
            url = "http://iris.yuecai.com/iris/v1/purchase/search"
            yield scrapy.Request(
                url,
                method="POST",
                headers=self.headers,
                body=json.dumps(self.data),  # the data has to be serialized to JSON here as well
                dont_filter=True,
                callback=self.parse,
            )

        def parse(self, response):
            print("*" * 50)
            print(response.status)
            print(response.text)
            print("*" * 50)
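What I changed: I put quotes around the request parameter values above, and the ones that needed 'null' became None (although on some sites the POST parameter really is the string 'null').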
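To see what that distinction looks like in the serialized body, here is a small sketch using only the standard `json` module. The keys are borrowed from the parameters above and the variable names are my own:

```python
import json

# Python None is serialized as the JSON literal null (no quotes).
as_null = json.dumps({"zone": None})

# The Python string 'null' is serialized as a quoted four-character string.
as_string = json.dumps({"zone": "null"})

# Likewise, an int and a quoted number are different JSON values.
as_number = json.dumps({"page": 1})
as_text = json.dumps({"page": "1"})

print(as_null)    # {"zone": null}
print(as_string)  # {"zone": "null"}
print(as_number)  # {"page": 1}
print(as_text)    # {"page": "1"}
```

Which form a given endpoint accepts is site-specific, so it is worth comparing the serialized body against the Request Payload shown in the browser's developer tools.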
OK, let's look at the result of my run.

Update 2018/11/29

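To the commenter: I tested this on my own machine and it works fine. Code pasted into the comments loses its formatting, so I'm putting it in the article instead. The result is shown below.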

The test code:
    # -*- coding:utf-8 -*-
    # @Author: wg
    # @Time: 2018/11/29 15:02
    # @Desc:
    """
    """
    import scrapy


    class TestCnSpider(scrapy.Spider):
        name = 'test_cn'
        allowed_domains = ['org.cn']
        start_urls = [
            'http://gs.amac.org.cn/amac-infodisc/api/pof/fund?rand=0.8266535799537897&page=0&size=20'
        ]

        """
        POST http://gs.amac.org.cn/amac-infodisc/api/pof/fund?rand=0.4768735209349304&page=0&size=20 HTTP/1.1
        Host: gs.amac.org.cn
        Proxy-Connection: keep-alive
        Content-Length: 2
        Accept: application/json, text/javascript, */*; q=0.01
        Origin: http://gs.amac.org.cn
        X-Requested-With: XMLHttpRequest
        User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36
        Content-Type: application/json
        Referer: http://gs.amac.org.cn/amac-infodisc/res/pof/fund/index.html
        Accept-Encoding: gzip, deflate
        Accept-Language: zh-CN,zh;q=0.9
        """

        headers = {
            "Host": "gs.amac.org.cn",
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Origin": "http://gs.amac.org.cn",
            "X-Requested-With": "XMLHttpRequest",
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36",
            "Content-Type": "application/json",
            "Referer": "http://gs.amac.org.cn/amac-infodisc/res/pof/fund/index.html",
            "Accept-Language": "zh-CN,zh;q=0.9",
        }

        def start_requests(self):
            yield scrapy.Request(
                self.start_urls[0],
                method="POST",
                headers=self.headers,
                body="{}",
                callback=self.parse,
                dont_filter=True
            )

        def parse(self, response):
            print(response.text)
