I'm trying to create a scrapy-splash script to get the food links from:
https://www.realcanadiansuperstore.ca/Food/Meat-%26-Seafood/c/RCSS001004000000
When you visit it for the first time, it asks you to pick a region. I think I've handled that correctly by setting the cookies dict in the code below. I'm trying to get the links to all the food items in the carousel. I'm using Splash because the carousel is built by JavaScript, so a regular request and parse doesn't show it in the HTML. My problem is that no data ends up in my 'item' dict.
import scrapy
from scrapy_splash import SplashRequest


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ["https://www.realcanadiansuperstore.ca/Food/Meat-%26-Seafood/c/RCSS001004000000"]

    def start_requests(self):
        for url in self.start_urls:
            # render the page with Splash so the JavaScript-built carousel is present
            yield SplashRequest(url, cookies={'currentRegion': 'CA-BC'},
                                callback=self.parse, endpoint='render.html',
                                args={'wait': 0.5})

    def parse(self, response):
        item = {}
        item['urls'] = []
        # collect every product link in the carousel
        itemList = response.css('div.product-name-wrapper > a > ::attr(href)').extract()
        for links in itemList:
            item['urls'].append(links)
        yield item

I think my cookie isn't being set correctly, so the request is sent to the page that asks you to choose a region.
By the way, I'm running Splash in a Docker container. If I go to localhost in my browser, it shows the Splash start page.
Here is the output I get from the crawler:
<GET https://www.realcanadiansuperstore.ca/Food/Meat-%26-Seafood/c/RCSS001004000000 via http://localhost:8050/render.html> (referer: None)
2017-07-04 16:44:05 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.realcanadiansuperstore.ca/Food/Meat-%26-Seafood/c/RCSS001004000000>
{'urls': []}

What could be going wrong here? I've filled in the settings file as described at https://github.com/scrapy-plugins/scrapy-splash.
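For reference, the setup that README describes looks roughly like this in settings.py (assuming Splash is listening on localhost:8050):

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'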
OK, by setting the cookie as follows I've been able to get the Splash browser instance on localhost to render the HTML I need:

function main(splash)
    splash:add_cookie{"sessionid", "237465ghgfsd", "/",
        domain="http://example.com"}
    splash:go("http://example.com/")
    return splash:html()
end

But that is a script you type into the browser page. How do I apply it in my Python script? Is there a different way to add cookies in Python?
Answered on 2017-07-21 05:01:14
If you have a script that works for you, you can run it with the /execute endpoint:

yield SplashRequest(url, endpoint='execute', args={'lua_source': my_script})

scrapy-splash also supports transparent cookie handling, so that cookies persist across SplashRequests the same way they do with regular scrapy.Request objects:
script = """
function main(splash)
splash:init_cookies(splash.args.cookies)
assert(splash:go{
splash.args.url,
headers=splash.args.headers,
http_method=splash.args.http_method,
body=splash.args.body,
})
assert(splash:wait(0.5))
local entries = splash:history()
local last_response = entries[#entries].response
return {
url = splash:url(),
headers = last_response.headers,
http_status = last_response.status,
cookies = splash:get_cookies(),
html = splash:html(),
}
end
"""
class MySpider(scrapy.Spider):

    # def my_parse(self, response):
    #     ...
        yield SplashRequest(url, self.parse_result,
            endpoint='execute',
            cache_args=['lua_source'],
            args={'lua_source': script},
        )

    def parse_result(self, response):
        # here response.body contains the result HTML;
        # response.headers are filled with the headers from the last
        # web page loaded in Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into the Set-Cookie response header, so that Scrapy
        # can remember them.

See the examples in the scrapy-splash README.
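To connect this back to the question, here is a minimal, untested sketch that sets the region cookie inside a Lua script and runs it through the /execute endpoint. The cookie name 'currentRegion', its value 'CA-BC', the '.realcanadiansuperstore.ca' domain, and the CSS selector are taken from the question and are assumptions, not verified against the site:

import scrapy
from scrapy_splash import SplashRequest

# Assumption: the region choice is stored in a cookie called 'currentRegion'
# on the .realcanadiansuperstore.ca domain (as implied by the question).
region_script = """
function main(splash)
    splash:add_cookie{"currentRegion", "CA-BC", "/",
        domain=".realcanadiansuperstore.ca"}
    assert(splash:go(splash.args.url))
    assert(splash:wait(1.0))
    return splash:html()
end
"""

class SuperstoreSpider(scrapy.Spider):
    name = 'superstore'
    start_urls = ["https://www.realcanadiansuperstore.ca/Food/Meat-%26-Seafood/c/RCSS001004000000"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='execute',
                args={'lua_source': region_script})

    def parse(self, response):
        # response.body now holds the HTML rendered after the cookie was set,
        # so the JavaScript-built carousel should be present.
        yield {'urls': response.css('div.product-name-wrapper > a > ::attr(href)').extract()}

Adding cache_args=['lua_source'] as in the README example above is optional here; it just avoids resending the script to Splash with every request.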
https://stackoverflow.com/questions/44915382