I am parsing links from a website and then trying to parse the iframe src on each of the linked pages.
I am running CentOS 6.5 with Python 2.7.5.
scrapy runspider new.py -o videos.csv
    import scrapy
    from scrapy.http.request import Request

    class PdgaSpider(scrapy.Spider):
        name = "pdgavideos"
        start_urls = ["http://www.pdga.com/videos/"]

        def parse(self, response):
            for link in response.xpath('//td[2]/a/@href').extract():
                yield Request('http://www.pdga.com'+link, callback=self.parse_page, meta={'link':link})

        def parse_page(self, response):
            for frame in response.xpath("//player").extract():
                yield {
                    'link': response.urljoin(frame)
                }

Debug output:
DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-gbo-final-round-front-9-sexton-mcbeth-mccray-newhouse> (referer: http://www.pdga.com/videos/)
DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-1-pierce-fajkus-leatherman-c-allen-sexton-leatherman> (referer: http://www.pdga.com/videos/)
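DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-gbo-final-round-back-9-sexton-mcbeth-mccray-newhouse> (referer: http://www.pdga.com/videos/)

Expected result:

Posted on 2017-07-21 11:12:38
Scrapy does not scrape the content of iframes on its own, but you can still get it. First extract the iframe URL, then request that URL and parse the response.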
    urls = response.css('iframe::attr(src)').extract()
    for url in urls:
        yield scrapy.Request(url....)

https://stackoverflow.com/questions/43819255