文章/答案/技术大牛

发布

社区首页 >问答首页 >booking.com的webscraping脚本不工作

问booking.com的webscraping脚本不工作
EN

Stack Overflow用户

提问于 2021-05-18 12:54:41

回答 1查看 186关注 0票数 0

我在这个页面上做了一个脚本来刮掉酒店的名称、评级和福利：链接。

这是我的剧本：

import numpy as np


import time
from random import randint
import requests
from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import re
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'
links1 = []

results = requests.get(url0, headers = headers)


soup = BeautifulSoup(results.text, "html.parser")

links1 = [a['href']  for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url',  href=True)]
 

root_url = 'https://www.booking.com/'
urls1 = [ '{root}{i}'.format(root=root_url, i=i) for i in links1 ]



pointforts = []
hotels = []
notes = []

for url in urls1: 
    results = requests.get(url)

    soup = BeautifulSoup(results.text, "html.parser")

    try :
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)

    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)

    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')



data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})


#print(data.head(20))

data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')

它起作用了，我做了一个循环来刮掉酒店的所有链接，在对所有这些酒店进行评级和补贴之后。但是我有双簧管，所以不是：links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', href=True)]

正如您在上面的脚本中所看到的那样，我将：links1 = [a['href'] for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url', href=True)]。

但是现在它已经不起作用了，我只得到了Nan，而以前，当我有双簧管的时候，我有一些和奶奶在一起，但是他们中的大多数都有我想要的额外福利和收视率。我不明白为什么。

以下是酒店链接的html：

酒店链接

下面是获得名称的html (在我获得链接之后，脚本转到这个链接)：

namehtml

下面是html，以获取与酒店相关的所有福利(比如名称，脚本转到我之前刮过的链接)：

perkshtml

这是我的结果。

输出

python

python-3.x

web-scraping

beautifulsoup

回答 1

Stack Overflow用户

回答已采纳

发布于 2021-05-19 19:13:01

该网站上的href标记包含换行符。一个在开始的时候，也在中途。因此，当您尝试组合root_url时，您将无法获得有效的URL。

修正可以是删除所有换行符。由于href总是以/开头，所以这也可以从root_url中删除，或者您可以使用urllib.parse.urljoin()。

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
    'Referer': 'https://www.espncricinfo.com/',
    'Upgrade-Insecure-Requests': '1',
    'Connection': 'keep-alive',
    'Pragma': 'no-cache',
    'Cache-Control': 'no-cache',
}

url0 = 'https://www.booking.com/searchresults.fr.html?label=gen173nr-1DCA0oTUIMZWx5c2Vlc3VuaW9uSA1YBGhNiAEBmAENuAEXyAEM2AED6AEB-AECiAIBqAIDuAL_5ZqEBsACAdICJDcxYjgyZmI2LTFlYWQtNGZjOS04Y2U2LTkwNTQyZjI5OWY1YtgCBOACAQ;sid=303509179a2849df63e4d1e5bc1ab1e3;dest_id=-1456928;dest_type=city&'

results = requests.get(url0, headers = headers)
soup = BeautifulSoup(results.text, "html.parser")

links1 = [a['href'].replace('\n','')  for a in soup.find("div", {"class": "hotellist sr_double_search"}).find_all('a', class_ = 'js-sr-hotel-link hotel_name_link url',  href=True)]
root_url = 'https://www.booking.com'
urls1 = [f'{root_url}{i}' for i in links1]

pointforts = []
hotels = []
notes = []

for url in urls1: 
    results = requests.get(url)
    soup = BeautifulSoup(results.text, "html.parser")

    try:
        div = soup.find("div", {"class": "hp_desc_important_facilities clearfix hp_desc_important_facilities--bui"})
        pointfort = [x['data-name-en'] for x in div.select('div[class*="important_facility"]')]
        pointforts.append(pointfort)
    except:
        pointforts.append('Nan')

    try:    
        note = soup.find('div', class_ = 'bui-review-score__badge').text
        notes.append(note)
    except:
        notes.append('Nan')
    
    try:
        hotel = soup.find("h2",attrs={"id":"hp_hotel_name"}).text.strip("\n").split("\n")[1]
        hotels.append(hotel)
    except:
        hotels.append('Nan')


data = pd.DataFrame({
    'Notes' : notes,
    'Points fort' : pointforts,
    'Nom' : hotels})

#print(data.head(20))
data.to_csv('datatest.csv', sep=';', index=False, encoding = 'utf_8_sig')

这将为您提供一个输出CSV文件，启动：

Notes;Points fort;Nom
 8,3 ;['Parking (fee required)', 'Free WiFi Internet Access Included', 'Family Rooms', 'Airport Shuttle', 'Non Smoking Rooms', '24 hour Front Desk', 'Bar'];Elysées Union
 8,4 ;['Free WiFi Internet Access Included', 'Family Rooms', 'Non Smoking Rooms', 'Pets allowed', '24 hour Front Desk', 'Rooms/Facilities for Disabled'];Hyatt Regency Paris Etoile
 8,3 ;['Free WiFi Internet Access Included', 'Family Rooms', 'Non Smoking Rooms', 'Pets allowed', 'Restaurant', '24 hour Front Desk', 'Bar'];Pullman Paris Tour Eiffel
 8,7 ;['Free WiFi Internet Access Included', 'Non Smoking Rooms', 'Restaurant', '24 hour Front Desk', 'Rooms/Facilities for Disabled', 'Elevator', 'Bar'];citizenM Paris Gare de Lyon

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/67586627

复制

相似问题

问booking.com的webscraping脚本不工作
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问booking.com的webscraping脚本不工作EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问booking.com的webscraping脚本不工作
EN