首页
学习
活动
专区
圈层
工具
发布
社区首页 >问答首页 >如何捕获requests.get()异常

如何捕获requests.get()异常
EN

Stack Overflow用户
提问于 2016-08-09 18:22:54
回答 3查看 1.8K关注 0票数 3

我正在为yellowpages.com开发一个网络刮刀,它似乎总体上运行得很好。但是,在遍历长查询的分页过程中,requests.get(url)将随机返回<Response [503]><Response [404]>。有时,我会收到更坏的例外情况,例如:

HTTPConnectionPool(host='www.yellowpages.com',port=80):最大重试超过url: /search?search_terms=florists&geo_location_terms=FL&page=22 (由NewConnectionError引起(‘:未能建立新连接: WinError 10053已建立的连接被主机中的软件中止“,)

使用time.sleep()似乎可以消除503个错误,但404和异常仍然存在问题。

我试图弄清楚如何“捕获”各种响应,这样我就可以进行更改(等待、更改代理、更改用户代理),然后再试一次和/或继续前进。伪码是这样的:

代码语言:javascript
复制
If error/exception with request.get:
    wait and/or change proxy and user agent
    retry request.get
else:
    pass

在这一点上,我甚至无法用以下方法捕捉到一个问题:

代码语言:javascript
复制
try:
    r = requests.get(url)
except requests.exceptions.RequestException as e:
    print (e)
    import sys #only added here, because it's not part of my stable code below
    sys.exit()

我从github开始的全部代码,以及下面的代码:

代码语言:javascript
复制
import requests
from bs4 import BeautifulSoup
import itertools
import csv

# Search criteria
search_terms = ["florists", "pharmacies"]
search_locations = ['CA', 'FL']

# Structure for Data
answer_list = []
csv_columns = ['Name', 'Phone Number', 'Street Address', 'City', 'State', 'Zip Code']


# Turns list of lists into csv file
def write_to_csv(csv_file, csv_columns, answer_list):
    with open(csv_file, 'w') as csvfile:
        writer = csv.writer(csvfile, lineterminator='\n')
        writer.writerow(csv_columns)
        writer.writerows(answer_list)


# Creates url from search criteria and current page
def url(search_term, location, page_number):
    template = 'http://www.yellowpages.com/search?search_terms={search_term}&geo_location_terms={location}&page={page_number}'
    return template.format(search_term=search_term, location=location, page_number=page_number)


# Finds all the contact information for a record
def find_contact_info(record):
    holder_list = []
    name = record.find(attrs={'class': 'business-name'})
    holder_list.append(name.text if name is not None else "")
    phone_number = record.find(attrs={'class': 'phones phone primary'})
    holder_list.append(phone_number.text if phone_number is not None else "")
    street_address = record.find(attrs={'class': 'street-address'})
    holder_list.append(street_address.text if street_address is not None else "")
    city = record.find(attrs={'class': 'locality'})
    holder_list.append(city.text if city is not None else "")
    state = record.find(attrs={'itemprop': 'addressRegion'})
    holder_list.append(state.text if state is not None else "")
    zip_code = record.find(attrs={'itemprop': 'postalCode'})
    holder_list.append(zip_code.text if zip_code is not None else "")
    return holder_list


# Main program
def main():
    for search_term, search_location in itertools.product(search_terms, search_locations):
        i = 0
        while True:
            i += 1
            url = url(search_term, search_location, i)
            r = requests.get(url)
            soup = BeautifulSoup(r.text, "html.parser")
            main = soup.find(attrs={'class': 'search-results organic'})
            page_nav = soup.find(attrs={'class': 'pagination'})
            records = main.find_all(attrs={'class': 'info'})
            for record in records:
                answer_list.append(find_contact_info(record))
            if not page_nav.find(attrs={'class': 'next ajax-page'}):
                csv_file = "YP_" + search_term + "_" + search_location + ".csv"
                write_to_csv(csv_file, csv_columns, answer_list)  # output data to csv file
                break

if __name__ == '__main__':
    main()

谢谢您抽出时间阅读这篇长篇文章/回复:)

EN

回答 3

Stack Overflow用户

发布于 2019-08-30 00:53:18

我一直在做类似的事情,这对我很有帮助(主要是):

代码语言:javascript
复制
# For handling the requests to the webpages
import requests
from requests_negotiate_sspi import HttpNegotiateAuth


# Test results, 1 record per URL to test
w = open(r'C:\Temp\URL_Test_Results.txt', 'w')

# For errors only
err = open(r'C:\Temp\URL_Test_Error_Log.txt', 'w')

print('Starting process')

def test_url(url):
    # Test the URL and write the results out to the log files.

    # Had to disable the warnings, by turning off the verify option, a warning is generated as the
    # website certificates are not checked, so results could be "bad". The main site throws errors
    # into the log for each test if we don't turn it off though.
    requests.packages.urllib3.disable_warnings()
    headers={'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
    print('Testing ' + url)
    # Try the website link, check for errors.
    try:
        response = requests.get(url, auth=HttpNegotiateAuth(), verify=False, headers=headers, timeout=5)
    except requests.exceptions.HTTPError as e:
        print('HTTP Error')
        print(e)
        w.write('HTTP Error, check error log' + '\n')
        err.write('HTTP Error' + '\n' + url + '\n' + e + '\n' + '***********' + '\n' + '\n')
    except requests.exceptions.ConnectionError as e:
        # some external sites come through this, even though the links work through the browser
        # I suspect that there's some blocking in place to prevent scraping...
        # I could probably work around this somehow.
        print('Connection error')
        print(e)
        w.write('Connection error, check error log' + '\n')
        err.write(str('Connection Error') + '\n' + url + '\n' + str(e) + '\n' + '***********' + '\n' + '\n')
    except requests.exceptions.RequestException as e:
        # Any other error types
        print('Other error')
        print(e)
        w.write('Unknown Error' + '\n')
        err.write('Unknown Error' + '\n' + url + '\n' + e + '\n' + '***********' + '\n' + '\n')
    else:
        # Note that a 404 is still 'successful' as we got a valid response back, so it comes through here
        # not one of the exceptions above.
        response = requests.get(url, auth=HttpNegotiateAuth(), verify=False)
        print(response.status_code)
        w.write(str(response.status_code) + '\n')
        print('Success! Response code:', response.status_code)
    print('========================')

test_url('https://stackoverflow.com/')

我目前在某些站点超时时仍然有一些问题,您可以按照我的尝试在这里解决:2个有效的URL,requests.get()在1上失败,而另一个失败。为什么?

票数 2
EN

Stack Overflow用户

发布于 2016-08-09 18:27:19

像这样的东西怎么样

代码语言:javascript
复制
try:
    req = ..
    if req.status_code == 503:
        pass
    elif ..:
        pass
    else:
        do something when request succeeds
except ConnectionError:
    pass
票数 0
EN

Stack Overflow用户

发布于 2019-03-29 07:00:52

你可以试试这个

try: #do something except requests.exceptions.ConnectionError as exception: #handle the newConnectionError exception except Exception as exception: #handle any exception

票数 0
EN
页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持
原文链接:

https://stackoverflow.com/questions/38857883

复制
相关文章

相似问题

领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档