I have a URL:
http://somewhere.com/relatedqueries?limit=2&query=seedterm
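Changing its two inputs, limit and query, produces the data I want. limit is the maximum number of terms to return and query is the seed term.
The URL returns a text result in the following format: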
oo.visualization.Query.setResponse({version:'0.5',reqId:'1303596067112929220',status:'ok',sig:'1303596067112929220',table:{cols:[{id:'score',label:'Score',type:'number',pattern:'#,##0.###'},{id:'query',label:'Query',type:'string',pattern:''}],rows:[{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm1'}]},{c:[{v:0.9894380670262618,f:'0.99'},{v:'newterm2'}]}],p:{'totalResultsCount':'7727'}}});
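I would like to write a Python script that takes two arguments (the limit and the query seed), fetches the data online, parses the result, and returns a list of the new terms, in this case ['newterm1', 'newterm2'].
I would appreciate some help, especially with fetching the URL, since I have never done that before.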
Posted on 2010-10-30 07:51:48
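It sounds like you can break this problem up into several subproblems.
Subproblems
There are a handful of problems to solve before composing the finished script:
Forming the request URL
This is just simple string formatting.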
url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
url = url_template.format(limit=2, seedterm='seedterm')
Python 2 note
You will need to use the string formatting operator (%) here.
url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
url = url_template % dict(limit=2, seedterm='seedterm')
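Plain string formatting leaves the seed term unescaped, so a term containing spaces, `&`, or non-ASCII characters would produce a malformed URL. Here is a minimal sketch of one way around that, assuming Python 3 and the standard `urllib.parse.urlencode`. The helper name `build_url` is made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = 'http://somewhere.com/relatedqueries'  # endpoint from the question


def build_url(limit, seedterm):
    """Return the request URL with both parameters percent-encoded."""
    # urlencode escapes reserved characters and joins the pairs with '&'
    return BASE_URL + '?' + urlencode({'limit': limit, 'query': seedterm})


print(build_url(2, 'seed term'))
```

In Python 2 the same function is available as `urllib.urlencode`.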
Retrieving the data
You can use the built-in urllib.request module for this.
import urllib.request
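data = urllib.request.urlopen(url)  # url from previous section
This returns a file-like object called data. You can also use a with-statement here:
with urllib.request.urlopen(url) as data:
    # do processing here
Python 2 note
Import urllib2 instead of urllib.request. The object returned by urllib2.urlopen is not a context manager, so wrap it in contextlib.closing if you want to use a with-statement.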
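In Python 3, `read()` on the response returns bytes, so the body has to be decoded before any string handling. The following is a rough sketch of a fetch helper with a timeout. It assumes the server sends UTF-8, and both the name `fetch_text` and the `opener` parameter are hypothetical, added only so the function can be tried without network access.

```python
import urllib.request


def fetch_text(url, timeout=10, opener=urllib.request.urlopen):
    """Fetch url and return the response body as a str.

    Assumes the body is UTF-8 encoded. `opener` is a hypothetical hook
    for substituting a fake in tests; it defaults to urlopen.
    """
    with opener(url, timeout=timeout) as response:
        # read() returns bytes in Python 3, so decode before string handling
        return response.read().decode('utf-8')
```

Called as `fetch_text(url)`, this raises an `OSError` subclass such as `urllib.error.URLError` when the request fails, so the caller decides how to report the error.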
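Unwrapping JSONP
The result you pasted looks like JSONP. Assuming the wrapping function that is called (oo.visualization.Query.setResponse) doesn't change, we can simply strip this method call out.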
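# read() returns bytes in Python 3; decode (assuming UTF-8) before comparing with str
result = data.read().decode('utf-8')
prefix = 'oo.visualization.Query.setResponse('
suffix = ');'
if result.startswith(prefix) and result.endswith(suffix):
    result = result[len(prefix):-len(suffix)]
Parsing JSON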
The resulting result string is just JSON data. Parse it with the built-in json module.
import json
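result_object = json.loads(result)
Traversing the object graph
Now you have a result_object that represents the JSON response. The object itself is a dict with keys such as version, reqId, and so on. Based on your question, here is what you need to do to build your list.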
# Get the rows in the table, then get the second column's value
# (index 1) for each row
terms = [row['c'][1]['v'] for row in result_object['table']['rows']]
Putting it all together
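Before wiring in the network call, the three parsing steps can be tried offline against a canned response. This is a minimal sketch, assuming the server sends strict JSON inside the wrapper (double-quoted keys and strings). The sample pasted in the question uses bare keys and single quotes, which `json.loads` rejects, so it is worth checking what the real endpoint returns. `SAMPLE` and `extract_terms` are illustrative names, and the payload is a trimmed, hand-written stand-in rather than real server output.

```python
import json

PREFIX = 'oo.visualization.Query.setResponse('
SUFFIX = ');'

# Hand-written stand-in for a server response, using strict JSON syntax
SAMPLE = (
    PREFIX
    + '{"version": "0.5", "status": "ok", "table": {"rows": ['
    + '{"c": [{"v": 0.99, "f": "0.99"}, {"v": "newterm1"}]}, '
    + '{"c": [{"v": 0.98, "f": "0.98"}, {"v": "newterm2"}]}]}}'
    + SUFFIX
)


def extract_terms(text):
    """Strip the JSONP wrapper, parse the payload, return the terms."""
    if text.startswith(PREFIX) and text.endswith(SUFFIX):
        text = text[len(PREFIX):-len(SUFFIX)]
    payload = json.loads(text)
    # Each row holds its cells in 'c'; the term is the second cell (index 1)
    return [row['c'][1]['v'] for row in payload['table']['rows']]


print(extract_terms(SAMPLE))
```

Keeping the parsing in a function that takes a plain string makes it easy to test separately from the HTTP request.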
#!/usr/bin/env python3
"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python3 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib.request
import json
import sys

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2


def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'
    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]
    # Deserialize JSON to Python objects
    result_object = json.loads(result)
    # Get the rows in the table, then get the second column's value
    # (index 1) for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]


def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit={limit}&query={seedterm}'
    url = url_template.format(limit=limit, seedterm=seedterm)
    try:
        with urllib.request.urlopen(url) as data:
            # read() returns bytes; decode (assuming UTF-8) to get a str
            result = data.read().decode('utf-8')
    except OSError:
        # urllib.error.URLError is a subclass of OSError
        print('Could not request data from server', file=sys.stderr)
        sys.exit(E_OPERATION_ERROR)
    return parse_result(result)


def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print(term)


if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm
limit must be an integer'''.format(sys.argv[0])
        print(error_message, file=sys.stderr)
        sys.exit(E_INVALID_PARAMS)
    sys.exit(main(limit, seedterm))
Python 2.7 version
#!/usr/bin/env python2.7
"""A script for retrieving and parsing results from requests to
somewhere.com.

This script works as either a standalone script or as a library. To use
it as a standalone script, run it as `python2.7 scriptname.py`. To use it
as a library, use the `retrieve_terms` function."""

import urllib2
import json
import sys
from contextlib import closing

E_OPERATION_ERROR = 1
E_INVALID_PARAMS = 2


def parse_result(result):
    """Parse a JSONP result string and return a list of terms"""
    prefix = 'oo.visualization.Query.setResponse('
    suffix = ');'
    # Strip JSONP function wrapper
    if result.startswith(prefix) and result.endswith(suffix):
        result = result[len(prefix):-len(suffix)]
    # Deserialize JSON to Python objects
    result_object = json.loads(result)
    # Get the rows in the table, then get the second column's value
    # (index 1) for each row
    return [row['c'][1]['v'] for row in result_object['table']['rows']]


def retrieve_terms(limit, seedterm):
    """Retrieves and parses data and returns a list of terms"""
    url_template = 'http://somewhere.com/relatedqueries?limit=%(limit)d&query=%(seedterm)s'
    url = url_template % dict(limit=limit, seedterm=seedterm)
    try:
        # The urllib2 response is not a context manager, so wrap it in closing()
        with closing(urllib2.urlopen(url)) as data:
            result = data.read()
    except IOError:
        # urllib2.URLError is a subclass of IOError
        sys.stderr.write('%s\n' % 'Could not request data from server')
        sys.exit(E_OPERATION_ERROR)
    return parse_result(result)


def main(limit, seedterm):
    """Retrieves and parses data and prints each term to standard output"""
    terms = retrieve_terms(limit, seedterm)
    for term in terms:
        print term


if __name__ == '__main__':
    try:
        limit = int(sys.argv[1])
        seedterm = sys.argv[2]
    except (IndexError, ValueError):
        error_message = '''{} limit seedterm
limit must be an integer'''.format(sys.argv[0])
        sys.stderr.write('%s\n' % error_message)
        sys.exit(E_INVALID_PARAMS)
    sys.exit(main(limit, seedterm))
Posted on 2010-10-30 07:27:35
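I didn't fully understand your problem, because from your code it looks like you are using the Visualization API (which, by the way, is the first time I have heard of it).
But if you are just looking for a way to get data from a web page, you can use urllib2. That only fetches the data; to parse what you retrieved you will need a more suitable library such as BeautifulSoup.
If you are dealing with another kind of web service (RSS, Atom, RPC) rather than web pages, you can find plenty of Python libraries that handle each of those services well.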
import urllib2
from BeautifulSoup import BeautifulSoup
result = urllib2.urlopen('http://somewhere.com/relatedqueries?limit=%s&query=%s' % (2, 'seedterm'))
htmltext = result.read()
result.close()
soup = BeautifulSoup(htmltext, convertEntities="html")
# You can parse your data now; check the BeautifulSoup API.
https://stackoverflow.com/questions/4056375