Q: Parsing HTML with BeautifulSoup4

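Stack Overflow user

Asked 2014-08-08 05:11:10

1 answer, 474 views, 0 followers, 0 votes

I need to scrape all the high school names, along with their cities, from this website using BeautifulSoup4. I have added my non-working code below. Thank you very much.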

http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas

import urllib2
bs4 import BeautifulSoup

opener = urllib2.build_opener()
opener.addheaders = [('User-again','Mozilla/5.0' ) ]

url = ("http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas")

ourUrl = opener.open(url).read()

soup = BeautifulSoup(ourUrl)

print get_text(soup.find_all('il'))

Tags: python, html, parsing, beautifulsoup

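1 Answer

Stack Overflow user

Answered 2014-08-08 06:00:30

There are a number of errors in your program. Here is a working example that you can use as a basis for further refinement.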

import requests # much better than using urllib2
from bs4 import BeautifulSoup # you forgot the `from`

url = "http://en.wikipedia.org/wiki/List_of_high_schools_in_Texas" 
# you don't need () around it
r = requests.get(url) 
# does everything all at once, no need to call `opener` and `read()`
contents = r.text # get the HTML contents of the page

soup = BeautifulSoup(contents, "html.parser") # name the parser explicitly
for item in soup.find_all('li'): # 'li' and 'il' are different things...
    print(item.get_text())       # you need to iterate over all the elements
                                 # found by `find_all()`

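That's it. This will get you the text of every <li>...</li> item on the page. When you run the program, you will see many irrelevant results, such as the table of contents, the menu items on the left, the footer, and so on. I'll leave it to you to figure out how to get only the school names and separate them from the county names and other clutter.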
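As a starting point for that filtering step, here is a minimal sketch rather than a definitive solution. It assumes that the school entries live inside a container with the id `mw-content-text` and that each entry reads "Name, City"; both are assumptions about the page layout that you should check against the real markup. The sample HTML, the school names, and the `extract_schools` helper are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that imitates the ASSUMED page layout. The real
# Wikipedia markup may differ, so inspect it before relying on this.
SAMPLE_HTML = """
<div id="mw-navigation"><ul><li>Main page</li><li>Contents</li></ul></div>
<div id="mw-content-text">
  <h2>Example County</h2>
  <ul>
    <li><a href="#">Alpha High School</a>, Alphaville</li>
    <li><a href="#">Beta High School</a>, Betatown</li>
  </ul>
  <h2>See also</h2>
  <ul><li>List of school districts</li></ul>
</div>
"""


def extract_schools(html, container_id="mw-content-text"):
    """Return (name, city) pairs for list items that look like 'Name, City'."""
    soup = BeautifulSoup(html, "html.parser")
    # Restrict the search to the main content area (assumed id); fall back
    # to the whole document if that container is not present.
    container = soup.find(id=container_id) or soup
    schools = []
    for item in container.find_all("li"):
        text = item.get_text().strip()
        # Split on the LAST comma so names that contain commas stay intact.
        name, sep, city = text.rpartition(",")
        # Keep only entries that have a comma and mention "School".
        if sep and "School" in name:
            schools.append((name.strip(), city.strip()))
    return schools


schools = extract_schools(SAMPLE_HTML)
for name, city in schools:
    print(name + ", " + city)
```

To run this against the live page, pass `r.text` from the example above instead of `SAMPLE_HTML`. The "School" keyword check is a crude heuristic; entries that are formatted differently would need their own handling.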

For reference, please read the BeautifulSoup documentation carefully. It will answer a lot of your questions.
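One point the documentation covers, and which the code in the question ran into, is the difference between `find_all()` and `find()`. The sketch below uses a small hypothetical HTML snippet to show that `find_all()` returns a list-like collection that has to be iterated over, while `find()` returns a single tag, and that a misspelled tag name such as 'il' simply matches nothing.

```python
from bs4 import BeautifulSoup

html = "<ul><li>first</li><li>second</li></ul>"  # hypothetical sample markup
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all("li")  # list-like ResultSet of Tag objects
first = soup.find("li")      # a single Tag, or None if nothing matches

# get_text() is a method of each Tag, so call it once per element.
texts = [item.get_text() for item in items]
print(texts)
print(first.get_text())

# A misspelled tag name raises no error; it just yields an empty result.
misspelled = soup.find_all("il")
print(len(misspelled))
```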

Votes: 1

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link: https://stackoverflow.com/questions/25192278
