Q: BeautifulSoup bs.find_all('a') does not work on a web page

Stack Overflow user

Asked 2022-03-31 17:56:20

2 answers · 69 views · 0 followers · 0 votes

Can anyone explain whether there is a way to scrape the links from this web page, https://hackmd.io/@nearly-learning/near-201, using BeautifulSoup, or whether it can only be done with Selenium?

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://hackmd.io/@nearly-learning/near-201'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'lxml')  # also tried all other parsers
links = bs.find_all('a')  # only obtains 23 links, when there are actually loads more
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])

This returns only a few links, and none of them come from the actual body of the article.

However, I can do it with Selenium:

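from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 style: pass the driver path via Service and locate elements with By
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://hackmd.io/@nearly-learning/near-201")
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))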

But I would like to use BeautifulSoup if possible! Does anyone know whether it can be done?

beautifulsoup

findall

python

web-scraping

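2 Answers

Stack Overflow user

Answered 2022-03-31 18:28:36

If you do not want to use Selenium, you can render the page's Markdown text to HTML with the Markdown package and then parse the result with BeautifulSoup: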

import markdown  # pip install Markdown
import requests
from bs4 import BeautifulSoup

# 1. get raw markdown text
url = "https://hackmd.io/@nearly-learning/near-201"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
md_txt = soup.select_one("#doc").text

# 2. render the markdown to HTML
html = markdown.markdown(md_txt)

# 3. parse it again and find all <a> links
soup = BeautifulSoup(html, "html.parser")

for a in soup.select("a[href]"):
    print(a["href"])

Prints:

https://cdixon.org/2018/02/18/why-decentralization-matters
https://docs.near.org/docs/concepts/gas#ballpark-comparisons-to-ethereum
https://docs.near.org/docs/roles/integrator/exchange-integration#blocks-and-finality
https://docs.near.org/docs/concepts/architecture/papers
https://explorer.near.org/nodes/validators
https://explorer.near.org/stats
https://docs.near.org/docs/develop/contracts/rust/intro
https://docs.near.org/docs/develop/contracts/as/intro
https://docs.near.org/docs/api/rpc

...and so on.
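
As a rough alternative to rendering the Markdown first, the link targets can also be pulled straight out of the raw Markdown text with regular expressions. The sketch below is not part of the original answer: `links_from_markdown` and the sample text are made up for illustration, and the patterns only cover inline links of the form `[text](url)` and `<url>` autolinks, not reference-style links or URLs that contain parentheses.

```python
import re

# Inline links such as [text](url) or [text](url "title"); images (![alt](src)) are skipped.
INLINE_LINK = re.compile(r'(?<!!)\[[^\]]*\]\(\s*<?([^)\s>]+)>?(?:\s+"[^"]*")?\s*\)')
# Autolinks such as <https://example.com>.
AUTOLINK = re.compile(r'<(https?://[^>\s]+)>')


def links_from_markdown(md_text):
    """Return link targets in order of appearance, without duplicates."""
    hits = []
    for pattern in (INLINE_LINK, AUTOLINK):
        hits.extend((m.start(), m.group(1)) for m in pattern.finditer(md_text))

    seen, urls = set(), []
    for _, url in sorted(hits):  # sort by position in the text
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# Hypothetical Markdown snippet, standing in for the text of the "#doc" element.
SAMPLE_MD = (
    "Read [why decentralization matters]"
    "(https://cdixon.org/2018/02/18/why-decentralization-matters).\n"
    'See the [RPC docs](https://docs.near.org/docs/api/rpc "RPC") '
    "and <https://explorer.near.org/stats>.\n"
    "Again: [RPC](https://docs.near.org/docs/api/rpc)\n"
)

for url in links_from_markdown(SAMPLE_MD):
    print(url)
```

Rendering with the Markdown package, as in the answer, is the more robust route; the regular expressions are only worth considering when adding a dependency is not an option.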

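Votes: 3

Stack Overflow user

Answered 2022-03-31 18:19:26

As already mentioned, you need Selenium or something similar to render all of the content. If you prefer to select elements the BeautifulSoup way, you can use Selenium and BeautifulSoup together.

Just pass driver.page_source to BeautifulSoup():

bs = BeautifulSoup(driver.page_source, "html.parser")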
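
To illustrate why a plain HTTP fetch finds so few links, here is a dependency-free sketch that uses the standard library's `html.parser` instead of BeautifulSoup. Both HTML strings are invented stand-ins, not the real page: the first mimics a response whose article body is still Markdown text inside a `#doc` element (as the other answer's `select_one("#doc")` suggests), and the second mimics the DOM after the browser has presumably rendered that Markdown. `AnchorCollector` and `anchor_hrefs` are hypothetical helper names.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags, similar to find_all('a') plus an href check."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def anchor_hrefs(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Invented example of what an HTTP client receives: the page chrome has a real
# <a> tag, but the article body is plain Markdown text, not HTML anchors.
STATIC_HTML = """
<html><body>
  <nav><a href="/login">Sign in</a></nav>
  <div id="doc">Read the [RPC docs](https://docs.near.org/docs/api/rpc).</div>
</body></html>
"""

# Invented example of the same page after client-side rendering.
RENDERED_HTML = """
<html><body>
  <nav><a href="/login">Sign in</a></nav>
  <div id="doc"><p>Read the <a href="https://docs.near.org/docs/api/rpc">RPC docs</a>.</p></div>
</body></html>
"""

print(anchor_hrefs(STATIC_HTML))
print(anchor_hrefs(RENDERED_HTML))
```

An HTML parser only sees anchors that exist as `<a>` elements in the markup it is given, which is why the rendered page source from a browser yields more links than the raw response.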

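Example

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://hackmd.io/@nearly-learning/near-201")

# Parse the browser-rendered DOM rather than the raw HTTP response
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for link in soup.select('a[href]'):
    print(link['href'])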
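
One practical difference from the Selenium-only version: BeautifulSoup returns each `href` exactly as it is written in the markup, so relative or fragment-only values may appear, whereas Selenium's `get_attribute("href")` typically returns a resolved URL. A minimal sketch for normalizing the BeautifulSoup output, assuming a hypothetical helper `absolute_links` and made-up sample hrefs, could look like this:

```python
from urllib.parse import urljoin


def absolute_links(hrefs, base_url):
    """Resolve hrefs against base_url and drop duplicates, keeping first-seen order."""
    seen, result = set(), []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


BASE_URL = "https://hackmd.io/@nearly-learning/near-201"

# Made-up hrefs of the kinds a parsed page might contain.
raw_hrefs = [
    "/login",                              # root-relative
    "https://docs.near.org/docs/api/rpc",  # already absolute
    "/login",                              # duplicate
    "#overview",                           # fragment-only
]

for url in absolute_links(raw_hrefs, BASE_URL):
    print(url)
```

With the loop from the example, the input would be `[link['href'] for link in soup.select('a[href]')]`.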

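Votes: 1

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's dedicated IT-domain engine.

Original link: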

https://stackoverflow.com/questions/71697148
