
BeautifulSoup bs.find_all('a') not working on a webpage

Stack Overflow user
Asked 2022-03-31 17:56:20
Answers: 2 · Views: 69 · Followers: 0 · Votes: 0

Can anyone explain whether there is a way to scrape the links from this webpage, https://hackmd.io/@nearly-learning/near-201, using BeautifulSoup, or is it only possible with Selenium?

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'https://hackmd.io/@nearly-learning/near-201'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'lxml')  # also tried all the other parsers
links = bs.find_all('a')  # only finds 23 links, when there are actually many more
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])

This returns only a few links, and none of them come from the actual body of the article.

I can, however, do it with Selenium:

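from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 style: the driver path is passed through Service, and
# find_elements(By.XPATH, ...) replaces find_elements_by_xpath(...).
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://hackmd.io/@nearly-learning/near-201")
elems = driver.find_elements(By.XPATH, "//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))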

But I would like to use BeautifulSoup if possible. Does anyone know whether it is?
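The short list is consistent with the server sending the note body as raw Markdown text and leaving the rendering to JavaScript in the browser. That is an assumption about this page, not something verified here. The sketch below illustrates the effect with a made-up HTML sample (`SAMPLE_HTML` and its URLs are hypothetical) and uses the standard library's `HTMLParser` instead of BeautifulSoup so that it runs without extra packages. Only real `<a>` tags are found; Markdown-style links inside the text are invisible to an HTML parser.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag present in the HTML source."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


# Hypothetical stand-in for an unrendered page: one real <a> tag in the
# navigation, and the article body as raw Markdown inside <div id="doc">.
SAMPLE_HTML = """
<html><body>
  <nav><a href="/login">Sign in</a></nav>
  <div id="doc">
# Title

Read [the docs](https://example.com/docs) and [the FAQ](https://example.com/faq).
  </div>
</body></html>
"""

collector = LinkCollector()
collector.feed(SAMPLE_HTML)
static_links = collector.hrefs
print(static_links)  # the two Markdown links are plain text, so they are not listed
```

If the real page behaves the same way, any HTML parser fed the raw response will miss the body links regardless of which parser backend is chosen.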

2 Answers

Stack Overflow user

Answered 2022-03-31 18:28:36

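If you don't want to use Selenium, you can read the note's raw Markdown text from the page, render it to HTML with the Markdown package, and then parse that HTML with BeautifulSoup: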

import markdown  # pip install Markdown
import requests
from bs4 import BeautifulSoup

# 1. get raw markdown text
url = "https://hackmd.io/@nearly-learning/near-201"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
md_txt = soup.select_one("#doc").text

# 2. render the markdown to HTML
html = markdown.markdown(md_txt)

# 3. parse it again and find all <a> links
soup = BeautifulSoup(html, "html.parser")

for a in soup.select("a[href]"):
    print(a["href"])

Prints:

https://cdixon.org/2018/02/18/why-decentralization-matters
https://docs.near.org/docs/concepts/gas#ballpark-comparisons-to-ethereum
https://docs.near.org/docs/roles/integrator/exchange-integration#blocks-and-finality
https://docs.near.org/docs/concepts/architecture/papers
https://explorer.near.org/nodes/validators
https://explorer.near.org/stats
https://docs.near.org/docs/develop/contracts/rust/intro
https://docs.near.org/docs/develop/contracts/as/intro
https://docs.near.org/docs/api/rpc

...and so on.
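If installing the Markdown package is not an option, a rougher alternative is to pull inline links straight out of the Markdown text with a regular expression. This is a different technique from the render-then-parse approach above, shown only as a sketch: the pattern and the helper name `extract_markdown_links` are illustrative, the sample text is made up, and reference-style links, bare URLs, and URLs containing spaces or parentheses are not handled.

```python
import re

# Matches inline Markdown links such as [label](https://example.com "title").
# The (?<!!) lookbehind skips images, which are written as ![alt](src).
INLINE_LINK = re.compile(r'(?<!!)\[([^\]]*)\]\(\s*([^\s)]+)(?:\s+"[^"]*")?\s*\)')


def extract_markdown_links(md_text):
    """Return the URLs of inline Markdown links, in order of appearance."""
    return [match.group(2) for match in INLINE_LINK.finditer(md_text)]


# Made-up sample text; in the answer above, md_txt would be passed instead.
sample = (
    "See [why it matters](https://example.com/why) and\n"
    '[the RPC docs](https://example.com/rpc "API"), plus ![logo](logo.png).\n'
)
links = extract_markdown_links(sample)
for url in links:
    print(url)
```

Rendering with a real Markdown parser remains the more robust choice, since it follows the full link syntax rather than one pattern.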
Votes: 3

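Stack Overflow user

Answered 2022-03-31 18:19:26

As already mentioned, you need Selenium or something similar to render all of the content. If you prefer selecting elements the BeautifulSoup way, you can combine Selenium and BeautifulSoup.

Just pass driver.page_source to BeautifulSoup():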

bs = BeautifulSoup(driver.page_source, "html.parser")

Example

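from bs4 import BeautifulSoup  # not aliased to "bs", which is used as a variable below
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://hackmd.io/@nearly-learning/near-201")

# Hand the browser-rendered HTML to BeautifulSoup, then close the browser
bs = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for link in bs.select('a[href]'):
    print(link['href'])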
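Whichever way the hrefs are collected, they come back exactly as written in the markup, so the list may contain relative paths, in-page anchors, non-web schemes, and duplicates. A minimal clean-up sketch using only the standard library follows; the helper name `normalize_links` and the sample hrefs (including the `/features` path) are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def normalize_links(hrefs, base_url):
    """Resolve relative hrefs, skip anchors and non-HTTP schemes, de-duplicate."""
    seen = set()
    result = []
    for href in hrefs:
        href = href.strip()
        if not href or href.startswith("#"):
            continue  # empty value or in-page anchor
        absolute = urljoin(base_url, href)
        if urlsplit(absolute).scheme not in ("http", "https"):
            continue  # e.g. mailto: or javascript:
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


base = "https://hackmd.io/@nearly-learning/near-201"
raw = [
    "/features",                            # hypothetical relative path
    "#Overview",                            # hypothetical in-page anchor
    "https://docs.near.org/docs/api/rpc",
    "https://docs.near.org/docs/api/rpc",   # duplicate
    "mailto:someone@example.com",
    "javascript:void(0)",
]
cleaned = normalize_links(raw, base)
for url in cleaned:
    print(url)
```

Order of first appearance is preserved, which keeps the output comparable with the page layout.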
Votes: 1
Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud Xiaowei's IT-domain engine.
Original link:

https://stackoverflow.com/questions/71697148
