Can anyone explain definitively whether there is a way to scrape the links from this web page, https://hackmd.io/@nearly-learning/near-201, using BeautifulSoup, or is it only possible with Selenium?
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://hackmd.io/@nearly-learning/near-201'
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'lxml')  # also tried all the other parsers
links = bs.find_all('a')  # only finds 23 links, when there are actually many more
for link in links:
    if 'href' in link.attrs:
        print(link.attrs['href'])
This returns only a few links, none of which are in the actual body of the article.
However, I am able to do this with Selenium:
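from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://hackmd.io/@nearly-learning/near-201")
elems = driver.find_elements(By.XPATH, "//a[@href]")  # Selenium 4 form of find_elements_by_xpath
for elem in elems:
    print(elem.get_attribute("href"))
But I would like to use BeautifulSoup if possible! Does anyone know whether that can be done?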
Posted on 2022-03-31 18:28:36
If you don't want to use selenium, you can use the Markdown package to render the markdown text to HTML and then parse it with BeautifulSoup:
import markdown # pip install Markdown
import requests
from bs4 import BeautifulSoup
# 1. get raw markdown text
url = "https://hackmd.io/@nearly-learning/near-201"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
md_txt = soup.select_one("#doc").text
# 2. render the markdown to HTML
html = markdown.markdown(md_txt)
# 3. parse it again and find all <a> links
soup = BeautifulSoup(html, "html.parser")
for a in soup.select("a[href]"):
    print(a["href"])
Prints:
https://cdixon.org/2018/02/18/why-decentralization-matters
https://docs.near.org/docs/concepts/gas#ballpark-comparisons-to-ethereum
https://docs.near.org/docs/roles/integrator/exchange-integration#blocks-and-finality
https://docs.near.org/docs/concepts/architecture/papers
https://explorer.near.org/nodes/validators
https://explorer.near.org/stats
https://docs.near.org/docs/develop/contracts/rust/intro
https://docs.near.org/docs/develop/contracts/as/intro
https://docs.near.org/docs/api/rpc
...and so on.
Posted on 2022-03-31 18:19:26
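As already mentioned, rendering all of the content requires selenium or something similar, and you can use selenium and BeautifulSoup together if you prefer to select elements that way.
Just pass driver.page_source to BeautifulSoup():
bs = BeautifulSoup(driver.page_source, "html.parser")
Example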
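from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
# Selenium 4 takes the driver path wrapped in a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://hackmd.io/@nearly-learning/near-201")
# parse the browser-rendered HTML with BeautifulSoup
bs = BeautifulSoup(driver.page_source, "html.parser")
for link in bs.select('a[href]'):
    print(link['href'])
https://stackoverflow.com/questions/71697148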
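Both answers end up handing an HTML string to BeautifulSoup, whether it comes from markdown.markdown() or from driver.page_source. As a rough sketch of that last step, the helper below collects the links from any such string, resolves relative hrefs against the page URL, and drops duplicates. The name extract_links and the sample markup are made up for illustration and are not taken from either answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # pip install beautifulsoup4


def extract_links(html, base_url):
    """Return the unique absolute URLs of all <a href> tags, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    links = []
    for a in soup.select("a[href]"):
        # Resolve relative hrefs such as "/s/features" against base_url.
        absolute = urljoin(base_url, a["href"])
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Stand-in for driver.page_source or rendered markdown (hypothetical markup).
sample_html = """
<div id="doc">
  <a href="https://docs.near.org/docs/api/rpc">RPC</a>
  <a href="/s/features">Features</a>
  <a href="https://docs.near.org/docs/api/rpc">RPC again</a>
  <a name="no-href">anchor without href</a>
</div>
"""

for link in extract_links(sample_html, "https://hackmd.io/@nearly-learning/near-201"):
    print(link)
```

With the sample markup this prints two URLs: the docs.near.org link once, and https://hackmd.io/s/features. In real use you would pass driver.page_source (or the HTML produced from the markdown) in place of sample_html.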