I am trying to scrape the table from the following web page in Google Colab: https://247sports.com/college/penn-state/Sport/Football/AllTimeRecruits/
Here is the Python script I am trying to use:
import requests
import pandas as pd
from bs4 import BeautifulSoup

Team = 'penn-state'
url = "https://247sports.com/college/" + str(Team) + "/Sport/Football/AllTimeRecruits/"
# Add the `user-agent` otherwise we will get blocked when sending the request
headers = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"}
response = requests.get(url, headers=headers).content
soup = BeautifulSoup(response, "html.parser")
data = []
for tag in soup.find_all("li", class_="ri-page__list-item"):
    rank = tag.find_next("span", class_="all-time-rank").text
    # Both lookups below match the same first "meta" span (this is the problem)
    school = tag.find_next("span", class_="meta").text
    year = tag.find_next("span", class_="meta").text
    name = tag.find_next("a", class_="ri-page__name-link").text
    position = tag.find_next("div", class_="position").text
    height_weight = tag.find_next("div", class_="metrics").text
    rating = tag.find_next("span", class_="score").text
    nat_rank = tag.find_next("a", class_="natrank").text
    state_rank = tag.find_next("a", class_="sttrank").text
    pos_rank = tag.find_next("a", class_="posrank").text
    # status = tag.find_next("p", class_="commit-date withDate").text
    data.append(
        {
            "Rank": rank,
            "Name": name,
            "School": school,
            "Class of": year,
            "Position": position,
            "Height & Weight": height_weight,
            "Rating": rating,
            "National Rank": nat_rank,
            "State Rank": state_rank,
            "Position Rank": pos_rank,
            # "Date": status,
        }
    )
df = pd.DataFrame(data)
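df

I would like a column that shows which recruiting class each player belongs to. For example, if a player is from the "Class of 2005", I want "2005" as the value in the "year" column. This is the output I currently get: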
     Rank  Name              School                                       Class of                                     Position  Height & Weight  Rating  National Rank  State Rank  Position Rank
0    1     Derrick Williams  Eleanor Roosevelt (Greenbelt, MD)            Eleanor Roosevelt (Greenbelt, MD)            WR        6-0 / 190        0.9986  4              1           2
1    2     Micah Parsons     Harrisburg (Harrisburg, PA)                  Harrisburg (Harrisburg, PA)                  WDE       6-3 / 235        0.9982  5              1           2
2    3     Justin Shorter    South Brunswick (Monmouth Junction, NJ) ...  South Brunswick (Monmouth Junction, NJ) ...  WR        6-4 / 213        0.9962  8              1           1
3    4     Dan Connor        Strath Haven (Wallingford, PA)               Strath Haven (Wallingford, PA)               ILB       6-3 / 215        0.9944  13             1           2
4    5     Justin King       Gateway (Monroeville, PA)                    Gateway (Monroeville, PA)                    CB        6-0 / 185        0.9942  15             1           2
...  ...   ...               ...                                          ...                                          ...       ...              ...     ...            ...         ...
242  243   Will Levis        Xavier (Middletown, CT)                      Xavier (Middletown, CT)                      PRO       6-4 / 222        0.8689  652            2           28
243  244   Troy Reeder       Salesianum (Wilmington, DE)                  Salesianum (Wilmington, DE)                  ILB       6-2 / 230        0.8687  500            2           22
244  245   Jake Cooper       Archbishop Wood (Warminster, PA)             Archbishop Wood (Warminster, PA)             ILB       6-1 / 220        0.8686  520            11          17
245  246   Jon Ditto         Gateway (Monroeville, PA)                    Gateway (Monroeville, PA)                    WR        6-3 / 221        0.8684  417            16          52
246  247   Shareef Miller    George Washington (Philadelphia, PA)         George Washington (Philadelphia, PA)         SDE       6-5 / 230        0.8681  525            12          27
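247 rows × 10 columns

However, the "Class of" column just duplicates the school. Looking at the HTML, this happens because the high school and the year are both stored in "span" elements with the same class, so both lookups return the first one. Is there a way to pull out the high school and the year separately, based on how the HTML is structured?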
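Any help with getting this to work would be greatly appreciated.
Posted on 2021-05-29 00:13:08
Each list item has two spans with the meta class: the first holds the school and the second holds the year (always in that order). So you can use find_all to get both of them, then take school from the first and year from the second: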
for tag in soup.find_all("li", class_="ri-page__list-item"):
    meta = tag.find_all("span", class_="meta")
    school = meta[0].text
    year = meta[1].text.replace('Class of ', '')
    # extract other fields...
    # data.append(...)
https://stackoverflow.com/questions/67741919
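To try the find_all approach without hitting the site, here is a minimal, self-contained sketch. The HTML string is made-up markup that only borrows the class names from the question (the real page may be structured differently), and `parse_items` and `SAMPLE_HTML` are hypothetical names used for illustration. It also pulls the year out with a regular expression instead of `replace`, which is an optional variation that tolerates extra whitespace around the text.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup for illustration only; it reuses the class names from the
# question, but the real page may differ.
SAMPLE_HTML = """
<ul>
  <li class="ri-page__list-item">Header row without meta spans</li>
  <li class="ri-page__list-item">
    <a class="ri-page__name-link">Player A</a>
    <span class="meta">Example High (Town, ST)</span>
    <span class="meta"> Class of 2005 </span>
  </li>
  <li class="ri-page__list-item">
    <a class="ri-page__name-link">Player B</a>
    <span class="meta">Sample Prep (City, ST)</span>
    <span class="meta">Class of 2018</span>
  </li>
</ul>
"""


def parse_items(html):
    """Return one dict per list item that has both a school and a year span."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tag in soup.find_all("li", class_="ri-page__list-item"):
        # find_all searches only inside this <li>, unlike find_next
        meta = tag.find_all("span", class_="meta")
        if len(meta) < 2:
            continue  # skip items that lack either span (e.g. a header row)
        school = meta[0].get_text(strip=True)
        # Grab the first four-digit number from the second span
        match = re.search(r"\d{4}", meta[1].get_text())
        year = match.group() if match else None
        name_tag = tag.find("a", class_="ri-page__name-link")
        name = name_tag.get_text(strip=True) if name_tag else None
        rows.append({"Name": name, "School": school, "Class of": year})
    return rows


rows = parse_items(SAMPLE_HTML)
for row in rows:
    print(row)
```

The length check is there because indexing `meta[1]` raises an IndexError on any item that does not contain both spans; whether the real page has such items is something you would need to check.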