Q: I'm trying to merge two arrays so that they are arranged as (ip, port). How can I arrange them the way I want?

Stack Overflow user

Asked 2022-01-27 02:08:50


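As the title says. I should note that this project parses IPs, ports, and their type (HTTPS or not) from a free proxy website, then tests them on Linux to check whether they work. It stores them as tuples inside a tuple and writes them to a CSV file.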

import requests
import lxml
from bs4 import BeautifulSoup
import csv

names = []

url = 'https://free-proxy-list.net/'
page = requests.get(url)
soup = BeautifulSoup(page.content, features='lxml')
headers = soup.find_all('th')
headers_refined = []
headers_refined.append(headers[0])
headers_refined.append(headers[1])
headers_refined.append(headers[6])
ips = soup.find_all('td')


ips = ips[::8]
ports = soup.find_all('td')
ports = ports[1::8]

element_index = 0
for i in ips:
    ips[element_index] = str(ips[element_index])
    element_index += 1
    
element_index = 0
for i in headers_refined:
    headers_refined[element_index] = str(headers_refined[element_index])
    element_index += 1
    
element_index = 0
for i in ports:
    ports[element_index] = str(ports[element_index])
    element_index += 1
    
ips = ' '.join(ips).replace('<td>', '').split()
ips = ' '.join(ips).replace('</td>', '').split()
ips = ips[:-43:]
headers_refined = ' '.join(headers_refined).replace('<th>', '').split()
headers_refined = ' '.join(headers_refined).replace('</th>', '').split()
headers_refined = ' '.join(headers_refined).replace('<th class="hx">', '').split()
ports = ' '.join(ports).replace('<td>', '').split()
ports = ' '.join(ports).replace('</td>', '').split()
while len(ports)>len(ips):
    ports=ports[:-1:]
prev_len_ips=len(ips)
index=0
for i in range(prev_len_ips):
    ips.insert(i+1,ports[i])


# print(headers_refined)
# print(ips)
# print(ports)
print(prev_len_ips)
print(len(ports))



print(ips)
ips = [*zip(ips[::2])]
with open('ips.csv', '+w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerows(ips)

The code above prints the list in the following order:

['IP','port','port','port','port',...]

until it runs out of available ports. After that, it prints the IPs that remain in the list.
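A minimal sketch that reproduces this ordering without touching the network. The addresses, the ports, and the name `merged` are made-up stand-ins for the scraped values, not output from the site:

```python
# Made-up sample data standing in for the scraped IPs and ports
ips = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
ports = ['8080', '3128', '80']

merged = list(ips)
for i in range(len(ips)):
    # The target index grows by 1 per step, but each insert also pushes the
    # remaining IPs one slot to the right, so every port lands directly
    # after the previous port instead of after its own IP.
    merged.insert(i + 1, ports[i])

print(merged)
# ['10.0.0.1', '8080', '3128', '80', '10.0.0.2', '10.0.0.3']
```

Inserting at `2 * i + 1` would interleave the two lists correctly, although the later `zip(ips[::2])` would still keep only the IPs.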

I would gladly accept any other suggestions for improving and optimizing my code so that it looks better. Thanks in advance!

Tags: python, csv, beautifulsoup, python-requests, lxml

1 Answer

Stack Overflow user

Accepted answer

Posted 2022-01-27 07:25:39

There are simpler ways to get what you want from that page. Since you are already using lxml as the parser, this does exactly what you need:

from urllib.request import urlopen, Request
from lxml import etree

# free-proxy-list.net doesn't like Python announcing itself, use at your own risk
req = Request(
    'https://free-proxy-list.net/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)

# reading the contents of the page, getting the part you need
with urlopen(req) as f:
    root = etree.parse(f, parser=etree.HTMLParser())
    # get the proxies from the only textarea on the page, skip the description and timestamp
    proxies = root.xpath('*//textarea/text()')[0].split('\n')[3:]

# the format you want
proxies = [tuple(proxy.split(':')) for proxy in proxies]
print(proxies)

No external dependencies beyond lxml (no bs4 or requests), and only a few lines of code.

Result:

[('64.17.30.238', '63141'), ('62.33.210.34', '58918'), ... ]
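If you would rather keep two separate lists as in the question, pairing them with `zip` avoids the index bookkeeping altogether. This is a minimal sketch, not part of the original answer: the sample addresses, the header names, and the in-memory `buffer` (standing in for `open('ips.csv', 'w', newline='')`) are all assumptions for illustration.

```python
import csv
import io

# Made-up sample data; in the real script these come from the scraped page
ips = ['10.0.0.1', '10.0.0.2', '10.0.0.3']
ports = ['8080', '3128', '80', '1080']  # one extra port on purpose

# zip stops at the shorter input, so no manual trimming loop is needed
pairs = list(zip(ips, ports))

# In-memory stand-in for the CSV file so the sketch has no side effects
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['IP Address', 'Port'])  # assumed header names
writer.writerows(pairs)                  # one (ip, port) tuple per row
csv_text = buffer.getvalue()

print(pairs)
```

Each tuple becomes one CSV row, which is the (ip, port) layout the question asks for.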


Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/70872503
