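I have a collection of data and a variable holding the indexes of some of its elements. A filtering operation is applied to the data and removes a subset of it. I want to shift the indexes so that they refer to the updated collection, dropping any index that pointed to a deleted element.
I use the implementation in the function below, and I have also posted the code I used to verify that it works. Is there a fast way to do this index adjustment with the core library, or a better approach?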
import random

def align_index(wanted_idx, mask):
    """
    Function to align a set of indexes to a collection after deletions,
    indicated with a mask
    Arguments:
        wanted_idx: List of desired integer indexes prior to deletion
        mask: Binary mask, where 1's indicate elements that survive deletion
    Returns:
        List of integer indexes to (surviving) desired elements, post-deletion
    """
    # rebuild indexes: remove dangling
    new_idx = [idx for (i, idx) in enumerate(wanted_idx) if mask[idx]]
    # mark deleted
    not_mask = [int(not m) for m in mask]
    # cumsum deleted regions
    realigned_idx = [k-sum(not_mask[:k+1]) for k in new_idx]
    return realigned_idx
# data
data = [random.randint(0,500) for _ in range(1000)]
rng = list(range(len(data)))
for _ in range(1000):
    # random data deletion / request
    wanted_idx = random.sample(rng, random.randint(5,100))
    del_index = random.sample(rng, random.randint(5, 100))
    # apply deletion
    mask = [int(i not in del_index) for i in range(len(data))]
    filtered_data = [data[i] for (i, m) in enumerate(mask) if m]
    realigned_index = align_index(wanted_idx, mask)
    # verify
    new_idx = [idx for (i, idx) in enumerate(wanted_idx) if mask[idx]]
    l1 = [data[k] for k in new_idx]
    l2 = [filtered_data[k] for k in realigned_index]
    assert l1 == l2

Posted on 2020-11-09 06:07:42
If you use NumPy, this is very simple:
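import numpy as np
mask = np.array(mask, dtype=bool)
# count survivors up to each position, shifted to 0-based indexes
new_idx = np.cumsum(mask, dtype=np.int64) - 1
# mark deleted positions with -1
new_idx[~mask] = -1

You do not need to recompute new_idx unless more elements are deleted.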
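
To make the lookup table concrete, here is a minimal sketch on a hypothetical six-element mask (the mask values are made up for illustration). It assumes the table is the running count of surviving elements minus one, with deleted positions marked as -1.

```python
import numpy as np

# Hypothetical mask for illustration: True marks elements that survive.
mask = np.array([1, 0, 1, 1, 0, 1], dtype=bool)

# Running count of survivors, shifted to 0-based positions.
new_idx = np.cumsum(mask, dtype=np.int64) - 1
# Deleted positions get the sentinel -1.
new_idx[~mask] = -1

print(new_idx.tolist())  # [0, -1, 1, 2, -1, 3]
```

Old positions 1 and 4 were deleted, so they map to -1; every surviving position maps to its place in the filtered collection.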
After that, looking up new_idx[i] gives the remapped index for old index i. Or remap a whole array at once:
wanted_idx = np.array(wanted_idx, dtype=np.int64)
remapped_idx = new_idx[wanted_idx]

Note that deleted indexes are assigned -1. If needed, you can filter them out:
remapped_idx = remapped_idx[remapped_idx >= 0]

https://stackoverflow.com/questions/64582337
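
If NumPy is not an option, the same prefix-count idea can be sketched with the standard library alone. The function name `align_index_linear` is hypothetical, and the sketch assumes `mask` is a sequence of 0/1 values or booleans. It swaps the repeated `sum(...)` over slices used in the question for a single `itertools.accumulate` pass, so the cost is linear in the length of the mask.

```python
from itertools import accumulate


def align_index_linear(wanted_idx, mask):
    """Hypothetical stdlib-only variant of align_index (one pass over mask)."""
    # kept_upto[i] is the number of surviving elements in mask[:i + 1].
    kept_upto = list(accumulate(int(bool(m)) for m in mask))
    # A surviving element at old position i lands at kept_upto[i] - 1;
    # indexes that point at deleted elements are dropped.
    return [kept_upto[i] - 1 for i in wanted_idx if mask[i]]


mask = [1, 0, 1, 1, 0, 1]
wanted_idx = [5, 1, 2]
print(align_index_linear(wanted_idx, mask))  # [3, 1]
```

As in the question's function, the order of `wanted_idx` is preserved and dangling indexes are removed rather than marked with -1.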