
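Filtering Pandas data by the number of user-item interactions

Stack Overflow user

Asked 2022-01-28 12:57:24

2 answers, 224 views, 0 followers, 1 vote

I have a Pandas DataFrame with the columns user_id, item_id, timestamp. Assume each user_id - item_id interaction occurs only once, i.e. there is only one timestamp for a particular interaction: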

     user_id  item_id   timestamp
  0      1       2         123
  1      1       3         145
  2      4       6         123
  3      5       7         198

Given a parameter threshold, I filter out the user_ids that occur <= threshold times:

data = data.groupby("user_id").filter(lambda x: len(x) > threshold)

and get (for threshold = 1):

     user_id  item_id   timestamp
  0      1       2         123
  1      1       3         145

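because only user "1" has more than threshold item interactions.

Now suppose there can be multiple user_id - item_id interactions, i.e. a particular interaction may have several timestamps: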
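A self-contained version of this first case, as a minimal sketch with the sample rows typed in by hand (the variable name `filtered` is introduced here only for illustration):

```python
import pandas as pd

# Sample data from the table above: every user_id/item_id pair occurs once.
data = pd.DataFrame({
    "user_id":   [1, 1, 4, 5],
    "item_id":   [2, 3, 6, 7],
    "timestamp": [123, 145, 123, 198],
})
threshold = 1

# Keep only the users whose group has more than `threshold` rows.
filtered = data.groupby("user_id").filter(lambda x: len(x) > threshold)
print(filtered)
```

Here counting rows per user is the same as counting items per user only because no pair is repeated.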

     user_id  item_id   timestamp
  0      1       2         123
  1      1       2         145
  2      4       6         123
  3      4       7         198

What is the most elegant (and fastest) way to filter out the users that have <= threshold unique interactions? The desired output would then be:

     user_id  item_id   timestamp
  0      4       6         123
  1      4       7         198

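(because user "1" interacted with only 1 item, while user "4" remains because they interacted with 2 items).

One way I came up with (not that elegant, is it?):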

data_cold = data.groupby('user_id').agg({'item_id':lambda x: x.nunique()})
data_cold = data_cold.reset_index()
data_cold = data_cold[data_cold.item_id > threshold]
data = data[data['user_id'].isin(data_cold.user_id)]

python

python-3.x

pandas

numpy

2 Answers

Stack Overflow user

Accepted answer

Answered 2022-01-28 16:42:13

I would also consider using groupby on the user_id and item_id columns, but with a different aggregation:

import pandas as pd

#sample data
from io import StringIO
data1 = """
  user_id  item_id   timestamp
  0      1       2         123
  1      1       2         145
  2      4       6         123
  3      4       7         198
  4      4       7         172
  5      1       1         163
  6      1       1         172
  7      2       4         871
  8      4       3         883   """
df = pd.read_csv(StringIO(data1), sep=r"\s{2,}", engine="python")


threshold = 2

#strategy: count the rows of each user_id-item_id pair, then unstack to one row per user_id
#keep only rows with at least n non-NaN values (distinct items), where n = threshold + 1
filtered_ID = df.groupby(["user_id", "item_id"]).size().unstack().dropna(thresh=threshold+1).index

#then filter the original dataframe for the retrieved user_id's
print(df[df["user_id"].isin(filtered_ID)])

Sample output:

   user_id  item_id  timestamp
2        4        6        123
3        4        7        198
4        4        7        172
8        4        3        883
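The same selection can probably be written without the intermediate wide table by counting distinct items per user directly. The following is a sketch rather than part of the answer above: it assumes the sample rows are typed in by hand, and `n_items` and `result` are names introduced for illustration.

```python
import pandas as pd

# Same sample rows as in the answer, built directly instead of parsed from text.
df = pd.DataFrame({
    "user_id":   [1, 1, 4, 4, 4, 1, 1, 2, 4],
    "item_id":   [2, 2, 6, 7, 7, 1, 1, 4, 3],
    "timestamp": [123, 145, 123, 198, 172, 163, 172, 871, 883],
})
threshold = 2

# Number of distinct items per user, broadcast back onto every row of that user.
n_items = df.groupby("user_id")["item_id"].transform("nunique")

# Keep the rows of users with more than `threshold` distinct items.
result = df[n_items > threshold]
print(result)
```

Because `transform` returns a Series aligned with the original index, it can be used as a boolean mask without a separate `isin` step.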

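Votes: 1

Stack Overflow user

Answered 2022-01-28 13:34:29

If I understand correctly, you can group by both columns and then count the groups with size(). After that, unique interactions will have a count of 1, and you can filter on that if needed: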

data = data.groupby(["user_id", "item_id"]).size().reset_index(name = 'counts')
uniques = data[data['counts'] == 1]
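The per-pair counts alone do not yet say how many distinct items each user has. One possible way to get from there to the filter the question asks for is sketched below. The names `pairs`, `items_per_user`, `keep`, and `result` are introduced here for illustration, and the original `data` is kept rather than overwritten.

```python
import pandas as pd

# Second sample table from the question: user 1 repeats the same item.
data = pd.DataFrame({
    "user_id":   [1, 1, 4, 4],
    "item_id":   [2, 2, 6, 7],
    "timestamp": [123, 145, 123, 198],
})
threshold = 1

# One row per distinct (user_id, item_id) pair, with how often it occurs.
pairs = data.groupby(["user_id", "item_id"]).size().reset_index(name="counts")

# Distinct items per user = number of pair rows belonging to that user.
items_per_user = pairs.groupby("user_id").size()
keep = items_per_user[items_per_user > threshold].index

# Filter the original rows down to the users that pass the threshold.
result = data[data["user_id"].isin(keep)]
print(result)
```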

Votes: 0

Original page content provided by Stack Overflow. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/70894132
