Q: Pandas DataFrame: duplicates based on a column and a time range

Stack Overflow user

Asked 2017-06-27 09:45:00

2 answers, 2.3K views, 0 followers, 6 votes

I have a (very simplified) pandas DataFrame that looks like this:

df

    datetime             user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
2  2012-11-21 17:00:08   u3     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

What I want to do now is get all duplicate messages whose timestamps are within 3 seconds of each other. The expected output would be:

   datetime              user   type   msg
0  2012-11-11 15:41:08   u1     txt    hello world
1  2012-11-11 15:41:11   u2     txt    hello world
3  2012-11-22 18:08:35   u4     txt      hello you
4  2012-11-22 18:08:37   u5     txt      hello you

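The third row is excluded because, although its text is the same as in the first and second rows, its timestamp is not within the 3-second range.

I tried passing the columns datetime and msg as the subset for the duplicated() method, but it returns an empty DataFrame because the timestamps are not identical: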

mask = df.duplicated(subset=['datetime', 'msg'], keep=False)

print(df[mask])
Empty DataFrame
Columns: [datetime, user, type, msg, MD5]
Index: []

Is there a way to define a range for my "datetime" parameter? To illustrate, something like:

mask = df.duplicated(subset=['datetime_between_3_seconds', 'msg'], keep=False)

Any help here would be greatly appreciated.
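
For anyone who wants to reproduce the problem, the sample frame above can be rebuilt roughly as follows. This is only a sketch: it assumes the `datetime` column is parsed with `pd.to_datetime` (the question does not show the dtype), and it leaves out the `MD5` column that appears in the printed output but not in the sample data.

```python
import pandas as pd

# Sample data copied from the question; dtype of "datetime" is an assumption.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})

# duplicated() only matches exact values, so no pair of rows
# shares both the same timestamp and the same message.
mask = df.duplicated(subset=["datetime", "msg"], keep=False)
print(df[mask])
```

Because `duplicated()` tests for exact equality on every column in `subset`, there is no way to express a tolerance through it directly, which is why the mask is all `False` here.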

duplicates

conditional-statements

python

pandas

datetime

2 Answers

Stack Overflow user

Posted 2017-06-27 11:28:30

This code gives the expected output.

df[(df.groupby("msg")["datetime"].diff().fillna(pd.Timedelta(0)).dt.total_seconds() <= 3).reset_index(drop=True)]

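I grouped the DataFrame by the "msg" column, selected the "datetime" column of each group, and applied the built-in diff function. diff computes the difference between consecutive values in that column. The resulting NaT values (the first row of each group) are filled with zero, and only the rows whose difference is at most 3 seconds are selected.

Before using the code above, make sure your data is sorted by datetime in ascending order.

6 votes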
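
One consequence of filling NaT with zero is that the first row of every group is always kept, even when a message occurs only once. A possible variation, sketched here under the assumption that a row should count as a duplicate only if a neighbouring row with the same `msg` lies within 3 seconds before or after it, checks the gap in both directions. The names `window`, `gap_prev`, and `gap_next` are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})

window = pd.Timedelta(seconds=3)

# Sort by time so each row is compared with its nearest neighbours.
df_sorted = df.sort_values("datetime")
grouped = df_sorted.groupby("msg")["datetime"]

# Time elapsed since the previous row with the same msg (NaT for the first).
gap_prev = grouped.diff()
# Time remaining until the next row with the same msg (NaT for the last).
gap_next = grouped.shift(-1) - df_sorted["datetime"]

# Comparisons against NaT are False, so no fillna is needed.
mask = (gap_prev <= window) | (gap_next <= window)
result = df_sorted[mask].sort_index()
print(result)
```

On the sample data this keeps rows 0, 1, 3, and 4, and a message that appears only once would be dropped rather than kept.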

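Stack Overflow user

Posted 2017-06-27 11:01:26

This code works for your example data, although you may need to handle some edge cases.

From your question, I am assuming you want to filter messages relative to the first time they appear in df. It will not work if there are cases where you want to keep the string when it appears again after another threshold.

In short, I wrote a function that takes your DataFrame and the 'msg' to filter on. It takes the timestamp of the message's first occurrence and compares it with the timestamps of all its other occurrences.

It then selects only the instances that occur within 3 seconds of the first occurrence.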

    import pandas as pd

    # Return the rows for `msg` that fall within three seconds of its first occurrence.
    def get_info_within_3seconds(df, msg):
        df_of_msg = df[df['msg'] == msg].sort_values(by='datetime')
        t1 = df_of_msg['datetime'].iloc[0]  # timestamp of the first occurrence
        datetime_deltas = (df_of_msg['datetime'] - t1).dt.total_seconds()
        return df_of_msg[datetime_deltas <= 3.0]

    msgs = df['msg'].unique()
    # Apply the function to each unique message, then combine the results into a new DataFrame.
    new_df = pd.concat([get_info_within_3seconds(df, i) for i in msgs])
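
The same "within 3 seconds of the first occurrence" rule can probably also be expressed without a per-message loop. The following is a sketch, not the answer author's code: it uses `groupby(...).transform("min")` to broadcast each message's earliest timestamp back onto its rows, and the names `first_seen` and `within` are illustrative.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2012-11-11 15:41:08",
        "2012-11-11 15:41:11",
        "2012-11-21 17:00:08",
        "2012-11-22 18:08:35",
        "2012-11-22 18:08:37",
    ]),
    "user": ["u1", "u2", "u3", "u4", "u5"],
    "type": ["txt"] * 5,
    "msg": ["hello world", "hello world", "hello world",
            "hello you", "hello you"],
})

# Earliest timestamp per message, aligned to every row of that message.
first_seen = df.groupby("msg")["datetime"].transform("min")

# Keep rows at most 3 seconds after the first occurrence of their message.
within = (df["datetime"] - first_seen) <= pd.Timedelta(seconds=3)
new_df = df[within]
print(new_df)
```

As with the function above, a message that occurs only once is still kept, because its distance from its own first occurrence is zero.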

1 vote

Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/44777114
