我想根据df2中的时间间隔选择df1中的数据:
df1:
ID Start End col1 col2
23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a
23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b
23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a
33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a
...df2:
ID Timestap col3 col4
23468 2011-01-03 01:01:03 3 aa
23468 2011-01-03 01:02:00 4 bb
23468 2011-01-03 12:01:03 7 aa
33525 2011-02-03 02:31:03 10 aa
33525 2011-02-04 12:01:03 20 aa
33525 2011-02-05 14:00:01 30 aa
...如果df2中的ID与df1中的Start和End匹配,则需要过滤df1,然后计算组的col3平均值,并在df1中创建一个名为Average的新列,期望输出:
ID Start End col1 col2 Average
23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a 3.5
23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b nan
23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a nan
...
33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a 20
...我尝试过使用for-循环,但由于数据文件太大,合并两个数据流将花费更长的时间,所以需要很长时间,我想知道apply()与lambda表达式是否能解决这个问题?如何引用来自另一个df的时间间隔?
更新:如果我想根据df2中的时间间隔过滤df1,那么就找出第一个col3和最后一个col3之间的区别(即。6-3 =3,除以第一个和最后一个Timestamp(即.2011-01-03 01:03:03 :03-01:03= 120秒)所以期望值是3/120=0.025。
df1:
ID Start End col1 col2
23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a (*)
23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b
23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a
33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a
...df2:
ID Timestap col3 col4
23468 2011-01-03 01:01:03 3 aa first row in time interval (*)
23468 2011-01-03 01:02:00 4 bb
23468 2011-01-03 01:03:03 6 aa last row in time interval (*)
23468 2011-01-03 12:01:03 7 aa
33525 2011-02-03 02:31:03 10 aa
33525 2011-02-04 12:01:03 20 aa
33525 2011-02-05 14:00:01 30 aa因此,预期产出:
ID Start End col1 col2 Average
23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a 0.025 (3/120=0.025)
23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b nan
23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a nan
...
33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a 0.000067032 (20/298364 =0.000067032)
...发布于 2022-08-22 14:36:18
您可以使用列表生成器,它比循环快几倍。在第2行,如果我没有弄错的话,应该是20。左边的df1.iloc是行索引,右边是列号。在df2中,使用显式索引:左边是作为索引的布尔掩码(真或假),右边是列的名称。
import pandas as pd
df1[['Start', 'End']] = df1[['Start', 'End']].apply(pd.to_datetime)
df2['Timestap'] = pd.to_datetime(df2['Timestap'])
aaa = [df2.loc[(df2['Timestap'] >= df1.iloc[i, 1])
& (df2['Timestap'] <= df1.iloc[i, 2]), 'col3'].mean() for i in range(len(df1))]
df1['Average'] = aaa
print(df1)输出
ID Start End col1 col2 Average
0 23468 2011-01-03 01:01:03 2011-01-03 01:04:05 10 a 3.5
1 23468 2011-01-15 08:20:00 2011-01-18 01:01:01 50 b NaN
2 23468 2011-02-03 01:07:20 2011-02-08 12:00:03 150 a 20.0
3 33525 2011-02-03 01:07:19 2011-02-06 12:00:03 10 a 20.0更新26.08.2022.我们将匹配行的索引从df2写入df1 1‘123’列。
在“ind”中,复制每行中大于或等于两个索引的索引。
“‘qqq”获取每一行的最后一个元素和第一个元素之间的差异。
差几秒钟。
“‘tq”我们得到了所需的内容,并将其存储在df1'test’列中。
import pandas as pd
import numpy as np
df1[['Start', 'End']] = df1[['Start', 'End']].apply(pd.to_datetime)
df2['Timestap'] = pd.to_datetime(df2['Timestap'])
df1['123'] = [df2.loc[(df2['Timestap'] >= df1.iloc[i, 1])
& (df2['Timestap'] <= df1.iloc[i, 2]), 'col3'].values for i in range(len(df1))]
ind = df1[df1['123'].str.len().ge(2)].index
qqq = df1.loc[ind, '123'].str[-1] - df1.loc[ind, '123'].str[0]
ttt = (df1.loc[ind, 'End'] - df1.loc[ind, 'Start']) / np.timedelta64(1, 's')
tq = qqq / ttt
df1['test'] = tq
print(df1)输出
ID Start End ... col2 123 test
0 23468 2011-01-03 01:01:03 2011-01-03 01:04:05 ... a [3, 4, 6] 0.016484
1 23468 2011-01-15 08:20:00 2011-01-18 01:01:01 ... b [] NaN
2 23468 2011-02-03 01:07:20 2011-02-08 12:00:03 ... a [10, 20, 30] 0.000042
3 33525 2011-02-03 01:07:19 2011-02-06 12:00:03 ... a [10, 20, 30] 0.000067https://stackoverflow.com/questions/73440765
复制相似问题