Here is a small dataset:
import pandas as pd

df = pd.DataFrame({
    'KEY': [100, 100, 100, 100, 100, 100, 101, 101, 101, 101, 101, 101,
            101, 101, 101, 101, 101, 101, 102, 102],
    'RPTDATE': ['2002-10-23', '2002-10-23', '2002-10-30', '2002-10-30', '2002-11-6', '2002-11-6',
                '2005-12-7', '2005-12-7', '2005-12-7', '2005-12-21', '2005-12-21', '2005-12-21',
                '2005-12-21', '2005-12-21', '2005-12-21', '2006-1-4', '2006-1-4', '2006-1-4',
                '2015-7-8', '2015-7-8'],
    'ENDDEPTH': [3248, 3248, 10998, 10998, 12545, 12545, 12836, 12836, 12836, 18744, 18744, 18744,
                 18744, 18744, 18744, 20255, 20255, 20255, 6250, 6250],
    'CODE': ['GR', 'RES', 'GR', 'RES', 'GR', 'RES', 'GR', 'RES', 'PWD', 'GR', 'RES', 'PWD',
             'DEN', 'NEUT', 'SON', 'GR', 'RES', 'PWD', 'APWD', 'ARC'],
})

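I initially needed to create a new column called STDEPTH and fill it with the ENDDEPTH of the previous row when the KEY is the same. If the KEY is not the same, STDEPTH should be 0. With help from someone on SO, I came up with this script:
df['STDEPTH'] = df.groupby(['KEY','CODE'])['ENDDEPTH'].shift(fill_value=0)
This produces a dataframe that is almost what I need; however, the three 0s in rows 12-14 need to be the same as the values in rows 9-11. If I remove 'CODE' from the script, I get this result: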

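If I change 'CODE' to 'RPTDATE', STDEPTH is filled with 0 whenever the date changes.
What do I need to do so that STDEPTH always follows the ENDDEPTH of the previous row, regardless of the CODE?
Thanks!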
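The three zeros can be traced to the group sizes. The sketch below is a minimal illustration that uses only the KEY 101 rows from the dataset above; the names `sub` and `group_sizes` are hypothetical and not part of the original script. It shows that DEN, NEUT and SON each occur once for that KEY, so `shift()` has no earlier row in their group and falls back to `fill_value`.

```python
import pandas as pd

# Rows for KEY 101 only, copied from the dataset in the question
sub = pd.DataFrame({
    'KEY': [101] * 12,
    'RPTDATE': ['2005-12-7'] * 3 + ['2005-12-21'] * 6 + ['2006-1-4'] * 3,
    'ENDDEPTH': [12836] * 3 + [18744] * 6 + [20255] * 3,
    'CODE': ['GR', 'RES', 'PWD', 'GR', 'RES', 'PWD',
             'DEN', 'NEUT', 'SON', 'GR', 'RES', 'PWD'],
})

# Number of rows in each (KEY, CODE) group
group_sizes = sub.groupby(['KEY', 'CODE']).size()

# shift() looks one row back within the same group, so a CODE that
# appears only once has no previous row and receives fill_value
sub['STDEPTH'] = sub.groupby(['KEY', 'CODE'])['ENDDEPTH'].shift(fill_value=0)

print(group_sizes)
print(sub)
```

Any CODE that first appears partway through a KEY will show the same behavior, which is why grouping on CODE cannot give the desired column by itself.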
Posted on 2021-02-13 22:47:51
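Starting from the original df as provided, perform the same first step. Then create a dataframe of the unique "KEY, RPTDATE, STDEPTH, ENDDEPTH" values where STDEPTH is not 0, and use it to fill in the rows where STDEPTH is 0: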
import numpy as np  # needed later

df.insert(2, 'STDEPTH', 0)  # place the new column in a useful place
# the first step as you originally did:
df['STDEPTH'] = df.groupby(['KEY','CODE'])['ENDDEPTH'].shift(fill_value=0)
# create the dataframe which has the needed UNIQUE values
match_start = df[df['STDEPTH'] != 0][['KEY', 'RPTDATE', 'STDEPTH', 'ENDDEPTH']].drop_duplicates()

match_start is:
    KEY     RPTDATE  STDEPTH  ENDDEPTH
2   100  2002-10-30     3248     10998
4   100   2002-11-6    10998     12545
9   101  2005-12-21    12836     18744
15  101    2006-1-4    18744     20255

Join with the unique non-missing values found above:
df = df.join(match_start.set_index(['KEY', 'RPTDATE', 'ENDDEPTH']),
             on=['KEY', 'RPTDATE', 'ENDDEPTH'],
             rsuffix='_match')

df[7:16] looks like this:
    KEY     RPTDATE  STDEPTH  ENDDEPTH  CODE  STDEPTH_match
7   101   2005-12-7        0     12836   RES            NaN
8   101   2005-12-7        0     12836   PWD            NaN
9   101  2005-12-21    12836     18744    GR        12836.0
10  101  2005-12-21    12836     18744   RES        12836.0
11  101  2005-12-21    12836     18744   PWD        12836.0
12  101  2005-12-21        0     18744   DEN        12836.0
13  101  2005-12-21        0     18744  NEUT        12836.0
14  101  2005-12-21        0     18744   SON        12836.0
15  101    2006-1-4    18744     20255    GR        18744.0

Use np.where() to assign the "right" values to STDEPTH:
df['STDEPTH'] = np.where((df['STDEPTH'] == 0) & (df['STDEPTH_match'].notna()),
                         df['STDEPTH_match'],  # value if True
                         df['STDEPTH']         # value if False
                         ).astype(int)         # convert to ints
# remove the extra column
df.drop('STDEPTH_match', axis=1, inplace=True)

Resulting df:
    KEY     RPTDATE  STDEPTH  ENDDEPTH  CODE
0   100  2002-10-23        0      3248    GR
1   100  2002-10-23        0      3248   RES
2   100  2002-10-30     3248     10998    GR
3   100  2002-10-30     3248     10998   RES
4   100   2002-11-6    10998     12545    GR
5   100   2002-11-6    10998     12545   RES
6   101   2005-12-7        0     12836    GR
7   101   2005-12-7        0     12836   RES
8   101   2005-12-7        0     12836   PWD
9   101  2005-12-21    12836     18744    GR
10  101  2005-12-21    12836     18744   RES
11  101  2005-12-21    12836     18744   PWD
12  101  2005-12-21    12836     18744   DEN
13  101  2005-12-21    12836     18744  NEUT
14  101  2005-12-21    12836     18744   SON
15  101    2006-1-4    18744     20255    GR
16  101    2006-1-4    18744     20255   RES
17  101    2006-1-4    18744     20255   PWD
18  102    2015-7-8        0      6250  APWD
19  102    2015-7-8        0      6250   ARC

https://stackoverflow.com/questions/66163453
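For comparison, here is an alternative sketch that is not the method used in the answer above. It assumes that rows are already ordered by report date within each KEY and that every (KEY, RPTDATE) pair carries a single ENDDEPTH, as in the sample data. The names `runs` and `out` are hypothetical. The idea is to collapse the data to one row per report, shift ENDDEPTH within each KEY, and merge the result back onto every CODE row.

```python
import pandas as pd

# Same data as in the question, written compactly
df = pd.DataFrame({
    'KEY': [100] * 6 + [101] * 12 + [102] * 2,
    'RPTDATE': (['2002-10-23'] * 2 + ['2002-10-30'] * 2 + ['2002-11-6'] * 2
                + ['2005-12-7'] * 3 + ['2005-12-21'] * 6 + ['2006-1-4'] * 3
                + ['2015-7-8'] * 2),
    'ENDDEPTH': ([3248] * 2 + [10998] * 2 + [12545] * 2
                 + [12836] * 3 + [18744] * 6 + [20255] * 3 + [6250] * 2),
    'CODE': ['GR', 'RES', 'GR', 'RES', 'GR', 'RES', 'GR', 'RES', 'PWD',
             'GR', 'RES', 'PWD', 'DEN', 'NEUT', 'SON', 'GR', 'RES', 'PWD',
             'APWD', 'ARC'],
})

# One row per report: CODE is left out, so duplicates collapse
runs = df[['KEY', 'RPTDATE', 'ENDDEPTH']].drop_duplicates()

# Previous report's ENDDEPTH within the same KEY; 0 for the first report
runs['STDEPTH'] = runs.groupby('KEY')['ENDDEPTH'].shift(fill_value=0)

# Copy the per-report STDEPTH back onto every CODE row
out = df.merge(runs, on=['KEY', 'RPTDATE', 'ENDDEPTH'], how='left')

print(out)
```

Because CODE never enters the grouping, a CODE that appears for the first time partway through a KEY gets the same STDEPTH as the other rows of that report. If the rows were not already in date order, RPTDATE would need to be parsed and sorted first, since the dates here are plain strings.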