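My dataset contains bank information from 2009 to 2019, but some banks were merged/acquired or closed along the way, so I want to remove any bank that does not exist for the whole 2009 to 2019 period. For example, IDs 32 and 56 do not exist in 2019, so they should be removed. Here is what my data looks like: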
ID Assets Year
32 10 2009
45 5 2009
56 24 2009
78 9 2009
32 11 2010
45 6 2010
56 31 2010
78 14 2010
... ... ...
32 11 2018
45 13 2018
78 14 2018
45 13 2019
78 3 2019
From 2009 to 2019, only IDs 45 and 78 exist throughout, so all the others should be removed. Here is what the result should look like:
ID Assets Year
45 5 2009
78 9 2009
45 6 2010
78 14 2010
... ... ...
45 13 2018
78 14 2018
45 13 2019
78 3 2019
Posted on 2020-11-15 03:26:21
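Assuming you have a list of the IDs of closed banks:
closed = [56, 32]
df[~df['ID'].isin(closed)]
Conversely, assuming you have a list of the banks that still exist:
opened = [45, 78]
df[df['ID'].isin(opened)]
Edit based on the question clarification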
Data:
import pandas as pd
df = pd.DataFrame({'Assets': {0: 10, 1: 5, 2: 24, 3: 9, 4: 11, 5: 6, 6: 31, 7: 14, 8: 11, 9: 13, 10: 14, 11: 13, 12: 3},
'ID': {0: 32, 1: 45, 2: 56, 3: 78, 4: 32, 5: 45, 6: 56, 7: 78, 8: 32, 9: 45, 10: 78, 11: 45, 12: 78},
'Year': {0: 2009, 1: 2009, 2: 2009, 3: 2009, 4: 2010, 5: 2010, 6: 2010, 7: 2010, 8: 2018, 9: 2018, 10: 2018, 11: 2019, 12: 2019}})
If you need to get the list of IDs that are present in every year:
## number of unique years each id needs to have:
year_count = len(df['Year'].unique())
## get number of unique years that each id has:
id_year_count = df[['Year','ID']].groupby(['ID', 'Year']).count().reset_index().groupby('ID').count().reset_index()
## filter to get the list of ids that match the condition:
opened_every_year = id_year_count['ID'][id_year_count['Year']==year_count].tolist()
## pass the list to the df to filter as described before
df = df[df['ID'].isin(opened_every_year)]
Check:
df.sort_values(['ID', 'Year'])
Output:
Assets ID Year
1 5 45 2009
5 6 45 2010
9 13 45 2018
11 13 45 2019
3 9 78 2009
7 14 78 2010
10 14 78 2018
12 3 78 2019
(Note that the data sample covers only 4 years, so it selects the IDs that are present in all 4 of them.)
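The same "present in every year" filter can probably be written more compactly. The following is only a sketch of an alternative, not the answer's original method: it uses `groupby(...).transform('nunique')` to attach each ID's distinct-year count to every row. The names `n_years`, `years_per_id` and `survivors` are made up for illustration, and the data is the same 4-year sample used above.

```python
import pandas as pd

# Same sample data as in the answer above (only 4 distinct years).
df = pd.DataFrame({
    'ID':     [32, 45, 56, 78, 32, 45, 56, 78, 32, 45, 78, 45, 78],
    'Assets': [10, 5, 24, 9, 11, 6, 31, 14, 11, 13, 14, 13, 3],
    'Year':   [2009, 2009, 2009, 2009, 2010, 2010, 2010, 2010,
               2018, 2018, 2018, 2019, 2019],
})

# Number of distinct years that appear anywhere in the data.
n_years = df['Year'].nunique()

# For every row, the number of distinct years in which that row's ID appears.
years_per_id = df.groupby('ID')['Year'].transform('nunique')

# Keep only the rows whose ID appears in every year (hypothetical name).
survivors = df[years_per_id == n_years]

print(survivors.sort_values(['ID', 'Year']))
```

Like the original approach, this assumes that every year of interest appears at least once in the data. If a whole year could be missing from the file, comparing each ID's years against an explicit list of required years would be the safer check.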
https://stackoverflow.com/questions/64840519