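I have a DataFrame named df with the following values:
Date      Available  Used  Total  Free
06072019          5    19     24     5
06202019         14    10     24     6
07072019          6    16     24     6
07202019         20     4     24    20
08072019         23     1     24    23
I am missing the date 08202019 and would like to impute the missing values with the mean of the existing data I have.
This is what I am currently doing: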
import numpy as np
import datetime as dt
df.groupby([df.index.date]).transform(lambda x: x.fill(x.mean()))
However, I know that some of the syntax here is incorrect, and I would like some suggestions.
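For context on the attempt above: a pandas Series has no fill method (the closest are fillna, ffill and bfill), and fillna only replaces NaN values in rows that already exist. Because 08202019 has no row at all, the row has to be created before anything can be filled. A minimal sketch, assuming the dates have been parsed into a DatetimeIndex and using only two of the columns as illustrative data:

```python
import pandas as pd

# Illustrative data only: two columns taken from the table in the question
df = pd.DataFrame(
    {"Available": [5, 14, 6, 20, 23], "Used": [19, 10, 16, 4, 1]},
    index=pd.to_datetime(
        ["06072019", "06202019", "07072019", "07202019", "08072019"],
        format="%m%d%Y",
    ),
)

# fillna alone changes nothing here, because no existing row holds a NaN
unchanged = df.fillna(df.mean())

# Assigning to a new label creates the row and fills it with the column means
df.loc[pd.Timestamp("2019-08-20")] = df.mean()
print(df)
```

The columns become float64 after the assignment, since the means are not integers.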
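Answered 2020-10-01 08:07:25
Use a list comprehension to get the mean of every column except the first (the Date column), and build a transposed DataFrame from it, which is effectively a single row. Then concatenate this new one-row DataFrame with the main DataFrame.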
import pandas as pd

df = pd.DataFrame({'Date': {0: '06-07-2019',
                            1: '06-20-2019',
                            2: '07-07-2019',
                            3: '07-20-2019',
                            4: '08-07-2019'},
                   'Available': {0: 5, 1: 14, 2: 6, 3: 20, 4: 23},
                   'Used': {0: 19, 1: 10, 2: 16, 3: 4, 4: 1},
                   'Total': {0: 24, 1: 24, 2: 24, 3: 24, 4: 24},
                   'Free': {0: 5, 1: 6, 2: 6, 3: 20, 4: 23}})
# One-row frame holding the mean of every column except 'Date'
s = pd.DataFrame([df[col].mean() for col in df.columns[1:]]).T
s.columns = df.columns[1:]
s['Date'] = '08-20-2019'
# Append the mean row to the original data
df_new = pd.concat([df, s])
df_new
Out[1]:
         Date  Available  Used  Total  Free
0  06-07-2019        5.0  19.0   24.0   5.0
1  06-20-2019       14.0  10.0   24.0   6.0
2  07-07-2019        6.0  16.0   24.0   6.0
3  07-20-2019       20.0   4.0   24.0  20.0
4  08-07-2019       23.0   1.0   24.0  23.0
0  08-20-2019       13.6  10.0   24.0  12.0
Regarding your comment: you can create a missing_dates list, and the mean row is then repeated once for every date in that list:
import pandas as pd

df = pd.DataFrame({'Date': {0: '06-07-2019',
                            1: '06-20-2019',
                            2: '07-07-2019',
                            3: '07-20-2019',
                            4: '08-07-2019'},
                   'Available': {0: 5, 1: 14, 2: 6, 3: 20, 4: 23},
                   'Used': {0: 19, 1: 10, 2: 16, 3: 4, 4: 1},
                   'Total': {0: 24, 1: 24, 2: 24, 3: 24, 4: 24},
                   'Free': {0: 5, 1: 6, 2: 6, 3: 20, 4: 23}})
# One-row frame holding the mean of every column except 'Date'
s = pd.DataFrame([df[col].mean() for col in df.columns[1:]]).T
s.columns = df.columns[1:]
missing_dates = ['08-20-2019', '08-30-2019']
# Repeat the mean row once per missing date.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.)
t = pd.concat([s] * len(missing_dates))
t['Date'] = missing_dates
df_new = pd.concat([df, t])
df_new
Out[2]:
         Date  Available  Used  Total  Free
0  06-07-2019        5.0  19.0   24.0   5.0
1  06-20-2019       14.0  10.0   24.0   6.0
2  07-07-2019        6.0  16.0   24.0   6.0
3  07-20-2019       20.0   4.0   24.0  20.0
4  08-07-2019       23.0   1.0   24.0  23.0
0  08-20-2019       13.6  10.0   24.0  12.0
0  08-30-2019       13.6  10.0   24.0  12.0
Answered 2020-10-01 16:21:55
Consider your DataFrame:
import pandas as pd
import numpy as np
df = pd.read_clipboard()
df["Date"] = pd.to_datetime(df["Date"], format="%m%d%Y")
df = df.set_index("Date")
print(df)
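            Available  Used  Total  Free
Date
2019-06-07          5    19     24     5
2019-06-20         14    10     24     6
2019-07-07          6    16     24     6
2019-07-20         20     4     24    20
2019-08-07         23     1     24    23
You can create a list of the missing dates, convert it to a pandas DatetimeIndex, build a new index as the union of the existing and missing dates, reindex on it, and then fill the resulting NaN values with the column means.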
missing_dates = ["2019-08-20", "2019-09-07", "2019-09-20"]
missing_dates = pd.to_datetime(missing_dates)
new_index = df.index.union(missing_dates)
df = df.reindex(new_index).fillna(df.mean(numeric_only=True))
print(df)
            Available  Used  Total  Free
2019-06-07        5.0  19.0   24.0   5.0
2019-06-20       14.0  10.0   24.0   6.0
2019-07-07        6.0  16.0   24.0   6.0
2019-07-20       20.0   4.0   24.0  20.0
2019-08-07       23.0   1.0   24.0  23.0
2019-08-20       13.6  10.0   24.0  12.0
2019-09-07       13.6  10.0   24.0  12.0
2019-09-20       13.6  10.0   24.0  12.0
Answered 2020-10-01 08:24:06
If you want to find the dates that are not present in the data, rather than listing them by hand, you can try the following:
import pandas as pd

# Suppose the data series runs from '2020-09-30' to '2020-10-04' and some dates may be missing.
df = pd.DataFrame.from_dict(
    dict(datetime=['09302020', '10012020', '10042020'],
         val1=[1, 3, 5],
         val2=[6, 10, 12]))
df.datetime = pd.to_datetime(df.datetime, format='%m%d%Y')
print(df)
# Days in the full range that do not appear in the data
dates_missing = pd.date_range(start='2020-09-30', end='2020-10-04').difference(df.datetime)
# Mean of every value column
val_means = {col: df[col].mean() for col in df.columns if col != 'datetime'}
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead
df = pd.concat([df, pd.DataFrame.from_dict(
    dict(datetime=dates_missing, **val_means))])
df = df.sort_values(by=['datetime'])
df

https://stackoverflow.com/questions/64147304
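The date-range check and the reindex approach from the answers above can be combined into one step. The following is only a sketch under stated assumptions: the helper name fill_absent_dates is hypothetical, the calendar is assumed to be daily, and the date column is assumed to be already parsed with pd.to_datetime.

```python
import pandas as pd


def fill_absent_dates(df, date_col, start, end):
    """Hypothetical helper: add one row per calendar day in [start, end]
    that is absent from df, filled with the means of the observed rows."""
    indexed = df.set_index(date_col)
    full_range = pd.date_range(start=start, end=end)
    means = indexed.mean(numeric_only=True)
    # reindex inserts all-NaN rows for the absent days; fillna writes the means
    filled = indexed.reindex(full_range).fillna(means)
    return filled.rename_axis(date_col).reset_index()


df = pd.DataFrame({
    "datetime": pd.to_datetime(["09302020", "10012020", "10042020"], format="%m%d%Y"),
    "val1": [1, 3, 5],
    "val2": [6, 10, 12],
})
result = fill_absent_dates(df, "datetime", "2020-09-30", "2020-10-04")
print(result)
```

Two caveats with this sketch: fillna also overwrites any NaN that was already present in an observed row, and reindex drops observed rows that fall outside the requested range.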