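Suppose I have a dataframe with 3 columns: dt, unit, sold. What I would like to know is how to create a new column called, say, prior_3_avg, which is, as the name suggests, the average of sold for that unit over the previous three dates that fall on the same day of the week as dt. For example, for unit "1" on May 5th 2020, what was its average sales on April 28th, 21st, and 14th, which are the previous three Tuesdays?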
Toy sample data:
import pandas as pd
df = pd.DataFrame({'dt':['2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28','2020-5-1','2020-5-2','2020-5-3','2020-5-4','2020-5-5','2020-5-6','2020-5-7','2020-5-8','2020-5-9','2020-5-10','2020-5-11','2020-5-12','2020-5-13','2020-5-14','2020-5-15','2020-5-16','2020-5-17','2020-5-18','2020-5-19','2020-5-20','2020-5-21','2020-5-22','2020-5-23','2020-5-24','2020-5-25','2020-5-26','2020-5-27','2020-5-28',],'unit':[1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2],'sold':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28]})
df['dt'] = pd.to_datetime(df['dt'])
          dt  unit  sold
0 2020-05-01     1     1
1 2020-05-02     1     2
2 2020-05-03     1     3
3 2020-05-04     1     4
4 2020-05-05     1     5
5 2020-05-06     1     6
...
How can I do this? I have seen: Pandas new column from groupby averages
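That explains how to group by a column only. I figure I could make a "day of the week" column, but I would still have the same problem: I want to limit the calculation to the past 3 rows that match the day of the week, rather than use all matching rows.
This could possibly be related, but it looks better suited to a one-off analysis than to creating a new column: limit amount of rows as result of groupby Pandas.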
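To pin down which dates the example refers to, here is a small sketch (not part of the original post; the variable names are only illustrative). It assumes the "previous three same-weekday dates" are simply 1, 2 and 3 weeks before the target date.

```python
import pandas as pd

# The date used in the example in the question.
target = pd.Timestamp("2020-05-05")

# The three previous dates on the same weekday are 1, 2 and 3 weeks back.
prior_dates = [target - pd.Timedelta(weeks=k) for k in range(1, 4)]
prior_labels = [d.strftime("%Y-%m-%d") for d in prior_dates]

print(target.day_name())
print(prior_labels)
```

prior_3_avg for that row would then be the mean of sold over those three dates for the same unit.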
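Answered 2022-05-05 19:40:35
This should work:
df['dayofweek'] = df['dt'].dt.dayofweek
# For each row: same unit, same weekday, strictly earlier date; average the last 3 matches (fewer if fewer exist, NaN if none)
df['output'] = df.apply(lambda x: df['sold'][(df['unit'] == x['unit']) & (df['dt'] < x['dt']) & (df['dayofweek'] == x['dayofweek'])].tail(3).mean(), axis=1)
Answered 2022-05-05 19:14:57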
First, create a new column with the day of the week:
import pandas as pd
date = pd.date_range('2018-12-30', '2019-01-07',
freq='D').to_series()
date.dt.dayofweek
This gives the number of each day of the week (Monday is 0); after that, just filter by month and sort the values.
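As a quick check of how that numbering lines up with the dates in the question, here is a minimal sketch (the `overview` frame and its column names are only illustrative and not from the original answer):

```python
import pandas as pd

# First week of the toy data from the question.
dates = pd.Series(pd.date_range("2020-05-01", "2020-05-07", freq="D"))

overview = pd.DataFrame({
    "dt": dates,
    "dayofweek": dates.dt.dayofweek,  # Monday=0 ... Sunday=6
    "day_name": dates.dt.day_name(),  # readable weekday name
})
print(overview)
```

Either column can serve as the grouping key for "same day of the week"; the numeric one sorts in calendar order, while the name is easier to read.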
Answered 2022-05-05 20:10:51
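Here is one idea: first group by unit, then group each unit by weekday and take the rolling average over n weeks (with closed='left', the last n weeks excluding the current one are used for the calculation, which seems to be what you want).
n = 3
result = (df.groupby('unit')
            .apply(lambda f: (f['sold']
                              .groupby(f['dt'].dt.day_name())  # one sub-group per weekday
                              .rolling(n, closed='left')       # n previous rows, current row excluded
                              .mean()
                              )
                   )
          )
...which results in this Series: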
unit  dt
1     Friday  0      NaN
              7      NaN
              14     NaN
              21     8.0
      Monday  3      NaN
              10     NaN
              17     NaN
              24    11.0
...
2     Friday  28     NaN
              35     NaN
              42     NaN
              49     8.0
      Monday  31     NaN
              38     NaN
              45     NaN
              52    11.0
...
Name: sold, dtype: float64
Next, drop the unit and dt index levels, since we don't need them. Also, rename the Series so that it is easier to use with join.
result = result.reset_index(level=[0, 1], drop=True)
result = result.rename('prior_3_avg')
Back to the mothership...
df2 = df.join(result)
Part of the final result, df2:
           dt  unit  sold  prior_3_avg
...  # first 21 are NaN
21 2020-05-22     1    22          8.0
22 2020-05-23     1    23          9.0
23 2020-05-24     1    24         10.0
24 2020-05-25     1    25         11.0
25 2020-05-26     1    26         12.0
26 2020-05-27     1    27         13.0
27 2020-05-28     1    28         14.0
https://stackoverflow.com/questions/72132377
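For comparison, here is a more compact variant of the same idea, as a sketch rather than the answer's own method. It swaps the nested groupby plus join for a single groupby on unit and weekday with transform, and uses shift() instead of closed='left' to leave the current row out. The toy data is rebuilt in a shorter form, and the `weekday` key is only an illustrative name.

```python
import pandas as pd

# Rebuild the toy data from the question: 28 days for each of two units.
dates = pd.date_range("2020-05-01", "2020-05-28", freq="D")
df = pd.DataFrame({
    "dt": list(dates) * 2,
    "unit": [1] * 28 + [2] * 28,
    "sold": list(range(1, 29)) * 2,
})

n = 3
df = df.sort_values(["unit", "dt"])  # rolling relies on chronological order

# Group by unit and weekday; shift() moves each value down one row within
# its group, so rolling(n).mean() averages the n previous same-weekday rows.
weekday = df["dt"].dt.dayofweek.rename("weekday")
df["prior_3_avg"] = (
    df.groupby(["unit", weekday])["sold"]
      .transform(lambda s: s.shift().rolling(n).mean())
)

print(df[df["unit"] == 1].tail(7))
```

Because transform returns a result aligned to the original index, no reset_index or join step is needed. Rows with fewer than n earlier same-weekday observations stay NaN, as in the output above.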