I have a large dataset that records each person's employment status every month for one year (>6,000,000 observations from >500,000 people). It looks like this:
ID <- c(1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3, 1:3)
Month <- c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb", "Mar", "Mar", "Mar", "Apr", "Apr", "Apr",
"May", "May", "May", "Jun", "Jun", "Jun", "Jul", "Jul", "Jul", "Aug", "Aug", "Aug",
"Sep", "Sep", "Sep", "Oct", "Oct", "Oct", "Nov", "Nov", "Nov", "Dec", "Dec", "Dec")
Activity <- c("Unemployed", "Unemployed", "Other", "Employed", "Employed", "Unemployed", "Employed",
"Employed", "Employed", "Unemployed", "Unemployed", "Unemployed", "Unemployed", "Unemployed",
"Unemployed", "Employed", "Employed", "Employed", "Other", "NA", "Unemployed", "Employed", "Unemployed",
"Employed", "Inactive", "Unemployed", "Employed", "Employed", "Employed", "Unemployed",
"Other", "Unemployed", "Employed", "Unemployed", "Other", "Other")
df <- data.frame(ID, Month, Activity)
df[order(ID),]
ID Month Activity
1 1 Jan Unemployed
4 1 Feb Employed
7 1 Mar Employed
10 1 Apr Unemployed
13 1 May Unemployed
16 1 Jun Employed
19 1 Jul Other
22 1 Aug Employed
25 1 Sep Inactive
28 1 Oct Employed
31 1 Nov Other
34 1 Dec Unemployed
2 2 Jan Unemployed
5 2 Feb Employed
8 2 Mar Employed
11 2 Apr Unemployed
14 2 May Unemployed
17 2 Jun Employed
20 2 Jul NA
23 2 Aug Unemployed
26 2 Sep Unemployed
29 2 Oct Employed
32 2 Nov Unemployed
35 2 Dec Other
3 3 Jan Other
6 3 Feb Unemployed
9 3 Mar Employed
12 3 Apr Unemployed
15 3 May Unemployed
18 3 Jun Employed
21 3 Jul Unemployed
24 3 Aug Employed
27 3 Sep Employed
30 3 Oct Unemployed
33 3 Nov Employed
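36 3 Dec Other
What I want to achieve is a new dataset with one row per period of employment, restricted to periods that are both preceded and followed by unemployment. In other words, I only want to include spells where a person goes from unemployment to employment and back to unemployment, and I want to be able to calculate the duration of each such spell. I also need the starting and ending month of each employment period. The final data file should look like this: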
ID Starting Ending Duration
1 1 Feb Mar 2
3 2 Feb Mar 2
5 2 Oct Oct 1
6 3 Mar Mar 1
7 3 Jun Jun 1
8 3 Aug Sep 2
Posted on 2021-08-18 09:51:34
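Collapsing consecutive months into spells only works if each person's rows are in calendar order, and months stored as plain character strings sort alphabetically (Apr, Aug, Dec, ...). A minimal sketch of one way to enforce calendar order, assuming months are stored as the three-letter English abbreviations used in the example; `sample_df` and `sorted_df` are hypothetical names with made-up values, not part of the original data:

```r
# Hypothetical, deliberately shuffled sample in the same shape as the example data
sample_df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2),
  Month = c("Mar", "Jan", "Feb", "Feb", "Mar", "Jan"),
  Activity = c("Employed", "Unemployed", "Employed",
               "Employed", "Employed", "Unemployed")
)

# month.abb is a base R constant: "Jan", "Feb", ..., "Dec".
# Using it as the factor levels makes sorting follow the calendar.
sample_df$Month <- factor(sample_df$Month, levels = month.abb)

# Sort by person first, then by calendar month within each person
sorted_df <- sample_df[order(sample_df$ID, sample_df$Month), ]
print(sorted_df)
```

If the real data spans more than one calendar year, a year column would also be needed in the sort key; that case is not covered here.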
To keep only those employment periods that fall between unemployment periods, you can use:
library(dplyr)
df %>%
  arrange(ID) %>%
  group_by(ID, grp = data.table::rleid(Activity)) %>%
  summarise(Activity = first(Activity),
            Starting = first(Month),
            Ending = last(Month),
            Duration = n()) %>%
  # Keep only those 'Employed' rows that are surrounded by 'Unemployed' ones
  filter(Activity == 'Employed',
         lag(Activity) == 'Unemployed', lead(Activity) == 'Unemployed') %>%
  ungroup %>%
  select(-grp, -Activity)
# ID Starting Ending Duration
# <int> <chr> <chr> <int>
#1 1 Feb Mar 2
#2 2 Feb Mar 2
#3 2 Oct Oct 1
#4 3 Mar Mar 1
#5 3 Jun Jun 1
#6 3 Aug Sep 2
https://stackoverflow.com/questions/68829871
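To see what the run-grouping step does, the same rule can be traced for a single person with base R's `rle()`. This is only an illustrative sketch using ID 1's activity history from the example data, not a replacement for the grouped pipeline; the variable names are chosen here for illustration:

```r
# Activity history for ID 1 from the example data, January through December
activity <- c("Unemployed", "Employed", "Employed", "Unemployed", "Unemployed",
              "Employed", "Other", "Employed", "Inactive", "Employed",
              "Other", "Unemployed")

# Collapse consecutive identical states into runs (values + lengths)
runs <- rle(activity)
n_runs <- length(runs$values)

# State of the run before and after each run (NA at the two edges)
prev_state <- c(NA, runs$values[-n_runs])
next_state <- c(runs$values[-1], NA)

# An employment spell qualifies only if both neighbouring runs are "Unemployed"
keep <- runs$values == "Employed" &
  !is.na(prev_state) & prev_state == "Unemployed" &
  !is.na(next_state) & next_state == "Unemployed"

# Index of the first and last month covered by each run
end_pos <- cumsum(runs$lengths)
start_pos <- end_pos - runs$lengths + 1

spells <- data.frame(
  Starting = month.abb[start_pos[keep]],
  Ending   = month.abb[end_pos[keep]],
  Duration = runs$lengths[keep]
)
print(spells)
```

For ID 1 only the Feb to Mar spell qualifies, because every later employment month borders `Other` or `Inactive` rather than `Unemployed` on at least one side.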