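My dataset is based on disease counts. Many of the variables are categorical, such as WeekSeries, MonthSeries and YearSeries. These labels indicate which week, month and year of my time series a disease count belongs to.
The problem I'm facing is building another data table that sums the counts by WeekSeries, MonthSeries and YearSeries. I need my method to decide whether WeekSeries 1 will be coded as TS1 = 1 or TS2 = 1. For example, in the raw data you can see that the third observation is in TS2 rather than TS1, and because it is in TS2 it also has HolidaysPerSeason = 10.
I would like the method to decide that if the majority of the observations in WeekSeries 1 have TS1 = 1 and HolidaysPerSeason = 11, then that becomes the final category for WeekSeries = 1.
Raw data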
WeekSeries Counts TS1 TS2 TS3 TS4 TS5 TS6 HolidaysPerSeason
1 0 1 0 0 0 0 0 11
1 1 1 0 0 0 0 0 11
1 1 0 1 0 0 0 0 10
Ideal format
WeekSeries Counts TS1 TS2 TS3 TS4 TS5 TS6 HolidaysPerSeason
1 2 1 0 0 0 0 0 11
This format is necessary for building regression models and other analyses.
Here is fake data that resembles my real data:
# a couple of the variables within my data
JulianDate<-c(10985, 10986,10987)
DateRcd<-c(NA,NA,"2000-01-31")
Counts<-c(0,1,1)
Day<-c("Sat","Sun","Mon")
Weekend<-c(1,1,0)
Season<-c(1,1,2)
HolidaysPerSeason<-c(11,11,10)
TS1<-c(1,1,0)
TS2<-c(0,0,1)
TS3<-c(0,0,0)
TS4<-c(0,0,0)
TS5<-c(0,0,0)
TS6<-c(0,0,0)
WeekSeries<-c(1,1,1)
YearSeries<-c(1,1,1)
MonthSeries<-c(1,1,1)
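library(data.table) # needed for data.table()
mydata<-data.table(JulianDate,DateRcd,Counts,Day,Weekend,Season,HolidaysPerSeason, TS1,TS2,TS3,TS4,TS5,TS6,YearSeries,MonthSeries,WeekSeries) #data simulation
I tried to aggregate by WeekSeries with data.table and then merge the result back into the original data to build the ideal format for analysis.
My closest attempt so far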
install.packages("data.table")
library(data.table)
DT <- data.table(mydata)
mydata1<-DT[, by = list(WeekSeries)] #doesn't work
mydata2<-DT[,sum(CountsofCholera), by=WeekSeries] #loses all the other variables
idealdata<-merge(mydata2,mydata,by.x=mydata2$WeekSeries) #attempts to regain the lost variable, this doesn't work because the datasets are not the same length
What can I do to recover the other categorical variables?
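On the merge attempt: merge() takes the name of the shared column in by, not a vector of values. A minimal sketch, assuming a cut-down copy of the simulated data with Counts as the count column (TotalCounts is a hypothetical name introduced here), suggests that even a corrected merge returns one row per original observation, so a separate step that picks the majority category would still be needed:

```r
library(data.table)

# Assumption: a cut-down version of the simulated data above
DT <- data.table(
  WeekSeries        = c(1, 1, 1),
  Counts            = c(0, 1, 1),
  TS1               = c(1, 1, 0),
  TS2               = c(0, 0, 1),
  HolidaysPerSeason = c(11, 11, 10)
)

# Weekly totals; TotalCounts is a hypothetical column name
weekly <- DT[, .(TotalCounts = sum(Counts)), by = WeekSeries]

# by takes the name of the shared key column
merged <- merge(weekly, DT, by = "WeekSeries")

# The merge repeats the weekly total on every original row
nrow(merged)
```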
Posted on 2016-03-17 16:14:56
This could be optimized in a few places, but it should give you the basic idea:
# sum up counts and count number of rows with identical values for the last several columns
DT[, .(Count = sum(Counts), .N), by = c(tail(names(DT), -4))][
# assign same count number = total count to each row within same WeekSeries
, Count := sum(Count), by = WeekSeries][
# extract most frequent row (i.e. one with largest N, computed in line 1)
, .SD[which.max(N)], by = WeekSeries]
# WeekSeries Weekend Season HolidaysPerSeason TS1 TS2 TS3 TS4 TS5 TS6 YearSeries MonthSeries Count N
#1: 1 1 1 11 1 0 0 0 0 0 1 1 2 2
Posted on 2016-03-17 16:09:44
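Is group_by what you're looking for? Something like this, for example? You will need dplyr and data.table installed.
library(dplyr) # provides %>%, group_by, summarise and n
# count = number of rows in each (WeekSeries, TS1, HolidaysPerSeason) group
mydata_new <- mydata %>% group_by(WeekSeries, TS1, HolidaysPerSeason) %>% summarise(count = n())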
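The group_by call above counts rows per combination but does not yet sum Counts or reduce each week to a single row. One possible way to extend it is sketched below, assuming a cut-down copy of the simulated data; TotalCounts and n_obs are hypothetical column names, and ties between equally frequent combinations are broken by taking the first one.

```r
library(dplyr)

# Assumption: a cut-down version of the simulated data from the question
mydata <- data.frame(
  WeekSeries        = c(1, 1, 1),
  Counts            = c(0, 1, 1),
  TS1               = c(1, 1, 0),
  TS2               = c(0, 0, 1),
  HolidaysPerSeason = c(11, 11, 10)
)

weekly <- mydata %>%
  group_by(WeekSeries) %>%
  mutate(TotalCounts = sum(Counts)) %>%          # weekly total on every row
  group_by(WeekSeries, TS1, TS2, HolidaysPerSeason, TotalCounts) %>%
  summarise(n_obs = n(), .groups = "drop") %>%   # rows per category combination
  group_by(WeekSeries) %>%
  slice_max(n_obs, n = 1, with_ties = FALSE) %>% # keep the majority combination
  ungroup()

weekly
```

With the real data, the remaining TS columns and any other categorical variables would be added to the second group_by call.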
https://stackoverflow.com/questions/36065343