Here is what my data frame looks like:

Here is its dput structure:
structure(list(tier_1 = c("Organic Search", "Organic Search",
"Organic Search", "Organic Search", "Organic Search", "Organic Search",
"Organic Search", "Organic Search", "Organic Search", "Organic Search",
"Organic Social", "Organic Social", "Organic Social", "Organic Social",
"Organic Social", "Organic Social", "Organic Social", "Paid Search",
"Paid Search", "Paid Search", "Paid Search", "Paid Search", "Paid Search",
"Paid Search", "Paid Search", "Paid Search", "Paid Social", "Paid Social",
"Paid Social", "Paid Social", "Paid Social", "Paid Social", "Paid Social",
"Paid Social", "Paid Social"), sequence_number = c(1L, 2L, 3L,
4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), count_of_sequence_numbers = c(1176L, 460L, 119L, 41L, 21L,
5L, 8L, 6L, 2L, 1L, 133L, 52L, 11L, 2L, 2L, 1L, 1L, 7516L, 1090L,
284L, 90L, 36L, 21L, 12L, 6L, 2L, 1979L, 674L, 99L, 30L, 11L,
2L, 3L, 2L, 1L), percent = c(0.637744034707158, 0.249457700650759,
0.0645336225596529, 0.022234273318872, 0.0113882863340564, 0.0027114967462039,
0.00433839479392625, 0.00325379609544469, 0.00108459869848156,
0.000542299349240781, 0.655172413793103, 0.25615763546798, 0.0541871921182266,
0.00985221674876847, 0.00985221674876847, 0.00492610837438424,
0.00492610837438424, 0.827662151745402, 0.120030833608633, 0.0312740887567449,
0.00991080277502478, 0.00396432111000991, 0.00231252064750578,
0.0013214403700033, 0.000660720185001652, 0.000220240061667217,
0.704019921736037, 0.23977232301672, 0.0352187833511206, 0.0106723585912487,
0.00391319815012451, 0.000711490572749911, 0.00106723585912487,
0.000711490572749911, 0.000355745286374956)), row.names = c(NA,
-35L), groups = structure(list(tier_1 = c("Organic Search", "Organic Social",
"Paid Search", "Paid Social"), .rows = structure(list(1:10, 11:17,
18:26, 27:35), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, -4L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))

library(dplyr)

df <- df %>%
  group_by(tier_1, sequence_number) %>%
  summarize(count_of_sequence_numbers = length(sequence_number)) %>%
  mutate(percent = count_of_sequence_numbers / sum(count_of_sequence_numbers)) %>%
  filter(sequence_number <= 10)

Using the code above, I was able to get the percent column, specifically through the count / sum(count) part.
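However, I have a problem: the percentages are not correct. For sequence_number = 2, the value in count_of_sequence_numbers should be subtracted from the count_of_sequence_numbers value for sequence_number = 1 (within the same category). For sequence_number = 3, its count should likewise be subtracted from the counts for sequence_number = 1 and sequence_number = 2.
What I mean is that I need a count for each sequence number that excludes the higher ones: for sequence_number = 1 it excludes 2-10, for 2 it excludes 3-10, and so on. The 1176 value should really be 1176 - 460 - 119 - 41 - 21 - 5 - 8 - 6 - 2 - 1. The 460 value should be 460 - 119 - 41 - 21 - 5 - 8 - 6 - 2 - 1. The percentages should then be calculated from those values.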
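To make the target concrete, the arithmetic described above can be sketched in base R for the Organic Search group. The names `counts`, `target_1`, and `target_2` are introduced here for illustration only; the numbers are copied from the dput output.

```r
# Organic Search counts for sequence_number 1..10 (copied from the dput output)
counts <- c(1176L, 460L, 119L, 41L, 21L, 5L, 8L, 6L, 2L, 1L)

# Desired value for sequence_number = 1: its count minus every later count
target_1 <- counts[1] - sum(counts[2:10])   # 1176 - 663 = 513

# Desired value for sequence_number = 2: its count minus every later count
target_2 <- counts[2] - sum(counts[3:10])   # 460 - 203 = 257
```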
I tried the lead() function, but I don't think it is a valid approach. :/ That -1175 number in particular makes me nervous.
df <- df %>%
  group_by(tier_1) %>%
  arrange(tier_1, sequence_number) %>%
  mutate(diff = count_of_sequence_numbers -
           lead(count_of_sequence_numbers, default = first(count_of_sequence_numbers)))

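If I change it to lead(count_of_sequence_numbers, default = 0), I get better behavior, but it is still not what I want, which is to subtract from each value the sum of all other values in the same group that have a larger sequence number.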
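A minimal sketch of why both variants fall short, using the Organic Search counts as a plain vector (the names `counts`, `diff_first`, and `diff_zero` are illustrative). With `default = first(...)`, the last row has no successor and is compared against the group's first value, which is where -1175 comes from. With a default of zero that artifact goes away, but each row still subtracts only its immediate successor.

```r
library(dplyr)

# Organic Search counts for sequence_number 1..10 (copied from the dput output)
counts <- c(1176L, 460L, 119L, 41L, 21L, 5L, 8L, 6L, 2L, 1L)

# default = first(counts): the last element is paired with the first value
diff_first <- counts - lead(counts, default = first(counts))
diff_first[10]   # 1 - 1176 = -1175

# default = 0: the last element is left as is, but every other element
# subtracts only the next count, not the sum of all later counts
diff_zero <- counts - lead(counts, default = 0L)
diff_zero[1]     # 1176 - 460 = 716, whereas the desired value is 513
```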

Posted on 2021-06-13 01:18:32
Is this the output you are looking for?
df %>%
  arrange(tier_1, -sequence_number) %>%
  group_by(tier_1) %>% # already grouped this way, only including for clarity
  mutate(cuml = cumsum(lag(count_of_sequence_numbers, default = 0)),
         diff = count_of_sequence_numbers - cuml) %>%
  ungroup()
## A tibble: 35 x 6
#   tier_1         sequence_number count_of_sequence_numbers  percent  cuml  diff
#   <chr>                    <int>                     <int>    <dbl> <dbl> <dbl>
# 1 Organic Search              10                         1 0.000542     0     1
# 2 Organic Search               9                         2  0.00108     1     1
# 3 Organic Search               8                         6  0.00325     3     3
# 4 Organic Search               7                         8  0.00434     9    -1
# 5 Organic Search               6                         5  0.00271    17   -12
# 6 Organic Search               5                        21   0.0114    22    -1
# 7 Organic Search               4                        41   0.0222    43    -2
# 8 Organic Search               3                       119   0.0645    84    35
# 9 Organic Search               2                       460    0.249   203   257
#10 Organic Search               1                      1176    0.638   663   513
## … with 25 more rows

https://stackoverflow.com/questions/67951014
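
The same subtraction can also be written without reversing the row order, by taking a reversed cumulative sum within each group. This is only a sketch of an alternative formulation, not part of the answer itself; `df_small`, `tail_sum`, and `result` are illustrative names, and the rows are the Organic Social counts from the dput output.

```r
library(dplyr)

# Illustrative subset: the Organic Social rows from the dput output
df_small <- data.frame(
  tier_1 = "Organic Social",
  sequence_number = 1:7,
  count_of_sequence_numbers = c(133L, 52L, 11L, 2L, 2L, 1L, 1L)
)

result <- df_small %>%
  group_by(tier_1) %>%
  arrange(sequence_number, .by_group = TRUE) %>%
  mutate(
    # sum of the count at this row and at every later row in the group
    tail_sum = rev(cumsum(rev(count_of_sequence_numbers))),
    # subtract only the strictly later rows from the current count
    diff = count_of_sequence_numbers - (tail_sum - count_of_sequence_numbers)
  ) %>%
  ungroup()

result$diff
```

As in the output above, some of the resulting values can be negative or zero for these data, so it is worth checking them before computing percentages from `diff`.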