I have input data like this:
df <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product","yahoo product, amazon","yahoo product stock", "google stock"))
I would like to get a result like this:
df <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product stock","yahoo product stock, amazon","yahoo product stock", "google stock"))
                         combination frequency
1 google stock - yahoo product stock         2
2       amazon - yahoo product stock         2
3                yahoo product stock         1
4                       google stock         1
I tried this:
library(tidyverse)
df %>%
  separate_rows(stocks, sep = ",") %>%
  full_join(df %>%
              separate_rows(stocks), by = c("id" = "id")) %>%
  filter(stocks.x != stocks.y) %>%
  count(stocks.x, stocks.y) %>%
  transmute(stocks = paste(pmax(stocks.x, stocks.y), pmin(stocks.x, stocks.y), sep = "-"),
            n) %>%
  distinct(stocks, .keep_all = TRUE)
But I got this result:
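# A tibble: 16 x 2
   stocks                          n
 1 amazon yahoo product            2
 2 product yahoo product           2
 3 yahoo product                   2
 4 google yahoo product stock      2
 5 product-yahoo product stock     2
 6 stock yahoo product stock       4
 7 yahoo product stock             2
 8 product-amazon                  2
 9 yahoo-amazon                    2
10 google stock-google             3
11 product-google stock            2
12 stock-google stock              5
13 yahoo-google stock              2
14 yahoo product stock-product     1
15 yahoo product stock-stock       1
16 yahoo product stock-yahoo       1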
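Using table() is not the best solution in my case, because my real dataset is much larger.
Posted on 2019-05-13 14:30:04
Are you looking for something like this (below)? If so, I will comment each step. Basically, it splits the strings on the commas, trims the whitespace, sorts the split parts, collapses them back together with " - ", and uses dplyr functions to get the counts. I made a lot of assumptions, so let me know if it doesn't work for you. Also, depending on the number of groups, doing this in data.table may be faster, but I stuck with dplyr since that is what you were using. Good luck!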
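To make those steps concrete, here is a minimal base R sketch on two strings from the question's data. The helper name `make_key` is only an illustration and does not appear in the original answer. The point is that sorting the parts before collapsing makes the order within a row irrelevant.

```r
# Hypothetical helper (name chosen for illustration only):
# split on commas, trim whitespace, sort, and collapse with " - "
make_key <- function(x) {
  parts <- strsplit(x, ",", fixed = TRUE)[[1]]  # split on the literal comma
  parts <- sort(trimws(parts))                  # drop stray spaces, then sort
  paste(parts, collapse = " - ")                # rebuild one combined key
}

key_a <- make_key("amazon, yahoo product")
key_b <- make_key("yahoo product, amazon")

print(key_a)
print(identical(key_a, key_b))
```

Both calls should give the same key, which is what lets rows 3 and 4 be counted as one combination.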
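library(dplyr)
# Split each string on commas, trim whitespace, and sort the parts
split_stock <- lapply(strsplit(as.character(df1$stocks), ",", fixed = TRUE), function(x) sort(trimws(x)))
# Collapse the sorted parts into a single key such as "amazon - yahoo product"
df1$stocks2 <- sapply(split_stock, paste0, collapse = " - ")
# Count how often each key occurs, most frequent first
df1 %>%
  group_by(stocks2) %>%
  count() %>%
  arrange(desc(n))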
# A tibble: 4 x 2
  stocks2                                n
  <chr>                              <int>
1 amazon - yahoo product                 2
2 google stock - yahoo product stock     2
3 google stock                           1
4 yahoo product stock                    1
Data
df1 <- data.frame(id = c(1,2,3,4,5,6), stocks = c("google stock, yahoo product stock", "google stock, yahoo product stock","amazon, yahoo product","yahoo product, amazon","yahoo product stock", "google stock"))
Posted on 2019-05-13 14:38:03
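You don't need to use full_join().
Use separate_rows() to identify all the company stocks for each id, then use group_by() / summarise() with paste(collapse = ' ') to concatenate the different possibilities in the stocks variable. Finally, use count() as you need.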
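As a small illustration of the first step, the sketch below shows what separate_rows() does to a single row. The one-row data frame `one_row` is made up for the example. With its default separator, separate_rows() splits on any run of non-alphanumeric characters, so spaces split the text as well as commas. That would explain why the words 'stock' and 'product' show up as separate rows and need filtering. Passing a comma-based `sep` is one possible alternative if the full names should be kept; it is an assumption of this sketch, not part of the answer.

```r
library(tidyr)

# Made-up single-row input, using one string from the question
one_row <- data.frame(id = 1, stocks = "google stock, yahoo product stock")

# Default separator: splits on spaces and commas alike
by_word <- separate_rows(one_row, stocks)

# Assumed alternative: split only on a comma plus optional whitespace
by_name <- separate_rows(one_row, stocks, sep = ",\\s*")

print(by_word$stocks)
print(by_name$stocks)
```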
df %>%
  separate_rows(stocks) %>%
  filter(!stocks %in% c('stock', 'product')) %>%
  group_by(id) %>%
  summarise(group_stocks = paste(sort(stocks), collapse = ' ')) %>%
  count(group_stocks)
#   group_stocks     n
#   <chr>        <int>
# 1 amazon yahoo     2
# 2 google           1
# 3 google yahoo     2
# 4 yahoo            1
https://stackoverflow.com/questions/56114196