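I am using the following two functions to find country names in a string, put the matched name into a new column of the data frame, and then remove the country name from the original string: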
library("stringr")
ListofCountries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv",header=T,sep=",")
CoffeeTable <- data.frame(Product=c("Kenya Ndumberi", "Kenya Ndumberi", "Finca Nombre de Dios", "Finca La Providencia", "Las Penidas", "Las Penidas", "Las Penidas", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Panama Duncan", "Progresso", "Progresso", "Progresso", "Progresso", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "Finca El Injerto", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "\nEl Socoro Reserva Don Diego", "El Socoro Reserva Don Diego", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Bufcafe Natural Sundried Microlot", "Geisha", "Geisha", "Geisha", "Pacamara", "Pacamara", "Pacamara", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Pacamara", "Bolivia", "Brazil yellow bourbon pea berry", "Finca El Vintilador", "\nWashed Yirgacheffe", "Finca El Vintilador", "Washed Yirgacheffe", "Washed Yirgacheffe", "Washed Yirgacheffe", "Leza", "Finca La Libertad", "Pacamara", "Pacamara", "Pacamara", "Finca La Bolsa", "Thunguri Kenya", "Thunguri Kenya", "Thunguri Kenya", "Thiriku Nyeri", "Thiriku Nyeri", "Thiriku Nyeri", "Pedregal", "Pedregal", "Barrel Aged", "Pedregal", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "Amigo de Buesaco", "Amigo de Buesaco", "Amigo de Buesaco", "Barrel Aged", "Toarco Jaya Peaberry Sulawesi", "\nToarco Jaya Peaberry Sulawesi", "El Cypress", "El Cypress", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro", "Kenya Kia Oro"))
CoffeeTable$Country <- str_trim(str_match(tolower(CoffeeTable$Product),
tolower(paste(ListofCountries, collapse="|")))[,1])
CoffeeTable$Product <- str_trim(gsub(tolower(paste(ListofCountries, collapse="|")), replacement="",
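CoffeeTable$Product, ignore.case=T))
Question 1 - This is very slow. How can I make these functions faster?
Question 2 - This only picks up the official names of countries. Does anyone know of a good list of common country names? (e.g. "China" vs. "People's Democratic Republic of China")
Thanks!
Edit: I have listed 90 coffee names here to make this a reproducible example. I should add that in my real application CoffeeTable already exists and has about 2,000 rows and 45 columns. I am not looking for a faster way to construct the data.frame, etc.
Thanks!
Edit 2: Question 2 has been answered. Now I am only trying to optimize the two functions so that they do not take 5-10 seconds to run!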
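Posted on 2013-12-18 05:56:19
For the second question, there is an extensive list of options here. Try this:
countries <- read.table(file="https://raw.github.com/umpirsky/country-list/master/country/cldr/en/country.csv",header=T,sep=",")
Edit: In response to the OP's comment.
Given the sample data you provided, replicated 25X to create about the same number of rows as your real data, your code runs in about 1.6 seconds. It is hard to believe there is an 8-fold difference between your system and mine, so something else must be going on.
The only thing I can recommend is looking at strapplyc(...) in the gsubfn package. It is supposed to be very efficient, but on my system it is actually slower than your code.
See the code below for the example and the benchmarks. Sorry I couldn't be of more help.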
library(stringr)
df <- CoffeeTable
df$Product=as.vector(df$Product)
df=rbind(df,df,df,df,df) # replicate 5X
df=rbind(df,df,df,df,df) # 5X again: 25X in total, rows = 2250
pattern <- tolower(paste(ListofCountries$name,collapse="|"))
f1 = function(df){
df$Country <- str_trim(str_match(tolower(df$Product), pattern)[,1])
df$Product <- str_trim(gsub(pattern, "",df$Product, ignore.case=T))
return(df)
}
library(gsubfn)
library(tcltk2)
f2 = function(df){
df$Country <- strapplyc(tolower(df$Product),pattern)
df$Product <- str_trim(gsub(pattern,"", df$Product, ignore.case=T))
return(df)
}
library(microbenchmark)
microbenchmark(df1<-f1(df),df2<-f2(df),times=10)
# Unit: seconds
# expr min lq median uq max neval
# df1 <- f1(df) 1.365222 1.506017 1.611458 1.689611 1.722626 10
# df2 <- f2(df) 2.006162 2.055963 2.148158 2.249707 2.285955 10
Posted on 2013-12-19 09:34:49
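OK, back to the first question. This may not be the most efficient solution, but it works.
The first thing I would suggest is to specify stringsAsFactors = FALSE when you initially generate the CoffeeTable data frame. Otherwise you end up having to deal with factors. I have also renamed the initial data column in this table to Composite, so that you can see the result of the separation.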
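A minimal sketch of that setup follows. The three-element products vector is a shortened, hypothetical stand-in for the full product list from the question, used here only to keep the example small:

```r
# Shortened stand-in for the product list in the question (illustrative only).
products <- c("Kenya Ndumberi", "Finca Nombre de Dios", "Panama Duncan")

# stringsAsFactors = FALSE keeps Composite as a character column
# instead of converting it to a factor.
CoffeeTable <- data.frame(Composite = products, stringsAsFactors = FALSE)

# Inspect the column types: Composite should be reported as chr.
str(CoffeeTable)
```

On R 4.0.0 and later stringsAsFactors already defaults to FALSE, so the argument mainly matters on older versions such as the ones current when this was posted.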
match <- gregexpr(tolower(paste(ListofCountries$name, collapse="|")),
tolower(CoffeeTable$Composite))
CoffeeTable$Country <- sapply(regmatches(CoffeeTable$Composite, match),
function(m) {ifelse(length(m), m, "")})
CoffeeTable$Product <- sapply(regmatches(CoffeeTable$Composite, match, invert = TRUE),
function(m) {paste0(m, collapse = "")})
The result looks like this:
> head(CoffeeTable, 10)
              Composite Country              Product
1        Kenya Ndumberi   Kenya             Ndumberi
2        Kenya Ndumberi   Kenya             Ndumberi
3  Finca Nombre de Dios         Finca Nombre de Dios
4  Finca La Providencia         Finca La Providencia
5           Las Penidas                  Las Penidas
6           Las Penidas                  Las Penidas
7           Las Penidas                  Las Penidas
8         Panama Duncan  Panama               Duncan
9         Panama Duncan  Panama               Duncan
10        Panama Duncan  Panama               Duncan
https://stackoverflow.com/questions/20649093