下面我已经提到了R.
ID Amount Date
IK-1 100 2020-01-01
IK-2 110 2020-01-02
IK-3 120 2020-01-03
IK-4 109 2020-01-03
IK-5 104 2020-01-03我使用ID从MySQL中获取一些使用以下代码的详细信息。
library(RMySQL)
conn<- connection
query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
where c.ID IN (", paste(shQuote(dataframe$ID, type = "sh"),
collapse = ', '),")
and e.Parameters in
('Section1',
'Section2','Section3',
'Section4');")
res1 <- dbGetQuery(conn,query)
res2<-res1[res1$Parameters=="Section1",4:5]
colnames(res2)[colnames(res2)=="status"] <- "Section1_Status"上面的代码运行良好,如果我传递了~1000个ID,但是当每次传递10000或更多ID时,它会引发R终止错误。
如何创建一个循环并分批传递ID以获得10000 Id的最后一个输出。
错误信息:
Warning message:
In dbFetch(rs, n = n, ...) : error while fetching rows发布于 2020-02-12 16:16:41
# Load Packages
library(dplyr) # only needed to create the initial dataframe
library(RMySQL)
# create the initial dataframe
df <- tribble(
~ID, ~Amount, ~Date
, "IK-1" , 100 , 2020-01-01
, "IK-2" , 110 , 2020-01-02
, "IK-3" , 120 , 2020-01-03
, "IK-4" , 109 , 2020-01-03
, "IK-5" , 104 , 2020-01-03
)
# first helper function
createIDBatchVector <- function(x, batchSize){
paste0(
"'"
, sapply(
split(x, ceiling(seq_along(x) / batchSize))
, paste
, collapse = "','"
)
, "'"
)
}
# second helper function
createQueries <- function(IDbatches){
paste0("
SELECT c.ID,e.Parameters, d.status
FROM Table1 c
LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
LEFT OUTER JOIN Table3 e ON e.role_id = d.role
WHERE c.ID IN (", IDbatches,")
AND e.Parameters in ('Section1','Section2','Section3','Section4');
")
}
# ------------------------------------------------------------------
# and now the actual script
# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(df$ID, 2)
# It looks like this:
# [1] "'IK-1','IK-2'" "'IK-3','IK-4'" "'IK-5'"
# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)
cat(queries) # use cat to show what they look like
# it looks like this:
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
# LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
# LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-1','IK-2')
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
# LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
# LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-3','IK-4')
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
#
# SELECT c.ID,e.Parameters, d.status
# FROM Table1 c
# LEFT OUTER JOIN Table2 d ON d.seq_id =c.ID
# LEFT OUTER JOIN Table3 e ON e.role_id = d.role
# WHERE c.ID IN ('IK-5')
# AND e.Parameters in ('Section1','Section2','Section3','Section4');
# and now the loop
df_final <- data.frame() # initialize a dataframe
conn <- connection # open a connection
for (query in queries){ # iterate over the queries
df_final <- rbind(df_final, dbGetQuery(conn,query))
}
# And here the connection should be closed. (I don't know the function call for this.)发布于 2020-02-07 16:35:51
在SQL查询之前将ID的数据帧传递到临时表中,然后使用它在您正在使用的ID上进行内部连接,这样可以避免循环。您所要做的就是使用dbWriteTable并在调用它时设置参数temporary = TRUE。
例:
library(DBI)
library(RMySQL)
con <- dbConnect(RMySQL::MySQL(), user='user',
password='password', dbname='database_name', host='host')
#here we write the table into the DB and then declare it as temporary
dbWriteTable(conn = con, value = dataframe, name = "id_frame", temporary = T)
res1 <- dbGetQuery(con = conn, "SELECT c.ID,e.Parameters, d.status
FROM Table1 c
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
Inner join id_frame idf on idf.ID = c.ID
and e.Parameters in
('Section1',
'Section2','Section3',
'Section4');")这将提高代码的性能,并且不再需要使用where语句在R中循环。如果不正常请告诉我。
发布于 2020-02-06 20:54:57
正如@A.Suliman提供的链接所显示的,这很可能是因为in -子句中有大量的值。以下是一些尝试的解决方案:
批处理
我是一个球迷使用模块化的批处理过程。这假设您正在批处理的ID值是数字的:
num_batches = 100
output_list = list()
for(i in 1:num_batches){
this_subset = filter(dataframe, ID %% num_batches == (i-1))
# subsequent processing using this_subset
output_list[i] = results_from_subsetting
}
output = data.table::rbindlist(output_list)在您的例子中,ID看起来是XX-123形式(两个字符,一个连字符,后面跟着一些数字)。您可以使用:just_number_part = substr(ID, 4, nchar(ID))将其转换为一个数字。
临时文件写入
如果要将dataframe从R写到sql,那么就不需要这么大的IN-子句,而可以使用join。dbplyr包包括一个可用于向数据库写入临时表的函数copy_to。
这看起来应该是:
library(RMySQL)
library(dbplyr)
conn<- connection
copy_to(conn, dataframe, name = "my_table_name") # copy local table to mysql
query<-paste0("SELECT c.ID,e.Parameters, d.status
FROM Table1 c
INNER JOIN my_table_name a ON a.ID = c.ID # replace IN-clause with inner join
left outer join Table2 d ON d.seq_id=c.ID
LEFT outer JOIN Table3 e ON e.role_id=d.role
WHERE e.Parameters in
('Section1',
'Section2','Section3',
'Section4');")
res1 <- dbGetQuery(conn,query)作为参考,我推荐tidyverse文档。您可能还会发现这个问题在使用copy_to编写时对调试很有帮助。
增加超时延迟
当一个-子句中有很多值时,执行查询的速度会慢得多,因为-子句本质上被转换为一个OR语句序列。
根据此链接,您可以通过以下方式更改MySQL的超时选项:
wait_timeout = 28800interactive_timeout = 28800
https://stackoverflow.com/questions/60035473
复制相似问题