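I have a large dataset. I want to split it into n sub-datasets of equal size; if the total does not divide evenly, the last sub-dataset may be smaller than the others. Each sub-dataset should then be written to the working directory as a CSV file.
Here is a small example: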
set.seed(1234)
mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
mydf
   X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 X13
1   3  7  1  9  6  4  7  5  8   2   2   2   8
2   5  3  4  6  9  5  3 10  5   8  10   2  10
3   4  6 10  4  4  6  3  4  2   9   9   2   9
4  10 10  9  4  3  7  7  7 10   6   7  10   2
5  10  3  9  3  2 10  9  6  4   4   4   6   3
6   7  2  8  7  5  5 10 10  9   3   7   8   4
7   3  2  2  7 10  9  2  2 10   1   1  10   4
8   3  9  9  7  3  1  7  6 10   3  10   3   2
9   9  3  6  9  3  2  2  3  4   2   9  10  10
10  6  4  3  3  5  9  3  9 10   7   4   6  10
I want to create a function that randomly splits the dataset into n subsets (in this example, assume a size of 3: with 13 columns, four datasets get 3 columns each and the last one gets 1 column) and writes each subset out as a separate text file.
Here is what I did:
set.seed(123)
reshuffled <- sample(1:length(mydf),length(mydf), replace = FALSE)
# just crazy manual divide
group1 <- reshuffled[1:3]; group2 <- reshuffled[4:6]; group3 <- reshuffled[7:9]
group4 <- reshuffled[10:12]; group5 <- reshuffled[13]
# just manual
data1 <- mydf[,group1]; data2 <- mydf[,group2]  # ... and so on
# I want to write the dimensions of the dataset in the first row of each dataset
cat (dim(data1))
write.csv(data1, "data1.csv"); write.csv(data2, "data2.csv")  # ... and so on
Since I have to generate 100 sub-datasets, can this process be done in a loop?
Answered 2012-06-01 06:20:01
There may be a cleaner, simpler solution, but you could try the following:
mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
## Number of columns for each sub-dataset
size <- 3
nb.cols <- ncol(mydf)
nb.groups <- nb.cols %/% size
reshuffled <- sample.int(nb.cols, replace=FALSE)
groups <- c(rep(1:nb.groups, each=size), rep(nb.groups+1, nb.cols %% size))
dfs <- lapply(split(reshuffled, groups), function(v) mydf[,v,drop=FALSE])
for (i in seq_along(dfs)) write.csv(dfs[[i]], file=paste("data",i,".csv",sep=""))
Answered 2012-06-01 07:12:55
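As a side note on the split()/lapply() answer above: the question also asked for a function and for each file to begin with the sub-dataset's dimensions. The following is only a minimal sketch of how that might look. The function name `split_columns_to_csv` and its arguments `dir` and `prefix` are hypothetical, and writing the dimensions as a plain first line (so the file is no longer a pure CSV) is an assumption about what the asker wanted.

```r
# Hypothetical helper: name and arguments are illustrative, not from the answers.
split_columns_to_csv <- function(df, size, dir = ".", prefix = "data") {
  nb.cols <- ncol(df)
  # Group labels 1,1,1,2,2,2,...; the last group is shorter when
  # 'size' does not divide the number of columns.
  groups <- ceiling(seq_len(nb.cols) / size)
  # Shuffle the column indices, then cut them into the groups.
  pieces <- split(sample.int(nb.cols), groups)
  files <- character(length(pieces))
  for (i in seq_along(pieces)) {
    sub <- df[, pieces[[i]], drop = FALSE]
    files[i] <- file.path(dir, paste0(prefix, i, ".csv"))
    con <- file(files[i], open = "w")
    # First line: number of rows and columns of this sub-dataset.
    writeLines(paste(dim(sub), collapse = " "), con)
    # The data follows below the dimension line.
    write.csv(sub, con, row.names = FALSE)
    close(con)
  }
  invisible(files)
}

set.seed(123)
mydf <- data.frame(matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
# tempdir() keeps the example from cluttering the working directory;
# use dir = getwd() to write there instead.
files <- split_columns_to_csv(mydf, size = 3, dir = tempdir())
length(files)  # 5 files: four with 3 columns, one with 1 column
```

To read such a file back, the dimension line has to be skipped, for example with read.csv(files[1], skip = 1).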
Just for fun, and probably slower than juba's answer:
mydf <- data.frame (matrix(sample(1:10, 130, replace = TRUE), ncol = 13))
size <- 3
by(t(mydf),
   INDICES=sample(as.numeric(gl((ncol(mydf) %/% size) + 1, size, ncol(mydf))),
                  ncol(mydf),
                  replace=FALSE),
   FUN=function(x) write.csv(t(x), paste(rownames(x), collapse='-'), row.names=FALSE))
Answered 2017-08-27 20:52:58
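To divide 'mydf' into n nearly equal parts, I took inspiration from this question and its answers: link.
It creates partition sizes such that the difference between the smallest and the largest partition is as small as possible. In this example, that difference equals 1. Example:
Partition method 1 - using the 'floor' function (reproducible code not shown here). To divide 100 rows into 7 nearly equal parts, sample floor(100/7) = 14 indices in each of the first 6 iterations. The seventh part is the remainder. This gives the following result:
14, 14, 14, 14, 14, 14, 16. Sum = 100, max difference = 2
Partition method 2 - using the 'ceiling' function (reproducible code not shown here). Using the 'ceiling' function instead of the 'floor' function gives a similar result:
15, 15, 15, 15, 15, 15, 10. Sum = 100, max difference = 5
Partition method 3 - using the formula referenced above. With the steps below, the vector of partition sizes ('sequence_diff') is:
14, 14, 14, 15, 14, 14, 15. Sum = 100, max difference = 1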
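The three sets of sizes above can be reproduced with a few lines. This is only an illustrative sketch: the variable names (`n_rows`, `k`, `sizes_floor`, `sizes_ceiling`, `sizes_even`) are made up here, and methods 1 and 2 are written the way the description suggests, since no code was shown for them.

```r
n_rows <- 100  # number of rows to divide
k <- 7         # number of partitions

# Method 1: floor(n/k) rows for the first k - 1 parts, remainder in the last part
sizes_floor <- c(rep(floor(n_rows / k), k - 1),
                 n_rows - (k - 1) * floor(n_rows / k))

# Method 2: ceiling(n/k) rows for the first k - 1 parts, remainder in the last part
sizes_ceiling <- c(rep(ceiling(n_rows / k), k - 1),
                   n_rows - (k - 1) * ceiling(n_rows / k))

# Method 3: differences between the cumulative boundaries floor(n * i / k), i = 0..k
sizes_even <- diff(floor(n_rows * (0:k) / k))

# Print each size vector with its sum and the gap between largest and smallest part
for (s in list(sizes_floor, sizes_ceiling, sizes_even)) {
  cat(s, "| sum =", sum(s), "| max difference =", max(s) - min(s), "\n")
}
```

Because consecutive boundaries in method 3 are both floors of evenly spaced values, each size is either floor(n/k) or ceiling(n/k), which is why the largest and smallest parts never differ by more than 1.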
R code:
set.seed(1234)
#I increased the number of rows in the data frame to 100
mydf <- data.frame (matrix(sample(x = 1:100, size = 1300, replace = TRUE),
ncol = 13))
index_list <- list() #Will store the indices for all partitions
indices <- 1:nrow(mydf) #Initially contains all indices for the dataset 'mydf'
numb_partitions <- 7 #Specifies the number of partitions
sequence <- floor(((nrow(mydf)*1:numb_partitions)/numb_partitions))
sequence <- c(0, sequence)
#'sequence_diff' will contain the number of instances for each partition.
sequence_diff <- vector()
for(j in 1:numb_partitions){
  sequence_diff[j] <- sequence[j+1] - sequence[j]
}
#Inspect 'sequence_diff' and verify its elements sum up to the total
#number of rows in 'mydf' (100).
sequence_diff
#[1] 14 14 14 15 14 14 15
sum(sequence_diff)
#[1] 100 #Correct!
for(i in 1:numb_partitions){
  #Use a different seed for each sampling iteration.
  set.seed(seed = i)
  #Sample 'sequence_diff[i]' indices (about 1/'numb_partitions' of the rows)
  #from object 'indices'.
  indices_partition <- sample(x = indices,
                              size = sequence_diff[i],
                              replace = FALSE)
  #Remove the selected indices from 'indices' so these indices will not be
  #selected in successive iterations.
  indices <- setdiff(x = indices, y = indices_partition)
  #Store the indices for the i-th iteration in the list 'index_list'. This
  #is just to verify later that the procedure has divided all indices into
  #'numb_partitions' disjoint sets.
  index_list[[i]] <- indices_partition
  #Dynamically create a new object that is named 'mydfx' in which x is the
  #i-th partition.
  assign(x = paste0("mydf", i), value = mydf[indices_partition,])
  #write.csv() fixes the separator and always writes column names, so 'sep'
  #and 'col.names' are not passed.
  write.csv(x = get(x = paste0("mydf", i)), #Dynamically get the object from environment.
            file = paste0("mydf", i,".csv"), #Dynamically assign a name to the csv-file.
            row.names = FALSE)
}
#Check whether all index subsets are mutually exclusive: union should have 100
#unique elements.
length(unique(unlist(index_list)))
#[1] 100 #Correct!
https://stackoverflow.com/questions/10841859