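Beginner here. I have a large dataframe with many columns in which some values are misplaced, but each misplaced value is at least prefixed with the correct column name. Imagine a data file like this: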
Country <- c("Spain", "Time:16 Mar 2018 - 23 Apr 2018", "USA")
Platform <- c("Twitter", "Country:Germany", "Cap:200")
Start_Time <- c("10 Jun 2018 - 2 Jul 2018", "Platform:Facebook", "Platform:Instagram")
Cap <- c("300", "500", "Time:10 Jun 2018 - 2 Jul 2018")
dat <- data.frame(Country, Platform, Start_Time, Cap)

Output:
Country                         Platform         Start_Time                Cap
Spain                           Twitter          10 Jun 2018 - 2 Jul 2018  300
Time:16 Mar 2018 - 23 Apr 2018  Country:Germany  Platform:Facebook         500
USA                             Cap:200          Platform:Instagram        Time:10 Jun 2018 - 2 Jul 2018

As you can see, when a value is misplaced, the correct column name is set in front of it (or at least an indication close to it, as with Start_Time and Time:).
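How can I move these values into their respective columns? My original dataframe has 729 rows and 45 columns, so the less manual work the better. The correct output should look like this: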
Output:
Country  Platform   Start_Time                 Cap
Spain    Twitter    10 Jun 2018 - 2 Jul 2018   300
Germany  Facebook   16 Mar 2018 - 23 Apr 2018  500
USA      Instagram  10 Jun 2018 - 2 Jul 2018   200

Thanks a lot.
Edit: this is the output of dput(df_short_head), which holds the first 6 rows of my original data:
structure(list(Presale_Time = c("16 Apr 2018 - 30 Apr 2018 ",
"Whitelist/KYC:Whitelist + KYC", "Country:Jersey", "ICO Time:26 Mar 2018 - 23 Apr 2018 ",
"", "ICO Time:01 Mar 2018 - 31 Mar 2018 "
), ICO_Time = c("01 May 2018 - 20 July 2018 ",
"Country:Singapore", "", "Country:UK", "", "Country:Malaysia"
), Whitelist_KYC = c("\nWhitelist/KYC:\nWhitelist + KYC\n", "",
"", "", "", ""), Country = c("Spain", "", "", "", "", ""), Platform = c("Ethereum ",
"Ethereum ",
"Ethereum ",
"Scrypt ",
"Total supply:1,087,156,610.00 FXT", "Ethereum "
), Token_Type = c("ERC20", "ERC20", "ERC20", "Scrypt", "", "ERC20"
), Available_for_sale = c("1,008,000,000 CST", "2,200,000,000 ZPR",
"200,000,000 GNY", "20,000,000 SHARD", "", "Total supply:15,000,000,000.00 SRCOIN"
), Total_Supply_2 = c("1,124,463,121.00 CST", "1,850,000,000.00 ZPR",
"400,000,000.00 GNY", "25,391,088.27 SHARD", "", ""), ICO_Price = c(" 0.05 USD",
" 0.0375 USD", "Accepting:BTC, ETH, LSK, ASCH", " 0.57 USD",
"Accepting:ETH, BTC", " 0.006 USD"), Accepting = c("ETH, BTC, Fiat",
"ETH", "Soft cap:1,000,000 USD", "BTC, ETH, LTC, XRP", "Hard cap:40,000 ETH",
"ETH"), Soft_Cap = c("983,733 EUR", "5 000 ETH", "Hard cap:400,000,000 GNY",
"1,500 ETH", "", ""), Hard_Cap = c("71,400,000 EUR", "48 000 ETH",
"Bonuses:20% sale for first 100,000,000 tokens", "12,500 ETH",
row.names = c(NA, 6L), class = "data.frame")

Answered on 2022-07-13 10:08:06
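OK, I don't know whether you are familiar with the tidyverse and dplyr functions, so my solution uses base R functions (and takes into account your latest update with the full format). Note that some values start with "Bonuses:", but you did not define a column for them, so the code does not handle that prefix.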
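Before running the fix, it can help to list which prefixes in the data have no matching column, since those values would be ignored or overwritten. The following is only a minimal sketch on a three-column excerpt of the question's data (rows 3 and 5 of the dput output); the names `excerpt` and `normalize` are illustrative and not part of the solution, and it assumes that a colon in a cell always marks a "prefix:value" pair.

```r
# Excerpt of the question's data (rows 3 and 5 of the dput output).
# `excerpt` and `normalize` are hypothetical names used for illustration only.
excerpt <- data.frame(
  Accepting = c("Soft cap:1,000,000 USD", "Hard cap:40,000 ETH"),
  Soft_Cap  = c("Hard cap:400,000,000 GNY", ""),
  Hard_Cap  = c("Bonuses:20% sale for first 100,000,000 tokens", ""),
  stringsAsFactors = FALSE
)

# Lower-case and collapse separators so "Hard cap" and "Hard_Cap" compare equal
normalize <- function(s) tolower(gsub("[_/ ]+", " ", trimws(s)))

# Flatten all cells and keep those that look like "prefix:value"
cells <- unlist(excerpt, use.names = FALSE)
has_prefix <- grepl("^[^:]+:", cells)

# Text before the first colon of each such cell
prefixes <- unique(sub(":.*$", "", cells[has_prefix]))

# Prefixes that do not correspond to any column name
unmatched <- prefixes[!normalize(prefixes) %in% normalize(colnames(excerpt))]
print(unmatched)
```

On this excerpt the only prefix left over is "Bonuses". Cells that legitimately contain a colon (a clock time, for instance) would show up here as false positives, so the result is a list to review by hand rather than a definitive answer.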
dat <- structure(list(Presale_Time = c("16 Apr 2018 - 30 Apr 2018 ",
"Whitelist/KYC:Whitelist + KYC", "Country:Jersey", "ICO Time:26 Mar 2018 - 23 Apr 2018 ",
"", "ICO Time:01 Mar 2018 - 31 Mar 2018 "),
ICO_Time = c("01 May 2018 - 20 July 2018 ",
"Country:Singapore", "", "Country:UK", "", "Country:Malaysia"),
Whitelist_KYC = c("\nWhitelist/KYC:\nWhitelist + KYC\n", "",
"", "", "", ""),
Country = c("Spain", "", "", "", "", ""),
Platform = c("Ethereum ",
"Ethereum ",
"Ethereum ",
"Scrypt ",
"Total supply:1,087,156,610.00 FXT", "Ethereum "),
Token_Type = c("ERC20", "ERC20", "ERC20", "Scrypt", "", "ERC20"),
Available_for_sale = c("1,008,000,000 CST", "2,200,000,000 ZPR",
"200,000,000 GNY", "20,000,000 SHARD", "",
"Total supply:15,000,000,000.00 SRCOIN"),
Total_Supply = c("1,124,463,121.00 CST", "1,850,000,000.00 ZPR",
"400,000,000.00 GNY", "25,391,088.27 SHARD", "", ""),
ICO_Price = c(" 0.05 USD",
" 0.0375 USD", "Accepting:BTC, ETH, LSK, ASCH", " 0.57 USD",
"Accepting:ETH, BTC", " 0.006 USD"),
Accepting = c("ETH, BTC, Fiat",
"ETH", "Soft cap:1,000,000 USD", "BTC, ETH, LTC, XRP",
"Hard cap:40,000 ETH",
"ETH"),
Soft_Cap = c("983,733 EUR", "5 000 ETH", "Hard cap:400,000,000 GNY",
"1,500 ETH", "", ""),
Hard_Cap = c("71,400,000 EUR", "48 000 ETH",
"Bonuses:20% sale for first 100,000,000 tokens", "12,500 ETH",
"", "")),
row.names = c(NA, 6L), class = "data.frame")
# Define a vector with the (column) categories
groupNames <- gsub(x = colnames(dat), pattern = "_", replacement = " ")

# Define a function for correcting misplaced values in one row
correcting <- function(x, groupNames){
  # Coerce the input vector to character and normalize "\n" and "/"
  x <- gsub(x = as.character(x), pattern = "\n", replacement = "")
  x <- gsub(x = x, pattern = "/", replacement = " ")

  # Find the misplaced positions, named after the category they belong to
  # (simplify = FALSE keeps a list, so unlist() works for any match counts)
  index <- unlist(sapply(groupNames, grep, x = x, ignore.case = TRUE,
                         simplify = FALSE))

  # If there is any misplaced value...
  if(length(index) > 0){
    oldValues <- x[index]
    x[index] <- NA

    # Move each value to its column and remove the text used as a clue
    # (everything up to and including the first colon)
    x[match(names(index), groupNames)] <- gsub(x = oldValues,
                                               pattern = "^[^:]+:",
                                               replacement = "")
  }
  return(x)
}

# Apply the function by row, transpose, and coerce the output to a data frame
out <- as.data.frame(t(apply(dat, 1, correcting, groupNames = groupNames)))

# Restore the column names
colnames(out) <- colnames(dat)

https://stackoverflow.com/questions/72963549
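The same row-wise idea can also be tried on the small example from the question, where the prefix "Time:" does not match the column name Start_Time exactly. The sketch below is one possible variant, not the method used in the answer: it assumes an explicit prefix-to-column lookup, and the names `dat_toy`, `prefix_to_col`, and `fix_row` are hypothetical.

```r
# Toy data from the question
dat_toy <- data.frame(
  Country    = c("Spain", "Time:16 Mar 2018 - 23 Apr 2018", "USA"),
  Platform   = c("Twitter", "Country:Germany", "Cap:200"),
  Start_Time = c("10 Jun 2018 - 2 Jul 2018", "Platform:Facebook",
                 "Platform:Instagram"),
  Cap        = c("300", "500", "Time:10 Jun 2018 - 2 Jul 2018"),
  stringsAsFactors = FALSE
)

# Assumed lookup: prefix found in a cell -> column the value belongs to
prefix_to_col <- c(Country = "Country", Platform = "Platform",
                   Time = "Start_Time", Cap = "Cap")

# x is one row as a named character vector (this is what apply() passes in)
fix_row <- function(x, lookup) {
  out <- x
  # Text before the first colon, or NA when the cell has no "prefix:" part
  prefix <- ifelse(grepl("^[^:]+:", x), sub(":.*$", "", x), NA)
  moved <- which(prefix %in% names(lookup))
  # Blank the misplaced cells first, then write each value to its target column
  out[moved] <- NA
  out[unname(lookup[prefix[moved]])] <- sub("^[^:]+:", "", x[moved])
  out
}

fixed <- as.data.frame(t(apply(dat_toy, 1, fix_row, lookup = prefix_to_col)),
                       stringsAsFactors = FALSE)
print(fixed)
```

Blanking the misplaced cells before writing the moved values matters: in rows 2 and 3 a cell is both the source of one value and the target of another, so writing first and blanking afterwards would erase the moved value. With an explicit lookup, prefixes that are missing from it (such as "Bonuses" in the full data) are simply left where they are.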