r - Speed up `strsplit` when possible outputs are known
I have a large data frame with a factor column that I need to divide into three factor columns by splitting the factor names on a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):
    data <- readRDS("data.rds")
    data.df <- reshape2:::melt.array(data)

    head(data.df)
    ##   time location    class replicate population
    ## 1    1        1 lide.1.s         1 0.03859605
    ## 2    2        1 lide.1.s         1 0.03852957
    ## 3    3        1 lide.1.s         1 0.03846853
    ## 4    4        1 lide.1.s         1 0.03841260
    ## 5    5        1 lide.1.s         1 0.03836147
    ## 6    6        1 lide.1.s         1 0.03831485

    Rprof("str.out")
    cl <- which(names(data.df) == "class")
    classes <- do.call(rbind, strsplit(as.character(data.df$class), "\\."))
    colnames(classes) <- c("species", "sizeclass", "infected")
    data.df <- cbind(data.df[, 1:(cl - 1)], classes, data.df[(cl + 1):(ncol(data.df))])
    Rprof(NULL)

    head(data.df)
    ##   time location species sizeclass infected replicate population
    ## 1    1        1    lide         1        s         1 0.03859605
    ## 2    2        1    lide         1        s         1 0.03852957
    ## 3    3        1    lide         1        s         1 0.03846853
    ## 4    4        1    lide         1        s         1 0.03841260
    ## 5    5        1    lide         1        s         1 0.03836147
    ## 6    6        1    lide         1        s         1 0.03831485

    summaryRprof("str.out")

    $by.self
                     self.time self.pct total.time total.pct
    "strsplit"            1.34    50.00       1.34     50.00
    "<anonymous>"         1.16    43.28       1.16     43.28
    "do.call"             0.04     1.49       2.54     94.78
    "unique.default"      0.04     1.49       0.04      1.49
    "data.frame"          0.02     0.75       0.12      4.48
    "is.factor"           0.02     0.75       0.02      0.75
    "match"               0.02     0.75       0.02      0.75
    "structure"           0.02     0.75       0.02      0.75
    "unlist"              0.02     0.75       0.02      0.75

    $by.total
                           total.time total.pct self.time self.pct
    "do.call"                    2.54     94.78      0.04     1.49
    "strsplit"                   1.34     50.00      1.34    50.00
    "<anonymous>"                1.16     43.28      1.16    43.28
    "cbind"                      0.14      5.22      0.00     0.00
    "data.frame"                 0.12      4.48      0.02     0.75
    "as.data.frame.matrix"       0.08      2.99      0.00     0.00
    "as.data.frame"              0.08      2.99      0.00     0.00
    "as.factor"                  0.08      2.99      0.00     0.00
    "factor"                     0.06      2.24      0.00     0.00
    "unique.default"             0.04      1.49      0.04     1.49
    "unique"                     0.04      1.49      0.00     0.00
    "is.factor"                  0.02      0.75      0.02     0.75
    "match"                      0.02      0.75      0.02     0.75
    "structure"                  0.02      0.75      0.02     0.75
    "unlist"                     0.02      0.75      0.02     0.75
    "[.data.frame"               0.02      0.75      0.00     0.00
    "["                          0.02      0.75      0.00     0.00

    $sample.interval
    [1] 0.02

    $sampling.time
    [1] 2.68
Is there a way to speed up this operation? Note that there is only a small number (<5) of each of the categories "species", "sizeclass", and "infected", and I know what these are in advance.
Notes:
- `stringr::str_split_fixed` performs this task, but it is not any faster.
- The data frame is generated by calling `reshape::melt` on an array in which `class` and its associated levels are a dimension. If there is a faster way to get from there to here, great.
- `data.rds` is available at http://dl.getdropbox.com/u/3356641/data.rds
This should offer quite a speed increase:
    library(data.table)
    dt <- data.table(data.df)

    # as.character() is needed because strsplit() does not accept a factor
    dt[, c("species", "sizeclass", "infected") := as.list(strsplit(as.character(class), "\\.")[[1]]), by = class]
The reasons for the increase:
- `data.table` pre-allocates memory for columns. Every column assignment in a data.frame reassigns the entirety of the data (data.table, in contrast, does not).
- The `by` statement lets you run the `strsplit` task once per unique value instead of once per row.
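The once-per-unique-value idea can also be sketched in base R, without data.table: split only the factor levels, then index the resulting small matrix by the factor's integer codes. This is a minimal sketch, assuming every level contains exactly three dot-separated parts. `split_levels` and the toy factor `cls` are hypothetical names used for illustration and are not part of the original question or answer.

```r
# Hypothetical helper: split the levels of a factor once, then expand
# the result to full length using the factor's integer codes.
split_levels <- function(f, newcols, sep = ".") {
  f <- as.factor(f)
  # strsplit() runs on the handful of unique levels only, not on every row
  parts <- do.call(rbind, strsplit(levels(f), sep, fixed = TRUE))
  # Assumption: each level splits into exactly length(newcols) parts
  stopifnot(ncol(parts) == length(newcols))
  codes <- as.integer(f)
  # Look up each row's parts by its level code and store them as factors
  out <- lapply(seq_along(newcols), function(j) factor(parts[codes, j]))
  names(out) <- newcols
  as.data.frame(out)
}

# Toy stand-in for data.df$class
cls <- factor(c("lide.1.s", "lide.1.s", "lide.2.i", "umca.1.s"))
res <- split_levels(cls, c("species", "sizeclass", "infected"))
print(res)
```

With the question's data frame, the result could be attached with something like `cbind(data.df[names(data.df) != "class"], res)`, which appends the new columns at the end rather than in the original position of `class`.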
Here is a nice, quick method for the whole process.
    # Save the new column names as a character vector
    newcols <- c("species", "sizeclass", "infected")

    # Split the string, then convert the new columns to factors
    dt[, c(newcols) := as.list(strsplit(as.character(class), "\\.")[[1]]), by = class]
    dt[, c(newcols) := lapply(.SD, factor), .SDcols = newcols]

    # Remove the old column. This is instantaneous.
    dt[, class := NULL]

    ## Have a look:
    dt[, lapply(.SD, class)]
    #       time location replicate population species sizeclass infected
    # 1: integer  integer   integer    numeric  factor    factor   factor

    dt