r - Speed up `strsplit` when the possible outputs are known

I have a large data frame with a factor column that I need to divide into three factor columns by splitting the factor names on a delimiter. Here is my current approach, which is slow with a large data frame (sometimes several million rows):

    data <- readRDS("data.rds")
    data.df <- reshape2:::melt.array(data)
    head(data.df)
    ##  time location    class replicate population
    ##1    1        1 lide.1.s         1 0.03859605
    ##2    2        1 lide.1.s         1 0.03852957
    ##3    3        1 lide.1.s         1 0.03846853
    ##4    4        1 lide.1.s         1 0.03841260
    ##5    5        1 lide.1.s         1 0.03836147
    ##6    6        1 lide.1.s         1 0.03831485

    Rprof("str.out")
    cl <- which(names(data.df)=="class")
    classes <- do.call(rbind, strsplit(as.character(data.df$class), "\\."))
    colnames(classes) <- c("species", "sizeclass", "infected")
    data.df <- cbind(data.df[,1:(cl-1)],classes,data.df[(cl+1):(ncol(data.df))])
    Rprof(NULL)

    head(data.df)
    ##  time location species sizeclass infected replicate population
    ##1    1        1    lide         1        s         1 0.03859605
    ##2    2        1    lide         1        s         1 0.03852957
    ##3    3        1    lide         1        s         1 0.03846853
    ##4    4        1    lide         1        s         1 0.03841260
    ##5    5        1    lide         1        s         1 0.03836147
    ##6    6        1    lide         1        s         1 0.03831485

    summaryRprof("str.out")

    $by.self
                     self.time self.pct total.time total.pct
    "strsplit"            1.34    50.00       1.34     50.00
    "<anonymous>"         1.16    43.28       1.16     43.28
    "do.call"             0.04     1.49       2.54     94.78
    "unique.default"      0.04     1.49       0.04      1.49
    "data.frame"          0.02     0.75       0.12      4.48
    "is.factor"           0.02     0.75       0.02      0.75
    "match"               0.02     0.75       0.02      0.75
    "structure"           0.02     0.75       0.02      0.75
    "unlist"              0.02     0.75       0.02      0.75

    $by.total
                           total.time total.pct self.time self.pct
    "do.call"                    2.54     94.78      0.04     1.49
    "strsplit"                   1.34     50.00      1.34    50.00
    "<anonymous>"                1.16     43.28      1.16    43.28
    "cbind"                      0.14      5.22      0.00     0.00
    "data.frame"                 0.12      4.48      0.02     0.75
    "as.data.frame.matrix"       0.08      2.99      0.00     0.00
    "as.data.frame"              0.08      2.99      0.00     0.00
    "as.factor"                  0.08      2.99      0.00     0.00
    "factor"                     0.06      2.24      0.00     0.00
    "unique.default"             0.04      1.49      0.04     1.49
    "unique"                     0.04      1.49      0.00     0.00
    "is.factor"                  0.02      0.75      0.02     0.75
    "match"                      0.02      0.75      0.02     0.75
    "structure"                  0.02      0.75      0.02     0.75
    "unlist"                     0.02      0.75      0.02     0.75
    "[.data.frame"               0.02      0.75      0.00     0.00
    "["                          0.02      0.75      0.00     0.00

    $sample.interval
    [1] 0.02

    $sampling.time
    [1] 2.68

Is there a way to speed up this operation? Note that there is a small number (<5) of values for each of the categories "species", "sizeclass", and "infected", and I know what these are in advance.


  • stringr::str_split_fixed performs this task, but it is not any faster.
  • The data frame is generated by calling reshape::melt on an array in which class and its associated levels are one dimension. If there's a faster way to get from there to here, great.
  • data.rds is at http://dl.getdropbox.com/u/3356641/data.rds

This should offer quite a speed increase:

    library(data.table)
    dt <- data.table(data.df)

    # class is a factor, so convert it to character before calling strsplit
    dt[, c("species", "sizeclass", "infected") :=
           as.list(strsplit(as.character(class), "\\.")[[1]]), by=class ]

The reasons for the increase:

  1. data.table pre-allocates memory for columns.
  2. Every column assignment in a data.frame reassigns the entirety of the data (data.table, in contrast, does not).
  3. The by statement lets you run the strsplit task once per unique value.
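
The third point can also be illustrated without data.table. Here is a minimal base R sketch, assuming a small made-up data frame (toy.df and its values are hypothetical, not the data from the question): split only the factor's levels, then expand the result to full length by indexing with the factor's integer codes.

```r
# Toy stand-in for the melted data; names and values are made up for illustration.
toy.df <- data.frame(
  time  = 1:6,
  class = factor(c("lide.1.s", "lide.1.i", "lide.2.s",
                   "lide.1.s", "lide.2.s", "lide.1.i"))
)

# Split each distinct level once (3 strings here instead of 6 rows).
lv    <- levels(toy.df$class)
parts <- do.call(rbind, strsplit(lv, ".", fixed = TRUE))
colnames(parts) <- c("species", "sizeclass", "infected")

# Expand back to one row per observation using the factor's integer codes.
codes <- as.integer(toy.df$class)
split.df <- data.frame(
  species   = factor(parts[codes, "species"]),
  sizeclass = factor(parts[codes, "sizeclass"]),
  infected  = factor(parts[codes, "infected"])
)

result <- cbind(toy.df["time"], split.df)
```

With fewer than five values per category there can only be a few dozen distinct levels, so the strsplit call itself should cost almost nothing next to millions of rows; the remaining work is the matrix indexing.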

Here is a nice quick method for the whole process.

    # save the new col names as a character vector
    newcols <- c("species", "sizeclass", "infected")

    # split the string, then convert the new cols to factor columns
    dt[, c(newcols) := as.list(strsplit(as.character(class), "\\.")[[1]]), by=class ]
    dt[, c(newcols) := lapply(.SD, factor), .SDcols=newcols]

    # remove the old column. This is instantaneous.
    dt[, class := NULL]

    ## have a look:
    dt[, lapply(.SD, class)]
    #       time location replicate population species sizeclass infected
    # 1: integer  integer   integer    numeric  factor    factor   factor

    dt
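
If you are on a more recent data.table (1.9.6 or later, as far as I know), `tstrsplit` is a built-in alternative that may be worth timing against the `by=` approach. This is a minimal sketch on made-up data (toy.dt and its values are hypothetical). It splits every row rather than once per group, so whether it is faster on the real data is something to measure, not something I can promise.

```r
library(data.table)

# Toy stand-in for the melted data; values are made up for illustration.
toy.dt <- data.table(
  time  = 1:4,
  class = factor(c("lide.1.s", "lide.1.i", "lide.2.s", "lide.1.s"))
)
newcols <- c("species", "sizeclass", "infected")

# tstrsplit() splits the strings and transposes the result into one
# list element per piece, which := assigns to the new columns.
toy.dt[, (newcols) := tstrsplit(as.character(class), ".", fixed = TRUE)]

# Convert the new character columns to factors, then drop the original.
toy.dt[, (newcols) := lapply(.SD, factor), .SDcols = newcols]
toy.dt[, class := NULL]
```

Using `fixed = TRUE` treats the delimiter as a literal dot, so no regular expression escaping is needed.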

