r - Speed up `strsplit` when possible outputs are known

August 15, 2015

I have a large data frame with a factor column that I need to divide into three factor columns by splitting the factor names on a delimiter. Here is my current approach, which is very slow with a large data frame (sometimes several million rows):

    data <- readRDS("data.rds")
    data.df <- reshape2:::melt.array(data)
    head(data.df)
    ##  time location    class replicate population
    ##1    1        1 lide.1.s         1 0.03859605
    ##2    2        1 lide.1.s         1 0.03852957
    ##3    3        1 lide.1.s         1 0.03846853
    ##4    4        1 lide.1.s         1 0.03841260
    ##5    5        1 lide.1.s         1 0.03836147
    ##6    6        1 lide.1.s         1 0.03831485

    Rprof("str.out")
    # Locate the column to split, split it on ".", and splice the pieces back in
    cl <- which(names(data.df)=="class")
    classes <- do.call(rbind, strsplit(as.character(data.df$class), "\\."))
    colnames(classes) <- c("species", "sizeclass", "infected")
    data.df <- cbind(data.df[,1:(cl-1)],classes,data.df[(cl+1):(ncol(data.df))])
    Rprof(NULL)

    head(data.df)
    ##  time location species sizeclass infected replicate population
    ##1    1        1    lide         1        s         1 0.03859605
    ##2    2        1    lide         1        s         1 0.03852957
    ##3    3        1    lide         1        s         1 0.03846853
    ##4    4        1    lide         1        s         1 0.03841260
    ##5    5        1    lide         1        s         1 0.03836147
    ##6    6        1    lide         1        s         1 0.03831485

    summaryRprof("str.out")

    $by.self
                     self.time self.pct total.time total.pct
    "strsplit"            1.34    50.00       1.34     50.00
    "<anonymous>"         1.16    43.28       1.16     43.28
    "do.call"             0.04     1.49       2.54     94.78
    "unique.default"      0.04     1.49       0.04      1.49
    "data.frame"          0.02     0.75       0.12      4.48
    "is.factor"           0.02     0.75       0.02      0.75
    "match"               0.02     0.75       0.02      0.75
    "structure"           0.02     0.75       0.02      0.75
    "unlist"              0.02     0.75       0.02      0.75

    $by.total
                           total.time total.pct self.time self.pct
    "do.call"                    2.54     94.78      0.04     1.49
    "strsplit"                   1.34     50.00      1.34    50.00
    "<anonymous>"                1.16     43.28      1.16    43.28
    "cbind"                      0.14      5.22      0.00     0.00
    "data.frame"                 0.12      4.48      0.02     0.75
    "as.data.frame.matrix"       0.08      2.99      0.00     0.00
    "as.data.frame"              0.08      2.99      0.00     0.00
    "as.factor"                  0.08      2.99      0.00     0.00
    "factor"                     0.06      2.24      0.00     0.00
    "unique.default"             0.04      1.49      0.04     1.49
    "unique"                     0.04      1.49      0.00     0.00
    "is.factor"                  0.02      0.75      0.02     0.75
    "match"                      0.02      0.75      0.02     0.75
    "structure"                  0.02      0.75      0.02     0.75
    "unlist"                     0.02      0.75      0.02     0.75
    "[.data.frame"               0.02      0.75      0.00     0.00
    "["                          0.02      0.75      0.00     0.00

    $sample.interval
    [1] 0.02

    $sampling.time
    [1] 2.68

Is there any way to speed up this operation? Note that there is only a small number (<5) of values in each of the categories "species", "sizeclass", and "infected", and I know what these are in advance.

Notes:

stringr::str_split_fixed performs this task, but it is not any faster.
The data frame is initially generated by calling reshape::melt on an array in which class and its associated levels are a dimension. If there's a faster way to get from there to here, great.
data.rds is at http://dl.getdropbox.com/u/3356641/data.rds

This should offer quite a speed increase:

    library(data.table)
    dt <- data.table(data.df)

    # class is a factor, so convert it to character before splitting
    dt[, c("species", "sizeclass", "infected") :=
           as.list(strsplit(as.character(class), "\\.")[[1]]), by=class]

The reasons for the increase:

data.table pre-allocates memory for columns.
Every column assignment in a data.frame reassigns the entirety of the data (data.table, in contrast, does not).
The by statement lets you run the strsplit task once per unique value rather than once per row.
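The last point, splitting each distinct value only once, does not depend on data.table. As a rough base R sketch of the same idea (not the answer's method): split only the factor's levels, then relabel copies of the factor, relying on `levels<-` merging duplicate labels. The synthetic `data.df` below and its level names are made up for illustration.

```r
# Synthetic stand-in for the question's data (level names are hypothetical)
set.seed(1)
lv <- c("lide.1.s", "lide.1.i", "lide.2.s", "umca.1.s", "umca.2.i")
data.df <- data.frame(
  time  = seq_len(1e4),
  class = factor(sample(lv, 1e4, replace = TRUE), levels = lv)
)

# Split each distinct level once: one row per level, one column per piece
parts <- do.call(rbind, strsplit(levels(data.df$class), ".", fixed = TRUE))
colnames(parts) <- c("species", "sizeclass", "infected")

# Relabel the levels of a copy of the factor for each piece;
# assigning duplicate labels merges the corresponding levels
split_cols <- lapply(colnames(parts), function(nm) {
  f <- data.df$class
  levels(f) <- parts[, nm]
  f
})
names(split_cols) <- colnames(parts)

result <- cbind(data.df["time"], as.data.frame(split_cols))
```

The work done by `strsplit` here scales with the number of levels, not the number of rows, which is why it suits a column with only a handful of known values.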

Here is a nice, quick method for the whole process.

    # Save the new column names as a character vector
    newcols <- c("species", "sizeclass", "infected")

    # Split the string once per unique class, then convert the new columns to factors
    dt[, c(newcols) := as.list(strsplit(as.character(class), "\\.")[[1]]), by=class]
    dt[, c(newcols) := lapply(.SD, factor), .SDcols=newcols]

    # Remove the old column. This is instantaneous.
    dt[, class := NULL]

    ## Have a look:
    dt[, lapply(.SD, class)]
    #       time location replicate population species sizeclass infected
    # 1: integer  integer   integer    numeric  factor    factor   factor

    dt
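For anyone who wants to try the steps above without data.rds, here is a small self-contained sketch on a made-up table. It swaps in data.table's `tstrsplit()`, which splits and transposes in one call, in place of `as.list(strsplit(...)[[1]])`; this assumes a data.table version that provides `tstrsplit()`. The column values are hypothetical.

```r
library(data.table)

# Hypothetical miniature of the question's table
dt <- data.table(
  time       = 1:6,
  class      = factor(c("lide.1.s", "lide.1.s", "lide.2.i",
                        "umca.1.s", "umca.1.s", "lide.2.i")),
  population = c(0.10, 0.11, 0.12, 0.20, 0.21, 0.13)
)
newcols <- c("species", "sizeclass", "infected")

# Within each by-group, class is a single value, so the split runs once per group;
# tstrsplit() returns one vector per piece, which := assigns to the new columns
dt[, (newcols) := tstrsplit(as.character(class), ".", fixed = TRUE), by = class]

# Convert the new character columns to factors, then drop the original column
dt[, (newcols) := lapply(.SD, factor), .SDcols = newcols]
dt[, class := NULL]
```

Using `fixed = TRUE` treats the delimiter as a literal dot, which avoids the regular expression escaping needed with `"\\."`.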
