R: Working with large lists that become too big for RAM when operated on
Short of working on a machine with more RAM, how can I work with large lists in R? For example, can I put them on disk and then work on sections of them?
Here's some code to generate the kind of lists I'm using:
n = 50; i = 100
word <- vector(mode = "character", length = n)  # holds strings, so "character" not "integer"
for (j in 1:n) {  # use j as the index so we don't clobber i, the target list length
  word[j] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE), collapse = '')
}
dat <- data.frame(word = word, counts = sample(1:50, n, replace = TRUE))
dat_list <- lapply(1:i, function(i) dat)
In my actual use case each data frame in the list is unique, unlike this quick example. I'm aiming for n = 4000 and i = 100,000.
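For a sense of scale, here is a rough back-of-envelope estimate (an illustration only, assuming each unique data frame is roughly a 4000-row version of the toy dat above; real sizes will differ):

# ballpark only: scale the toy data frame's size to 4000 rows, then
# multiply by the 100,000 list elements
bytes_one <- as.numeric(object.size(dat)) * (4000 / nrow(dat))
bytes_all <- bytes_one * 100000
bytes_all / 1024^3  # approximate total in gigabytes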
Here is one example of what I want to do with the list of data frames:
func <- function(x) {rep(x$word, times = x$counts)}
la <- lapply(dat_list, func)
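As a quick sanity check on the toy data (an added illustration, not part of the original post): since every word is repeated counts times, each element of la should have length equal to the column sum.

# each expanded vector's length should equal the sum of the counts
length(la[[1]]) == sum(dat$counts)  # TRUE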
With my actual use case this runs for a few hours, fills up the RAM and most of the swap, and then RStudio freezes and shows a message with a bomb on it (RStudio was forced to terminate due to an error in the R session).
I see that bigmemory is limited to matrices, and ff doesn't seem to handle lists. What are the other options? If sqldf or a related out-of-memory method is possible here, how might I get started? I can't get enough out of the documentation to make progress, and would be grateful for any pointers. Note that instructions to "buy more RAM" will be ignored! This is for a package that I'm hoping will be suitable for average desktop computers (i.e. undergrad computer labs).
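One way to get started on the "put them on disk and work on sections" idea is sketched below; this is a minimal illustration in plain base R (not any of the packages above), assuming one .rds file per data frame is workable:

# write each data frame to its own .rds file, then process them one
# at a time so only a single data frame is ever held in RAM
dir.create("dat_chunks", showWarnings = FALSE)
for (k in seq_along(dat_list)) {
  saveRDS(dat_list[[k]], file.path("dat_chunks", sprintf("dat_%06d.rds", k)))
}
rm(dat_list); invisible(gc())  # drop the in-RAM copy

for (f in list.files("dat_chunks", full.names = TRUE)) {
  x <- readRDS(f)                       # load one chunk
  res <- rep(x$word, times = x$counts)  # the operation from above
  out <- file.path(dirname(f), sub("^dat_", "res_", basename(f)))
  saveRDS(res, out)                     # write the result back to disk
  rm(x, res)                            # release before the next chunk
}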
UPDATE: Following on from the helpful comments from SimonO101 and Ari, here's some benchmarking comparing data frames and data.tables, loops and lapply, and with and without gc:
# self-contained speed test of untable
n = 50; i = 100
word <- vector(mode = "character", length = n)
for (j in 1:n) {
  word[j] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE), collapse = '')
}

# data table
library(data.table)
dat_dt <- data.table(word = word, counts = sample(1:50, n, replace = TRUE))
dat_list_dt <- lapply(1:i, function(i) dat_dt)

# data frame
dat_df <- data.frame(word = word, counts = sample(1:50, n, replace = TRUE))
dat_list_df <- lapply(1:i, function(i) dat_df)

# increase object size
y <- 10
dt <- c(rep(dat_list_dt, y))
df <- c(rep(dat_list_df, y))

# untable
untable <- function(x) rep(x$word, times = x$counts)

# preallocate objects for the loops to fill
df1 <- vector("list", length = length(df))
dt1 <- vector("list", length = length(dt))
df3 <- vector("list", length = length(df))
dt3 <- vector("list", length = length(dt))

# functions for lapply, with explicit garbage collection;
# return the untable result (not gc()'s output) so lapply stores it
df_untable_gc <- function(x) { out <- untable(df[[x]]); if (x %% 10) invisible(gc()); out }
dt_untable_gc <- function(x) { out <- untable(dt[[x]]); if (x %% 10) invisible(gc()); out }

# speedtests
library(microbenchmark)
microbenchmark(
  for (i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i %% 10) invisible(gc()) },
  for (i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i %% 10) invisible(gc()) },
  df2 <- lapply(1:length(df), function(i) df_untable_gc(i)),
  dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)),
  for (i in 1:length(df)) { df3[[i]] <- untable(df[[i]]) },
  for (i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]]) },
  df4 <- lapply(1:length(df), function(i) untable(df[[i]])),
  dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]])),
  times = 10)
And here are the results. Without explicit garbage collection, data.table is much faster and lapply is slightly faster than a loop. With explicit garbage collection (as I think SimonO101 might be suggesting) they are all much the same speed, a lot slower! I know using gc is a bit controversial and probably not helpful in this case, but I'll give it a shot with my actual use case and see if it makes any difference. Of course I don't have any data on memory use for these functions, which is really my main concern. It seems there is no function for memory benchmarking equivalent to the timing functions (for Windows, anyway).
Unit: milliseconds
                                                                                expr          min           lq       median           uq         max neval
for (i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i%%10) invisible(gc()) } 37436.433962 37955.714144 38663.120340 39142.350799 39651.88118    10
for (i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i%%10) invisible(gc()) } 37354.456809 38493.268121 38636.424561 38914.726388 39111.20439    10
                          df2 <- lapply(1:length(df), function(i) df_untable_gc(i)) 36959.630896 37924.878498 38314.428435 38636.894810 39537.31465    10
                          dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)) 36917.765453 37735.186358 38106.134494 38563.217919 38751.71627    10
                               for (i in 1:length(df)) { df3[[i]] <- untable(df[[i]]) }    28.200943    29.221901    30.205502    31.616041    34.32218    10
                               for (i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]]) }    10.230519    10.418947    10.665668    12.194847    14.58611    10
                                df4 <- lapply(1:length(df), function(i) untable(df[[i]]))    26.058039    27.103217    27.560739    28.189448    30.62751    10
                                dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]))     8.835168     8.904956     9.214692     9.485018    12.93788    10
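As a rough option for the memory-measurement gap mentioned above (an added suggestion, not from the original post): base R's gc() keeps "max used" counters that can be reset before an operation and inspected afterwards, and this works on Windows too.

# reset the peak-memory counters, run the operation, then inspect the
# "max used (Mb)" column of gc()'s output for an approximate peak
invisible(gc(reset = TRUE))
la <- lapply(dat_list_df, untable)
gc()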
If you are going to be using large data you could use the h5r package to write HDF5 files, writing to and reading from the hard drive on the fly instead of using RAM. I have not used it, so I can be of little help on its general usage; I mention it because I think there's no tutorial for it. I got this idea from thinking about pytables. Not sure if this solution is appropriate for you.
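To make the HDF5 idea concrete, here is a minimal sketch using the Bioconductor package rhdf5 rather than h5r (a deliberate substitution, since rhdf5's API is well documented); treat it as an illustration of the approach, not the answerer's exact method:

# install once with: install.packages("BiocManager"); BiocManager::install("rhdf5")
library(rhdf5)

h5createFile("dat.h5")                # one file can hold many datasets
h5write(dat, "dat.h5", "dat_001")     # write a data frame to disk
x <- h5read("dat.h5", "dat_001")      # read it back only when needed
res <- rep(x$word, times = x$counts)  # operate on one section at a time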