R: Working with large lists that become too big for RAM when operated on
Short of working on a machine with more RAM, how can I work with large lists in R? For example, can I put them on disk and then work on sections of them?
Here's some code to generate the kind of lists I'm using:
n = 50; i = 100
word <- vector(mode = "character", length = n)  # holds strings, so "character" not "integer"
for (j in 1:n) {  # use j as the index so we don't clobber i, the target list length
  word[j] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE), collapse = '')
}
dat <- data.frame(word = word, counts = sample(1:50, n, replace = TRUE))
dat_list <- lapply(1:i, function(i) dat)
In my actual use case each data frame in the list is unique, unlike this quick example. I'm aiming for n = 4000 and i = 100,000.
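For a sense of scale, here is a rough back-of-envelope estimate (an illustration only, assuming each unique data frame is roughly a 4000-row version of the toy dat above; real sizes will differ):

# ballpark only: scale the toy data frame's size to 4000 rows, then
# multiply by the 100,000 list elements
bytes_one <- as.numeric(object.size(dat)) * (4000 / nrow(dat))
bytes_all <- bytes_one * 100000
bytes_all / 1024^3  # approximate total in gigabytes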
Here is one example of what I want to do with the list of data frames:
func <- function(x) {rep(x$word, times = x$counts)}
la <- lapply(dat_list, func)
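As a quick sanity check on the toy data (an added illustration, not part of the original post): since every word is repeated counts times, each element of la should have length equal to the column sum.

# each expanded vector's length should equal the sum of the counts
length(la[[1]]) == sum(dat$counts)  # TRUE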
With my actual use case this runs for a few hours, fills up the RAM and most of the swap, and then RStudio freezes and shows a message with a bomb on it (RStudio was forced to terminate due to an error in the R session).
I see that bigmemory is limited to matrices, and ff doesn't seem to handle lists. What are the other options? If sqldf or a related out-of-memory method is possible here, how might I get started? I can't get enough out of the documentation to make progress, and would be grateful for any pointers. Note that instructions to "buy more RAM" will be ignored! This is for a package that I'm hoping will be suitable for average desktop computers (i.e. undergrad computer labs).
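One way to get started on the "put them on disk and work on sections" idea is sketched below; this is a minimal illustration in plain base R (not any of the packages above), assuming one .rds file per data frame is workable:

# write each data frame to its own .rds file, then process them one
# at a time so only a single data frame is ever held in RAM
dir.create("dat_chunks", showWarnings = FALSE)
for (k in seq_along(dat_list)) {
  saveRDS(dat_list[[k]], file.path("dat_chunks", sprintf("dat_%06d.rds", k)))
}
rm(dat_list); invisible(gc())  # drop the in-RAM copy

for (f in list.files("dat_chunks", full.names = TRUE)) {
  x <- readRDS(f)                       # load one chunk
  res <- rep(x$word, times = x$counts)  # the operation from above
  out <- file.path(dirname(f), sub("^dat_", "res_", basename(f)))
  saveRDS(res, out)                     # write the result back to disk
  rm(x, res)                            # release before the next chunk
}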
UPDATE: Following on from the helpful comments from SimonO101 and Ari, here's some benchmarking comparing data frames and data.tables, loops and lapply, and with and without gc:
# self-contained speed test of untable
n = 50; i = 100
word <- vector(mode = "character", length = n)
for (j in 1:n) {
  word[j] <- paste(sample(c(rep(0:9, each = 5), LETTERS, letters), 5, replace = TRUE), collapse = '')
}

# data table
library(data.table)
dat_dt <- data.table(word = word, counts = sample(1:50, n, replace = TRUE))
dat_list_dt <- lapply(1:i, function(i) dat_dt)

# data frame
dat_df <- data.frame(word = word, counts = sample(1:50, n, replace = TRUE))
dat_list_df <- lapply(1:i, function(i) dat_df)

# increase object size
y <- 10
dt <- c(rep(dat_list_dt, y))
df <- c(rep(dat_list_df, y))

# untable
untable <- function(x) rep(x$word, times = x$counts)

# preallocate objects for the loops to fill
df1 <- vector("list", length = length(df))
dt1 <- vector("list", length = length(dt))
df3 <- vector("list", length = length(df))
dt3 <- vector("list", length = length(dt))

# functions for lapply, with explicit garbage collection;
# return the untable result (not gc()'s output) so lapply stores it
df_untable_gc <- function(x) { out <- untable(df[[x]]); if (x %% 10) invisible(gc()); out }
dt_untable_gc <- function(x) { out <- untable(dt[[x]]); if (x %% 10) invisible(gc()); out }

# speedtests
library(microbenchmark)
microbenchmark(
  for (i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i %% 10) invisible(gc()) },
  for (i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i %% 10) invisible(gc()) },
  df2 <- lapply(1:length(df), function(i) df_untable_gc(i)),
  dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)),
  for (i in 1:length(df)) { df3[[i]] <- untable(df[[i]]) },
  for (i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]]) },
  df4 <- lapply(1:length(df), function(i) untable(df[[i]])),
  dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]])),
  times = 10)
And here are the results. Without explicit garbage collection, data.table is much faster and lapply is slightly faster than a loop. With explicit garbage collection (as I think SimonO101 might be suggesting) they are all much the same speed, a lot slower! I know using gc is a bit controversial and probably not helpful in this case, but I'll give it a shot with my actual use case and see if it makes any difference. Of course I don't have any data on memory use for these functions, which is really my main concern. It seems there is no function for memory benchmarking equivalent to the timing functions (for Windows, anyway).
Unit: milliseconds
                                                                                expr          min           lq       median           uq         max neval
for (i in 1:length(df)) { df1[[i]] <- untable(df[[i]]); if (i%%10) invisible(gc()) } 37436.433962 37955.714144 38663.120340 39142.350799 39651.88118    10
for (i in 1:length(dt)) { dt1[[i]] <- untable(dt[[i]]); if (i%%10) invisible(gc()) } 37354.456809 38493.268121 38636.424561 38914.726388 39111.20439    10
                          df2 <- lapply(1:length(df), function(i) df_untable_gc(i)) 36959.630896 37924.878498 38314.428435 38636.894810 39537.31465    10
                          dt2 <- lapply(1:length(dt), function(i) dt_untable_gc(i)) 36917.765453 37735.186358 38106.134494 38563.217919 38751.71627    10
                               for (i in 1:length(df)) { df3[[i]] <- untable(df[[i]]) }    28.200943    29.221901    30.205502    31.616041    34.32218    10
                               for (i in 1:length(dt)) { dt3[[i]] <- untable(dt[[i]]) }    10.230519    10.418947    10.665668    12.194847    14.58611    10
                                df4 <- lapply(1:length(df), function(i) untable(df[[i]]))    26.058039    27.103217    27.560739    28.189448    30.62751    10
                                dt4 <- lapply(1:length(dt), function(i) untable(dt[[i]]))     8.835168     8.904956     9.214692     9.485018    12.93788    10
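As a rough option for the memory-measurement gap mentioned above (an added suggestion, not from the original post): base R's gc() keeps "max used" counters that can be reset before an operation and inspected afterwards, and this works on Windows too.

# reset the peak-memory counters, run the operation, then inspect the
# "max used (Mb)" column of gc()'s output for an approximate peak
invisible(gc(reset = TRUE))
la <- lapply(dat_list_df, untable)
gc()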
If you are going to be using large data you could use the h5r package to write HDF5 files, writing to and reading from the hard drive on the fly instead of using RAM. I have not used it, so I can be of little help on its general usage; I mention it because I think there's no tutorial for it. I got this idea from thinking about pytables. Not sure if this solution is appropriate for you.
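To make the HDF5 idea concrete, here is a minimal sketch using the Bioconductor package rhdf5 rather than h5r (a deliberate substitution, since rhdf5's API is well documented); treat it as an illustration of the approach, not the answerer's exact method:

# install once with: install.packages("BiocManager"); BiocManager::install("rhdf5")
library(rhdf5)

h5createFile("dat.h5")                # one file can hold many datasets
h5write(dat, "dat.h5", "dat_001")     # write a data frame to disk
x <- h5read("dat.h5", "dat_001")      # read it back only when needed
res <- rep(x$word, times = x$counts)  # operate on one section at a time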