python - HDF5 taking more space than CSV?
Consider the following example:

Prepare the data:

import string
import random
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'
Set the highest compression possible for HDF5:

store = pd.HDFStore('myfile.h5', complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()
Save to CSV:
mydf.to_csv('myfile.csv', sep=':')
The result is:

myfile.csv is 5.6 MB big
myfile.h5 is 11 MB big
The difference grows even bigger with larger datasets.

I have tried other compression methods and levels. Is this a bug? (I am using pandas 0.11 and the latest stable versions of HDF5 and Python.)
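For reference, here is a minimal sketch of how such a comparison could be run; the file names and the loop are illustrative rather than from the original post, and 'zlib', 'bzip2' and 'blosc' are compression libraries pandas/PyTables support:

import os
import numpy as np
import pandas as pd

# Illustrative only: compare on-disk sizes for several compression libraries.
matrix = np.random.random((100, 3000))
mydf = pd.DataFrame(matrix)

for complib in ['zlib', 'bzip2', 'blosc']:
    fname = 'myfile_%s.h5' % complib      # hypothetical file name
    store = pd.HDFStore(fname, complevel=9, complib=complib)
    store['mydf'] = mydf
    store.close()
    print(complib, os.path.getsize(fname), 'bytes')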
Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651
Your sample is really too small. HDF5 has a fair amount of overhead with really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are more efficiently represented in binary than as a text representation.
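To illustrate that last point with a rough sketch (not from the original answer): a float64 is always 8 bytes in binary, while its full-precision text form is typically around 19-20 characters.

import numpy as np

x = np.random.random(1000)

# Binary representation: a float64 is always 8 bytes.
binary_bytes = x.nbytes                              # 8000 bytes

# Text representation: full-precision repr plus a separator, as in a CSV cell.
text_bytes = sum(len(repr(v)) + 1 for v in x)        # roughly 2-2.5x larger

print('binary:', binary_bytes, 'bytes')
print('text:  ', text_bytes, 'bytes')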
In addition, HDF5 is row based. You get much better efficiency by having tables that are not too wide but fairly long. (Hence your example is not very efficient in HDF5 at all; store it transposed in this case.)
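A minimal sketch of what storing it transposed could look like, assuming the mydf from the question is in scope (the string column is dropped so the transpose keeps a float64 dtype; the file name is illustrative):

import pandas as pd

# mydf as constructed in the question above; drop the string column so the
# transposed frame stays float64 instead of being upcast to object.
numeric = mydf.drop('something', axis=1)

# 3000 rows x 100 columns instead of 100 rows x 3000 columns.
store = pd.HDFStore('myfile_transposed.h5', complevel=9, complib='bzip2')
store['mydf_t'] = numeric.T
store.close()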
I routinely have tables with 10M+ rows, and query times can be in the ms range. Even the example below is small. Having 10+ GB files is quite common (not to mention the astronomy guys, for whom 10 GB+ is a few seconds of data!).
-rw-rw-r-- 1 jreback users 203200986 May 19 20:58 test.csv
-rw-rw-r-- 1 jreback users  88007312 May 19 20:59 test.h5

In [1]: df = DataFrame(randn(1000000,10))

In [9]: df
Out[9]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [5]: %timeit df.to_csv('test.csv',mode='w')
1 loops, best of 3: 12.7 s per loop

In [6]: %timeit df.to_hdf('test.h5','df',mode='w')
1 loops, best of 3: 825 ms per loop

In [7]: %timeit pd.read_csv('test.csv',index_col=0)
1 loops, best of 3: 2.35 s per loop

In [8]: %timeit pd.read_hdf('test.h5','df')
10 loops, best of 3: 38 ms per loop
I wouldn't worry about the size (I suspect you are not actually worried, but merely interested, which is fine). The point of HDF5 is: disk is cheap, CPU is cheap, but you can't have everything in memory at once, so we optimize by using chunking.
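For completeness, a small sketch of the chunked-access pattern this enables; it assumes the data is written in table format (which is what allows on-disk queries and chunked iteration), and the file and key names are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 10))

# Write in table format so the data can be queried and iterated on disk.
df.to_hdf('test_table.h5', 'df', mode='w', format='table')

# Read back only a slice, without loading the whole file into memory.
subset = pd.read_hdf('test_table.h5', 'df', where='index < 1000')

# Or iterate over the file in chunks.
store = pd.HDFStore('test_table.h5')
for chunk in store.select('df', chunksize=100000):
    pass  # process each chunk here
store.close()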