python - HDF5 taking more space than CSV?
Consider the following example:

Prepare the data:

import string
import random
import numpy as np
import pandas as pd

matrix = np.random.random((100, 3000))
my_cols = [random.choice(string.ascii_uppercase) for x in range(matrix.shape[1])]
mydf = pd.DataFrame(matrix, columns=my_cols)
mydf['something'] = 'hello_world'
Set the highest compression possible for HDF5:

store = pd.HDFStore('myfile.h5', complevel=9, complib='bzip2')
store['mydf'] = mydf
store.close()
Save to CSV:
mydf.to_csv('myfile.csv', sep=':')
The result is:

myfile.csv is 5.6 MB big
myfile.h5 is 11 MB big
The difference grows even bigger with larger datasets.

I have tried other compression methods and levels. Is this a bug? (I am using pandas 0.11 and the latest stable versions of HDF5 and Python.)
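For reference, here is a minimal sketch of how such a comparison could be run; the file names and the loop are illustrative rather than from the original post, and 'zlib', 'bzip2' and 'blosc' are compression libraries pandas/PyTables support:

import os
import numpy as np
import pandas as pd

# Illustrative only: compare on-disk sizes for several compression libraries.
matrix = np.random.random((100, 3000))
mydf = pd.DataFrame(matrix)

for complib in ['zlib', 'bzip2', 'blosc']:
    fname = 'myfile_%s.h5' % complib      # hypothetical file name
    store = pd.HDFStore(fname, complevel=9, complib=complib)
    store['mydf'] = mydf
    store.close()
    print(complib, os.path.getsize(fname), 'bytes')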
Copy of my answer from the issue: https://github.com/pydata/pandas/issues/3651
Your sample is really too small. HDF5 has a fair amount of overhead with really small sizes (even 300k entries is on the smaller side). The following is with no compression on either side. Floats are more efficiently represented in binary than as a text representation.
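To illustrate that last point with a rough sketch (not from the original answer): a float64 is always 8 bytes in binary, while its full-precision text form is typically around 19-20 characters.

import numpy as np

x = np.random.random(1000)

# Binary representation: a float64 is always 8 bytes.
binary_bytes = x.nbytes                              # 8000 bytes

# Text representation: full-precision repr plus a separator, as in a CSV cell.
text_bytes = sum(len(repr(v)) + 1 for v in x)        # roughly 2-2.5x larger

print('binary:', binary_bytes, 'bytes')
print('text:  ', text_bytes, 'bytes')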
In addition, HDF5 is row based. You get much better efficiency by having tables that are not too wide but fairly long. (Hence your example is not very efficient in HDF5 at all; store it transposed in this case.)
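A minimal sketch of what storing it transposed could look like, assuming the mydf from the question is in scope (the string column is dropped so the transpose keeps a float64 dtype; the file name is illustrative):

import pandas as pd

# mydf as constructed in the question above; drop the string column so the
# transposed frame stays float64 instead of being upcast to object.
numeric = mydf.drop('something', axis=1)

# 3000 rows x 100 columns instead of 100 rows x 3000 columns.
store = pd.HDFStore('myfile_transposed.h5', complevel=9, complib='bzip2')
store['mydf_t'] = numeric.T
store.close()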
I routinely have tables with 10M+ rows, and query times can be in the ms range. Even the example below is small. Having 10+ GB files is quite common (not to mention the astronomy guys, for whom 10 GB+ is a few seconds of data!).
-rw-rw-r-- 1 jreback users 203200986 May 19 20:58 test.csv
-rw-rw-r-- 1 jreback users  88007312 May 19 20:59 test.h5

In [1]: df = DataFrame(randn(1000000,10))

In [9]: df
Out[9]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 10 columns):
0    1000000  non-null values
1    1000000  non-null values
2    1000000  non-null values
3    1000000  non-null values
4    1000000  non-null values
5    1000000  non-null values
6    1000000  non-null values
7    1000000  non-null values
8    1000000  non-null values
9    1000000  non-null values
dtypes: float64(10)

In [5]: %timeit df.to_csv('test.csv',mode='w')
1 loops, best of 3: 12.7 s per loop

In [6]: %timeit df.to_hdf('test.h5','df',mode='w')
1 loops, best of 3: 825 ms per loop

In [7]: %timeit pd.read_csv('test.csv',index_col=0)
1 loops, best of 3: 2.35 s per loop

In [8]: %timeit pd.read_hdf('test.h5','df')
10 loops, best of 3: 38 ms per loop
I wouldn't worry about the size (I suspect you are not actually worried, but merely interested, which is fine). The point of HDF5 is: disk is cheap, CPU is cheap, but you can't have everything in memory at once, so we optimize by using chunking.
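For completeness, a small sketch of the chunked-access pattern this enables; it assumes the data is written in table format (which is what allows on-disk queries and chunked iteration), and the file and key names are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000000, 10))

# Write in table format so the data can be queried and iterated on disk.
df.to_hdf('test_table.h5', 'df', mode='w', format='table')

# Read back only a slice, without loading the whole file into memory.
subset = pd.read_hdf('test_table.h5', 'df', where='index < 1000')

# Or iterate over the file in chunks.
store = pd.HDFStore('test_table.h5')
for chunk in store.select('df', chunksize=100000):
    pass  # process each chunk here
store.close()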