Python HDF5: concurrency, compression & I/O performance

I have the following questions about HDF5 performance and concurrency:
- Does HDF5 support concurrent write access?
- Concurrency considerations aside, how is HDF5's I/O performance (and do compression rates affect it)?
- Since I use HDF5 from Python, I wonder how its performance compares to SQLite.
Updated to use pandas 0.13.1.
1) No, HDF5 does not support concurrent write access; see http://pandas.pydata.org/pandas-docs/dev/io.html#notes-caveats. There are various ways to work around this, e.g. have different threads/processes write out their computation results, then have a single process combine them.
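A minimal sketch of that single-writer pattern, using stdlib threads and a queue (a plain list stands in for the HDF5 store here; the function names are illustrative, not pandas API):

```python
import threading
import queue

results_q = queue.Queue()
SENTINEL = None
written = []  # stand-in for the HDF5 file; only the writer thread touches it

def worker(n):
    # compute something, then hand the result to the single writer
    results_q.put(n * n)

def writer():
    # the only thread that performs writes, so no concurrent write access
    while True:
        item = results_q.get()
        if item is SENTINEL:
            break
        written.append(item)

w = threading.Thread(target=writer)
w.start()
workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
results_q.put(SENTINEL)  # tell the writer there is nothing more to write
w.join()

print(sorted(written))
```

The same shape works with processes instead of threads (e.g. each process writing its own file and one process concatenating them at the end).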
2) It depends on the type of data you store, how you store it, and how you want to retrieve it, but HDF5 can offer vastly better performance. Storing float data in an HDFStore as a single array, compressed (in other words, not in a format that allows querying), is amazingly fast to store and read. Storing in table format (which slows down write performance) still offers quite good write performance. You can look at detailed comparisons at PyTables (which HDFStore uses under the hood): http://www.pytables.org/. Here's a nice picture:
(And since PyTables 2.3, queries are indexed, so performance is even better.) To answer your question: if you want that kind of performance, HDF5 is the way to go.
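To see why the compression rate matters for I/O, here is a rough stdlib-only sketch (an assumption-level illustration of the general tradeoff, not of PyTables internals): highly compressible data shrinks a lot, so there is much less to push to disk, while incompressible data costs CPU for almost no I/O saving.

```python
import zlib
import os

repetitive = b"0.0,1.0,2.0," * 10000      # compresses very well
random_ish = os.urandom(len(repetitive))  # essentially incompressible

for name, payload in [("repetitive", repetitive), ("random", random_ish)]:
    for level in (1, 9):
        out = zlib.compress(payload, level)
        print(f"{name} level={level}: {len(payload)} -> {len(out)} bytes")
```

With real float arrays the outcome sits between these extremes, which is why benchmarking with your own data (as below) is the honest answer.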
Writing:

In [14]: %timeit test_sql_write(df)
1 loops, best of 3: 6.24 s per loop

In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop

In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop

In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop
Reading:

In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop
And here's the code:

import sqlite3
import os
import pandas as pd
from pandas import DataFrame
from pandas.io import sql
from numpy.random import randn

In [3]: df = DataFrame(randn(1000000, 2), columns=list('AB'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000 non-null values
B    1000000 non-null values
dtypes: float64(2)

def test_sql_write(df):
    if os.path.exists('test.sql'):
        os.remove('test.sql')
    sql_db = sqlite3.connect('test.sql')
    sql.write_frame(df, name='test_table', con=sql_db)
    sql_db.close()

def test_sql_read():
    sql_db = sqlite3.connect('test.sql')
    sql.read_frame("select * from test_table", sql_db)
    sql_db.close()

def test_hdf_fixed_write(df):
    df.to_hdf('test_fixed.hdf', 'test', mode='w')

def test_hdf_fixed_read():
    pd.read_hdf('test_fixed.hdf', 'test')

def test_hdf_table_write(df):
    df.to_hdf('test_table.hdf', 'test', format='table', mode='w')

def test_hdf_table_read():
    pd.read_hdf('test_table.hdf', 'test')

def test_csv_write(df):
    df.to_csv('test.csv', mode='w')

def test_csv_read():
    pd.read_csv('test.csv', index_col=0)
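If you want to reproduce timings like the %timeit calls above outside IPython, the stdlib timeit module works; a small sketch (the benchmarked function here is a trivial CSV-writing placeholder, not the pandas benchmarks above):

```python
import timeit
import csv
import os
import tempfile

# hypothetical stand-in for one of the benchmark functions above
path = os.path.join(tempfile.gettempdir(), "timeit_demo.csv")

def test_csv_write():
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for i in range(1000):
            w.writerow([i, i * 2])

# run the function 5 times and report total / per-run wall time
elapsed = timeit.timeit(test_csv_write, number=5)
print(f"5 runs: {elapsed:.4f} s total, {elapsed / 5:.4f} s per run")
```

Unlike %timeit, timeit.timeit does not pick the number of loops for you, so choose `number` large enough to get a stable measurement.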
Of course, YMMV.