Python HDF5: concurrency, compression & I/O performance

I have the following questions about HDF5 performance and concurrency:
- Does HDF5 support concurrent write access?
- Concurrency considerations aside, how is HDF5's I/O performance (and do compression rates affect it)?
- Since I use HDF5 from Python, I wonder how its performance compares to SQLite.
Updated to use pandas 0.13.1.
1) No, HDF5 does not support concurrent write access; see http://pandas.pydata.org/pandas-docs/dev/io.html#notes-caveats. There are various ways to work around this, e.g. have different threads/processes write out their computation results, then have a single process combine them.
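A minimal sketch of that single-writer pattern, using stdlib threads and a queue (a plain list stands in for the HDF5 store here; the function names are illustrative, not pandas API):

```python
import threading
import queue

results_q = queue.Queue()
SENTINEL = None
written = []  # stand-in for the HDF5 file; only the writer thread touches it

def worker(n):
    # compute something, then hand the result to the single writer
    results_q.put(n * n)

def writer():
    # the only thread that performs writes, so no concurrent write access
    while True:
        item = results_q.get()
        if item is SENTINEL:
            break
        written.append(item)

w = threading.Thread(target=writer)
w.start()
workers = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in workers:
    t.start()
for t in workers:
    t.join()
results_q.put(SENTINEL)  # tell the writer there is nothing more to write
w.join()

print(sorted(written))
```

The same shape works with processes instead of threads (e.g. each process writing its own file and one process concatenating them at the end).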
2) It depends on the type of data you store, how you store it, and how you want to retrieve it, but HDF5 can offer vastly better performance. Storing float data in an HDFStore as a single array, compressed (in other words, not in a format that allows querying), is amazingly fast to store and read. Storing in table format (which slows down write performance) still offers quite good write performance. You can look at detailed comparisons at PyTables (which HDFStore uses under the hood): http://www.pytables.org/. Here's a nice picture:
(And since PyTables 2.3, queries are indexed, so performance is even better.) To answer your question: if you want that kind of performance, HDF5 is the way to go.
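To see why the compression rate matters for I/O, here is a rough stdlib-only sketch (an assumption-level illustration of the general tradeoff, not of PyTables internals): highly compressible data shrinks a lot, so there is much less to push to disk, while incompressible data costs CPU for almost no I/O saving.

```python
import zlib
import os

repetitive = b"0.0,1.0,2.0," * 10000      # compresses very well
random_ish = os.urandom(len(repetitive))  # essentially incompressible

for name, payload in [("repetitive", repetitive), ("random", random_ish)]:
    for level in (1, 9):
        out = zlib.compress(payload, level)
        print(f"{name} level={level}: {len(payload)} -> {len(out)} bytes")
```

With real float arrays the outcome sits between these extremes, which is why benchmarking with your own data (as below) is the honest answer.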
Writing:

In [14]: %timeit test_sql_write(df)
1 loops, best of 3: 6.24 s per loop

In [15]: %timeit test_hdf_fixed_write(df)
1 loops, best of 3: 237 ms per loop

In [16]: %timeit test_hdf_table_write(df)
1 loops, best of 3: 901 ms per loop

In [17]: %timeit test_csv_write(df)
1 loops, best of 3: 3.44 s per loop
Reading:

In [18]: %timeit test_sql_read()
1 loops, best of 3: 766 ms per loop

In [19]: %timeit test_hdf_fixed_read()
10 loops, best of 3: 19.1 ms per loop

In [20]: %timeit test_hdf_table_read()
10 loops, best of 3: 39 ms per loop

In [22]: %timeit test_csv_read()
1 loops, best of 3: 620 ms per loop
And here's the code:

import sqlite3
import os
import pandas as pd
from pandas import DataFrame
from pandas.io import sql
from numpy.random import randn

In [3]: df = DataFrame(randn(1000000, 2), columns=list('AB'))
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A    1000000 non-null values
B    1000000 non-null values
dtypes: float64(2)

def test_sql_write(df):
    if os.path.exists('test.sql'):
        os.remove('test.sql')
    sql_db = sqlite3.connect('test.sql')
    sql.write_frame(df, name='test_table', con=sql_db)
    sql_db.close()

def test_sql_read():
    sql_db = sqlite3.connect('test.sql')
    sql.read_frame("select * from test_table", sql_db)
    sql_db.close()

def test_hdf_fixed_write(df):
    df.to_hdf('test_fixed.hdf', 'test', mode='w')

def test_hdf_fixed_read():
    pd.read_hdf('test_fixed.hdf', 'test')

def test_hdf_table_write(df):
    df.to_hdf('test_table.hdf', 'test', format='table', mode='w')

def test_hdf_table_read():
    pd.read_hdf('test_table.hdf', 'test')

def test_csv_write(df):
    df.to_csv('test.csv', mode='w')

def test_csv_read():
    pd.read_csv('test.csv', index_col=0)
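If you want to reproduce timings like the %timeit calls above outside IPython, the stdlib timeit module works; a small sketch (the benchmarked function here is a trivial CSV-writing placeholder, not the pandas benchmarks above):

```python
import timeit
import csv
import os
import tempfile

# hypothetical stand-in for one of the benchmark functions above
path = os.path.join(tempfile.gettempdir(), "timeit_demo.csv")

def test_csv_write():
    with open(path, "w", newline="") as f:
        w = csv.writer(f)
        for i in range(1000):
            w.writerow([i, i * 2])

# run the function 5 times and report total / per-run wall time
elapsed = timeit.timeit(test_csv_write, number=5)
print(f"5 runs: {elapsed:.4f} s total, {elapsed / 5:.4f} s per run")
```

Unlike %timeit, timeit.timeit does not pick the number of loops for you, so choose `number` large enough to get a stable measurement.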
Of course, YMMV.