python - What is the performance impact of non-unique indexes in pandas?


From the pandas documentation, I've gathered that unique-valued indices make certain operations efficient, and that non-unique indices are occasionally tolerated.

From the outside, it doesn't look like non-unique indices are taken advantage of in any way. For example, the following ix query is slow enough that it seems to be scanning the entire dataframe:

    In [23]: import numpy as np
    In [24]: import pandas as pd
    In [25]: x = np.random.randint(0, 10**7, 10**7)
    In [26]: df1 = pd.DataFrame({'x': x})
    In [27]: df2 = df1.set_index('x', drop=False)
    In [28]: %timeit df2.ix[0]
    1 loops, best of 3: 402 ms per loop
    In [29]: %timeit df1.ix[0]
    10000 loops, best of 3: 123 µs per loop

(I realize the two ix queries don't return the same thing; it's just an example showing that calls to ix on a non-unique index appear much slower.)

Is there any way to coax pandas into using faster lookup methods, such as binary search, on non-unique and/or sorted indices?

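When the index is unique, pandas uses a hashtable to map each key to its value, so a lookup is O(1). When the index is non-unique but sorted, pandas uses binary search, which is O(log N). When the index is non-unique and in random order, pandas has to check every key in the index, which is O(N).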
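To see which of these three cases applies to a given index, you can inspect `Index.is_unique` and `Index.is_monotonic_increasing`. The sketch below is only an illustration: the three small indices and the `describe` helper are made up for this example. It also calls `Index.get_loc`, which is documented to return an integer for a unique index, a slice for a monotonic index, and a boolean mask otherwise; the exact return types may vary with the pandas version.

```python
import numpy as np
import pandas as pd

# Three small illustrative indices (values are made up for this sketch).
unique_idx = pd.Index([3, 1, 2])
sorted_dup_idx = pd.Index([1, 1, 2, 3, 3])
unsorted_dup_idx = pd.Index([3, 1, 3, 2, 1])


def describe(idx):
    """Return (is_unique, is_sorted) for an index; hypothetical helper."""
    return bool(idx.is_unique), bool(idx.is_monotonic_increasing)


cases = [
    ("unique", unique_idx),
    ("sorted, non-unique", sorted_dup_idx),
    ("unsorted, non-unique", unsorted_dup_idx),
]

for name, idx in cases:
    # get_loc(3) shows what kind of locator pandas hands back for key 3.
    loc = idx.get_loc(3)
    print(name, describe(idx), type(loc).__name__)
```

A slice result means the matching rows are contiguous and can be found by their boundaries alone, whereas a boolean mask has one entry per row of the index.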

You can call the sort_index method to get the faster lookup:

    import numpy as np
    import pandas as pd

    x = np.random.randint(0, 200, 10**6)
    df1 = pd.DataFrame({'x': x})
    df2 = df1.set_index('x', drop=False)  # non-unique, unsorted index
    df3 = df2.sort_index()                # non-unique, sorted index

    %timeit df1.loc[100]
    %timeit df2.loc[100]
    %timeit df3.loc[100]

Result:

    10000 loops, best of 3: 71.2 µs per loop
    10 loops, best of 3: 38.9 ms per loop
    10000 loops, best of 3: 134 µs per loop
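The `%timeit` lines above only work inside IPython. As a rough sketch of the same comparison in a plain Python script, the code below uses the standard `timeit` module on a smaller frame, and also spells out the binary search explicitly with `Index.searchsorted`. The helper `lookup_by_bisect`, the seed, and the frame size are assumptions made for this illustration; the helper is not how pandas implements `.loc` internally, only a way to see that two binary searches are enough on a sorted index.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the example above so the script finishes quickly.
rng = np.random.default_rng(0)
x = rng.integers(0, 200, 10**5)
df = pd.DataFrame({"x": x})

unsorted_df = df.set_index("x", drop=False)  # non-unique, unsorted index
sorted_df = unsorted_df.sort_index()         # non-unique, sorted index


def lookup_by_bisect(frame, key):
    """Select rows whose index equals key, assuming frame.index is sorted.

    Hypothetical helper: two binary searches find the left and right
    boundaries of the run of equal keys.
    """
    lo = frame.index.searchsorted(key, side="left")
    hi = frame.index.searchsorted(key, side="right")
    return frame.iloc[lo:hi]


by_loc = sorted_df.loc[100]
by_bisect = lookup_by_bisect(sorted_df, 100)
print("same rows:", by_loc.equals(by_bisect))

candidates = [
    ("unsorted .loc", lambda: unsorted_df.loc[100]),
    ("sorted .loc", lambda: sorted_df.loc[100]),
    ("sorted bisect", lambda: lookup_by_bisect(sorted_df, 100)),
]

for label, fn in candidates:
    # Best of 3 repeats, 10 calls each, reported per call.
    seconds = min(timeit.repeat(fn, number=10, repeat=3)) / 10
    print(f"{label}: {seconds * 1e6:.1f} µs per call")
```

Absolute timings depend on the machine and the pandas version, so treat the printed numbers as relative only.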
