python - What is the performance impact of non-unique indexes in pandas?
From the pandas documentation, I've gathered that unique-valued indices make certain operations efficient, and that non-unique indices are occasionally tolerated.
From the outside, it doesn't look like non-unique indices are taken advantage of in any way. For example, the following ix query is slow enough that it seems to be scanning the entire dataframe:
    In [23]: import numpy as np
    In [24]: import pandas as pd
    In [25]: x = np.random.randint(0, 10**7, 10**7)
    In [26]: df1 = pd.DataFrame({'x':x})
    In [27]: df2 = df1.set_index('x', drop=False)
    In [28]: %timeit df2.ix[0]
    1 loops, best of 3: 402 ms per loop
    In [29]: %timeit df1.ix[0]
    10000 loops, best of 3: 123 µs per loop
(I realize the two ix queries don't return the same thing; it's just an example showing that calls to ix on a non-unique index appear much slower.)
Is there any way to coax pandas into using faster lookup methods, such as binary search, on non-unique and/or sorted indices?
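When the index is unique, pandas uses a hashtable to map each key to its value, which is O(1). When the index is non-unique but sorted, pandas uses binary search, which is O(log N). When the index is non-unique and in random order, pandas needs to check every key in the index, which is O(N).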
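To see which of these three cases a given frame falls into, you can inspect the index directly. This is only a minimal sketch: `describe_index` is a hypothetical helper name, and the seeded random data is an assumption made so the output is reproducible. `Index.is_unique` and `Index.is_monotonic_increasing` are existing pandas properties.

```python
import numpy as np
import pandas as pd

# Assumed sample data: 10**4 draws from only 200 values, so duplicates are certain.
rng = np.random.default_rng(0)
x = rng.integers(0, 200, 10**4)

df = pd.DataFrame({'x': x}).set_index('x', drop=False)
df_sorted = df.sort_index()

def describe_index(index):
    """Return (is_unique, is_sorted) for a pandas Index (hypothetical helper)."""
    return index.is_unique, index.is_monotonic_increasing

print(describe_index(pd.RangeIndex(5)))   # (True, True): unique index
print(describe_index(df.index))           # (False, False): non-unique, unsorted
print(describe_index(df_sorted.index))    # (False, True): non-unique, sorted
```

A frame that reports `(False, False)` is in the slow case described above; sorting its index moves it to the `(False, True)` case.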
You can call the sort_index method to sort the index, so that lookups can use binary search:
    import numpy as np
    import pandas as pd
    x = np.random.randint(0, 200, 10**6)
    df1 = pd.DataFrame({'x':x})
    df2 = df1.set_index('x', drop=False)
    df3 = df2.sort_index()
    %timeit df1.loc[100]
    %timeit df2.loc[100]
    %timeit df3.loc[100]
Result:
    10000 loops, best of 3: 71.2 µs per loop
    10 loops, best of 3: 38.9 ms per loop
    10000 loops, best of 3: 134 µs per loop
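To illustrate why the sorted index is so much faster, here is a rough sketch of a binary-search lookup written by hand with `Index.searchsorted`. It is not pandas' internal implementation, only an illustration of the idea. `lookup_sorted` is a hypothetical helper, and the seeded sample data is an assumption.

```python
import numpy as np
import pandas as pd

# Assumed sample data: a non-unique index, sorted so that equal keys are contiguous.
rng = np.random.default_rng(0)
x = rng.integers(0, 200, 10**5)
df3 = pd.DataFrame({'x': x}).set_index('x', drop=False).sort_index()

def lookup_sorted(df, key):
    """Return all rows whose index equals key, assuming df.index is sorted.

    Hypothetical helper: two binary searches find the first and one-past-last
    positions of key, then a positional slice extracts the rows.
    """
    lo = df.index.searchsorted(key, side='left')
    hi = df.index.searchsorted(key, side='right')
    return df.iloc[lo:hi]

rows = lookup_sorted(df3, 100)
# The hand-written lookup selects the same rows as the label-based one.
print(rows.equals(df3.loc[100]))
```

Because equal keys sit next to each other in a sorted index, two O(log N) searches are enough to bound the matching block, and no full scan is needed.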