What Is the Performance Impact of Non-Unique Indexes in Pandas

What is the performance impact of non-unique indexes in pandas?

When the index is unique, pandas uses a hash table to map keys to values, so lookups are O(1). When the index is non-unique but sorted, pandas uses binary search, O(log N). When the index is non-unique and in random order, pandas has to check every key in the index, O(N).

You can call the sort_index method to get the sorted case:

import numpy as np
import pandas as pd

x = np.random.randint(0, 200, 10**6)
df1 = pd.DataFrame({'x': x})          # unique RangeIndex
df2 = df1.set_index('x', drop=False)  # non-unique, unsorted index
df3 = df2.sort_index()                # non-unique, sorted index

%timeit df1.loc[100]
%timeit df2.loc[100]
%timeit df3.loc[100]

Result (df1, df2 and df3 respectively):

10000 loops, best of 3: 71.2 µs per loop
10 loops, best of 3: 38.9 ms per loop
10000 loops, best of 3: 134 µs per loop
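
If you are not sure which of these regimes a given index falls into, its attributes tell you directly; a quick check (is_unique and is_monotonic_increasing are standard Index attributes):

df1.index.is_unique                # True  -> hash-table lookups, O(1)
df2.index.is_unique                # False
df2.index.is_monotonic_increasing  # False -> non-unique and unsorted, O(N) scan
df3.index.is_monotonic_increasing  # True  -> non-unique but sorted, O(log N)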

Why does pandas allow non-unique indexes?

Disclaimer: a unique RangeIndex is always going to be the most performant option. This question seems to favour a unique index and specifically asks for cases where allowing non-unique indexes is desirable. For that reason, from this point forward unique indexes are not discussed, nor is performance, only the useful benefits of non-unique indexes.

In general, non-unique indexes are preferable whenever we need to keep track of where data originally came from. There are many cases where, in an intermediary phase, we need to know which row a value came from. This lets us do computations with respect to information that would either be lost if the index were unique, or would require an additional column to track. Below are just a few examples:



Interleaving multiple DataFrames:

Consider the following two DataFrames, and let's assume that each DataFrame represents a day's worth of data. We would like to review this daily data by Sample number rather than by day:

df1 = pd.DataFrame([['10:05', 'Day 1', 'Sample 1'],
                    ['11:14', 'Day 1', 'Sample 2']])
df2 = pd.DataFrame([['10:03', 'Day 2', 'Sample 1'],
                    ['11:12', 'Day 2', 'Sample 2']])

# df1
       0      1         2
0  10:05  Day 1  Sample 1
1  11:14  Day 1  Sample 2

# df2
       0      1         2
0  10:03  Day 2  Sample 1
1  11:12  Day 2  Sample 2

Because pandas allows non-unique indexes we can concat then sort_index:

pd.concat([df1, df2]).sort_index()

       0      1         2
0  10:05  Day 1  Sample 1
0  10:03  Day 2  Sample 1
1  11:14  Day 1  Sample 2
1  11:12  Day 2  Sample 2

Notice this is the fastest way to interleave two DataFrames by row index. Also notice that it would not be feasible to instead sort by columns 1 and 2, since strings like Day 1 Sample 1 are sorted lexicographically, which runs into issues for values like Day 10, or would require a fair amount of additional computation to handle the numeric parts correctly (see the quick check below).
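
To see the lexicographic pitfall mentioned above, note that plain string sorting puts 'Day 10' before 'Day 2':

sorted(['Day 2', 'Day 10', 'Day 1'])
# ['Day 1', 'Day 10', 'Day 2']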

We can pass ignore_index=True to sort_index, but this only hides the final overwrite with a new RangeIndex and still relies on concat returning a DataFrame with non-unique indexes:

pd.concat([df1, df2]).sort_index(ignore_index=True)

       0      1         2
0  10:05  Day 1  Sample 1
1  10:03  Day 2  Sample 1
2  11:14  Day 1  Sample 2
3  11:12  Day 2  Sample 2


Explode and Reduce

explode, particularly on a Series, is a common operation, and not losing the index (i.e. allowing duplicates) makes expand-and-reduce type operations much easier.

The goal is to remove any duplicate values from within a comma-separated string in a column:

df = pd.DataFrame({
    'corresponding_id': [10, 20, 30],
    'col': ['a,b,c,a', 'b,c,c,b', 'a,a,a,a']
})

df:

   corresponding_id      col
0                10  a,b,c,a
1                20  b,c,c,b
2                30  a,a,a,a

A common solution may look something like:

df['col'] = (
    df['col'].str.split(',').explode()
             .groupby(level=0).apply(lambda s: ','.join(np.unique(s)))
)

df:

   corresponding_id    col
0                10  a,b,c
1                20    b,c
2                30      a

After exploding the result looks like:

df['col'].str.split(',').explode()

0    a
0    b
0    c
0    a
1    b
1    c
1    c
1    b
2    a
2    a
2    a
2    a
Name: col, dtype: object

Because there are duplicate indexes, we can group by level=0 (the index). This is only possible because the index was preserved. If the index did not allow duplicates we would have:

0     a
1     b
2     c
3     a
4     b
5     c
6     c
7     b
8     a
9     a
10    a
11    a
Name: col, dtype: object

There would be no way to easily determine which rows the values came from, making it much more difficult to put them back in place.
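
For comparison, here is roughly what the workaround looks like when the origin row has to be carried explicitly as a column instead of living in the index (the column name index is just what reset_index produces by default):

# Hedged sketch: track the origin row in a column instead of the index.
s = df['col'].str.split(',').explode().reset_index()   # adds an 'index' column
df['col'] = s.groupby('index')['col'].apply(lambda g: ','.join(np.unique(g)))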



Scaling Up a DataFrame

The ability to select from a DataFrame using duplicate labels is extremely helpful in scaling up a DataFrame.

df = pd.DataFrame({
    'Count': [2, 4],
    'Value': [1, 6]
})

On occasion we need to scale up a DataFrame; in these cases we use loc to select from the DataFrame:

df.loc[[0, 0, 1, 1, 1, 1], :]

Notice the result is:

   Count  Value
0      2      1
0      2      1
1      4      6
1      4      6
1      4      6
1      4      6

We were able to select the same row multiple times from the DataFrame based on duplicate labels (and the resulting index is non-unique). This is so common that there is a method Index.repeat that does this dynamically based on a column:

df.loc[df.index.repeat(df['Count']), :]

   Count  Value
0      2      1
0      2      1
1      4      6
1      4      6
1      4      6
1      4      6
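
If the duplicated labels are only needed for the scaling step itself, a common follow-up is to reset to a fresh RangeIndex afterwards:

df.loc[df.index.repeat(df['Count']), :].reset_index(drop=True)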

Does a pandas index have a performance advantage over a regular column?

One answer is in terms of DataFrame size. I have a DataFrame with roughly 50 million rows:

df_Usage.info()

output

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49991484 entries, 0 to 49991483
Data columns (total 7 columns):
BILL_ACCOUNT_NBR         int64
MM_ADJ_BILLING_YEARMO    int64
BILLING_USAGE_QTY        float64
BILLING_DAYS_CNT         int64
TARIFF_RATE_TYP          object
READ_FROM                object
READ_TO                  object
dtypes: float64(1), int64(3), object(3)
memory usage: 2.6+ GB

Setting the first two columns as the index (one of them converted to a datetime):

df_Usage['MM_ADJ_BILLING_YEARMO'] = pd.to_datetime(df_Usage['MM_ADJ_BILLING_YEARMO'], format='%Y%m')
df_Usage.set_index(['BILL_ACCOUNT_NBR', 'MM_ADJ_BILLING_YEARMO'], inplace=True)
df_Usage.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 49991484 entries, (5659128163, 2020-09-01 00:00:00) to (7150058108, 2020-01-01 00:00:00)
Data columns (total 5 columns):
BILLING_USAGE_QTY    float64
BILLING_DAYS_CNT     int64
TARIFF_RATE_TYP      object
READ_FROM            object
READ_TO              object
dtypes: float64(1), int64(1), object(3)
memory usage: 2.1+ GB

That is roughly a 20% reduction in reported memory usage (2.6+ GB down to 2.1+ GB).
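
If you want to verify this on your own data, memory_usage() returns the per-column (and index) byte counts that the memory usage line of .info() summarizes; a minimal check, assuming df_Usage is already loaded:

# Shallow estimate, matching the "2.x+ GB" figures above; pass deep=True to
# also measure the Python objects inside the string columns.
df_Usage.memory_usage(deep=False).sum() / 1024**3   # total in GiB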

What causes the "indexing past lexsort depth" warning in pandas?

TL;DR: your index is unsorted and this severely impacts performance.

Sort your DataFrame's index using df.sort_index() to address the warning and improve performance.


I've actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").

To reproduce,

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

You'll notice that the second level is not properly sorted.

Now, try to index a specific cross section:

df.loc[pd.IndexSlice[('c', 'u')]]
PerformanceWarning: indexing past lexsort depth may impact performance.

         col
one two
c   u      9

You'll see the same behaviour with xs:

df.xs(('c', 'u'), axis=0)
PerformanceWarning: indexing past lexsort depth may impact performance.

         col
one two
c   u      9

The docs, backed by a timing test I once did, suggest that handling an unsorted index imposes a slowdown: indexing is O(N) when it could/should be O(1).

If you sort the index before slicing, you'll notice the difference:

df2 = df.sort_index()
df2.loc[pd.IndexSlice[('c', 'u')]]

         col
one two
c   u      9

%timeit df.loc[pd.IndexSlice[('c', 'u')]]   # unsorted
%timeit df2.loc[pd.IndexSlice[('c', 'u')]]  # sorted

802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.

df.index.is_lexsorted()
# False

df2.index.is_lexsorted()
# True
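
Note that MultiIndex.is_lexsorted has been deprecated in newer pandas releases; checking whether the index is monotonically increasing is the equivalent test there and works the same way:

df.index.is_monotonic_increasing   # False -- unsorted, triggers the warning
df2.index.is_monotonic_increasing  # True  -- sorted with sort_index()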

As for your question on how to induce this behaviour, simply permuting the indices should suffice. This works if your index is unique:

df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]

If your index is not unique, add a cumcounted level first,

df2 = df.set_index(
    df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True)
df2 = df2.loc[pd.MultiIndex.from_tuples(np.random.permutation(df2.index))]
df2 = df2.reset_index(level=-1, drop=True)
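
A quick sanity check that the shuffle actually produced a warning-inducing index (the exact warning behaviour varies slightly between pandas versions):

df2.index.is_monotonic_increasing   # expected False after the permutation
df2.loc[('c', 'u')]                 # should emit the PerformanceWarning again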

Pandas Index Filter is Slower than Non-Index Column Filter

.loc is largely implemented in Python, so it's slow.

The first way (the boolean-mask filter built with isin) does two things:

  1. Compute .isin. Here's a link to the path that your code is taking. It relies on the hashtable module, which is written in Cython (and runs at near-C speed).
  2. Once you've computed a mask, applying it is mostly done with NumPy, which again means C speed.

The moral of the story is that staying in C/Cython land is faster than operating in Python land.
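
The question's exact code is not reproduced here, but the two approaches being contrasted are roughly of this shape (col, val and the key list are placeholder names):

import numpy as np
import pandas as pd

df = pd.DataFrame({'col': np.random.randint(0, 1000, 10**6),
                   'val': np.random.rand(10**6)})
keys = list(range(100))

# 1) Column filter: isin builds the boolean mask in cython and applying the
#    mask is handled by numpy, so both steps stay near C speed.
by_column = df[df['col'].isin(keys)]

# 2) Index filter: resolving the labels goes through more Python-level code
#    inside .loc, which is where the extra overhead comes from.
by_index = df.set_index('col').loc[keys]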

Indexed lookup on pandas dataframe. Why so slow? How to speed up?

Repeated indices are guaranteed to slow down your dataframe indexing operations. You can amend your inputs to prove this to yourself:

# `common` is the list of lookup keys from the original question
a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
%timeit a.loc[common]  # 34.1 ms  (repeated, unsorted index)

a = pd.Series(data=-np.arange(100000), index=np.arange(100000))
%timeit a.loc[common]  # 6.86 ms  (unique index)

As mentioned in this related question:

When the index is unique, pandas uses a hash table to map keys to values, O(1).
When the index is non-unique but sorted, pandas uses binary search, O(log N).
When the index is non-unique and in random order, pandas has to check every
key in the index, O(N).
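
Following that logic, sorting the repeated index should move lookups from the O(N) regime into the O(log N) one. A sketch along the same lines as the snippet above (this common is a stand-in for the key list in the original question):

a = pd.Series(data=-np.arange(100000), index=np.random.randint(0, 50000, 100000))
common = np.unique(a.index)[:1000]   # hypothetical stand-in for the question's keys

a_sorted = a.sort_index()
%timeit a.loc[common]          # non-unique, unsorted index
%timeit a_sorted.loc[common]   # non-unique, sorted index -- noticeably faster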

How to get better performance when taking the mean of a specific subset of a dataframe in pandas?

Pandas' .mean() seems to have a reputation for being slow.

One idea is to use NumPy by converting the relevant columns with pandas' built-in .to_numpy(). However, if you want a column-wise mean, NumPy's .mean() needs an axis specification (axis=0); otherwise it calculates the mean of all values in the array.

import pandas as pd
import numpy as np
import random

# from @totalhack
mean_cols = ["A", "B"]

df = pd.DataFrame({
    "A": range(0, 50000),
    "B": range(0, 50000)
})

key_list = random.sample(range(50000), k=50000)
# In case key_list contains row names (index labels), convert them into
# positional row indexes, because the numpy array has no labels. E.g.:
# my_rownames = list(your_df_with_rownames.index)
# key_list = [my_rownames.index(k) for k in your_old_keylist]

df_mc = df[mean_cols].to_numpy()

rows_list = []

for key in key_list:

    means_after = df_mc[key:key+5].mean(axis=0)
    means_before = df_mc[key-5:key].mean(axis=0)
    row_dict = {}

    for col in mean_cols:
        row_dict[str(col + '_after')] = round(means_after[mean_cols.index(col)], 2)
        row_dict[str(col + '_before')] = round(means_before[mean_cols.index(col)], 2)

    rows_list.append(row_dict)

If the DataFrame has only numeric values, converting it to a NumPy array as early as possible accelerates the calculations a lot more. However, I assume there are text or date columns too, so the earliest point to convert to a NumPy array is right after subsetting mean_cols, which is where I put .to_numpy().
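
If the frame does mix numeric and non-numeric columns, something like select_dtypes can isolate the numeric part before converting; a small sketch:

# Keep only the numeric columns, then hand a contiguous array to numpy.
numeric_part = df.select_dtypes(include='number').to_numpy()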

Or use parallelization (multiple CPUs in parallel):


df_mc = df[mean_cols].to_numpy()

def mean_after(key, np_array=df_mc):
    return list(np.round(np_array[key:key+5].mean(axis=0), 2))

def mean_before(key, np_array=df_mc):
    return list(np.round(np_array[key-5:key].mean(axis=0), 2))

import multiprocessing as mp

pool = mp.Pool()

afters = pool.map(mean_after, key_list)
befores = pool.map(mean_before, key_list)

# For what do you need rows_list with a dictionary for each column value?
# Why not access the afters and befores directly, like this?

afters[row_idx][mean_cols.index(col)]
befores[row_idx][mean_cols.index(col)]
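
A different angle, not from the answer above: since every key asks for the mean of a fixed 5-row window, pandas' rolling can compute all the window means in one vectorized pass, which is usually cheaper than looping or spawning processes. A sketch, assuming the keys are positional row numbers:

# rolling(5).mean() at position p is the mean of rows p-4..p.
window_means = df[mean_cols].rolling(5).mean()

# The mean of rows key..key+4 ("after") sits at position key+4, so shift it back;
# the mean of rows key-5..key-1 ("before") sits at position key-1.
means_after = window_means.shift(-4)
means_before = window_means.shift(1)

# e.g. the "after" means for every key at once (keys near the edges come out
# as NaN here, unlike the slicing loop, which shortens the window instead):
result_after = means_after.iloc[key_list].round(2)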


