What Is the Point of Indexing in Pandas

What is the point of indexing in pandas?

Like a dict, a DataFrame's index is backed by a hash table. Looking up rows
based on index values is like looking up dict values based on a key.

In contrast, the values in a column are like values in a list.

Looking up rows based on index values is faster than looking up rows based on column values.

For example, consider

import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])

Here is how you could look up any row where the df['index'] column equals 999.
Pandas has to loop through every value in the column to find the ones equal to 999.

df[df['index'] == 999]

#           foo  index
# 999  0.375489    999

Here is how you could look up any row where the index equals 999. With an index, Pandas uses the hash value to find the rows:

df_with_index.loc[999]
# foo        0.375489
# index    999.000000
# Name: 999, dtype: float64

Looking up rows by index is much faster than looking up rows by column value:

In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop

In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop

Note, however, that it takes time to build the index:

In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop

So having the index is only advantageous when you have many lookups of this type
to perform.
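As a rough sketch of the trade-off, using the timings above (illustrative only; the break-even point depends on data size and hardware):

# One-off cost: ~330 µs to build the index.
df_with_index = df.set_index(['index'])

# Each .loc lookup then saves roughly 310 µs versus a column scan
# (368 µs vs 57.7 µs), so the indexing cost is repaid after about
# two lookups of this kind.
for key in (10, 999, 5000):
    row = df_with_index.loc[key]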

Sometimes the index plays a role in reshaping the DataFrame: set_index, stack, unstack, pivot, pivot_table, melt,
lreshape, and crosstab all use or manipulate the index.
Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.
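For instance, stack and unstack pivot purely on index levels; a minimal sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'city': ['NY', 'NY', 'LA', 'LA'],
                   'year': [2020, 2021, 2020, 2021],
                   'sales': [10, 12, 7, 9]}).set_index(['city', 'year'])

# unstack moves the innermost index level up into the columns.
wide = df['sales'].unstack('year')
# year  2020  2021
# city
# LA       7     9
# NY      10    12

# stack is the inverse: column labels go back into the index.
long = wide.stack()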

Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.
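For example, resample groups rows purely by a datetime-like index; a minimal sketch with made-up minute-level data:

import numpy as np
import pandas as pd

idx = pd.date_range('2023-01-01', periods=1440, freq='min')
ts = pd.Series(np.random.random(1440), index=idx)

# Downsample to hourly means; the grouping is driven entirely by the index.
hourly = ts.resample('h').mean()

# Change the frequency and fill the gaps; again, the index defines the grid.
filled = ts.asfreq('30s').interpolate()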

So in the end, I think the index's usefulness, and the reason it shows up in so many functions, stems from its ability to perform fast hash lookups.

What is the purpose of floating point index in Pandas?

Float indices are generally useless for label-based indexing because of general floating-point restrictions. Of course, pd.Float64Index is there in the API for completeness, but that doesn't always mean you should use it. Jeff (a core library contributor) has this to say on GitHub:

[...] It is rarely necessary to actually use a float index; you are often
better off served by using a column. The point of the index is to make
individual elements faster, e.g. df[1.0], but this is quite tricky;
this is the reason for having an issue about this.

The tricky part is that two values that both display as 1.0 are not necessarily equal; whether they match depends on how each one is represented in bits.
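The classic illustration, using plain IEEE 754 arithmetic (nothing pandas-specific about the first two lines):

import pandas as pd

s = pd.Series([10, 20, 30], index=[0.1, 0.2, 0.1 + 0.2])

0.1 + 0.2         # 0.30000000000000004
0.1 + 0.2 == 0.3  # False

s.loc[0.1 + 0.2]  # 30 -- matches the exact stored value

try:
    s.loc[0.3]    # a different bit pattern from 0.1 + 0.2
except KeyError:
    print('0.3 is not a label in this index')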

Floating indices are useful in a few situations (as cited in the GitHub issue), mainly for recording a temporal axis (time) or extremely minute/accurate measurements in, for example, astronomical data. For most other cases there's pd.cut or pd.qcut for binning your data, because working with categorical data is usually easier than working with continuous data.
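A quick sketch of binning with pd.cut instead of indexing by raw floats (made-up measurements):

import numpy as np
import pandas as pd

measurements = pd.Series(np.random.uniform(0, 100, size=1000))

# Fixed-width bins; pd.qcut would give quantile-based bins instead.
bins = pd.cut(measurements, bins=[0, 25, 50, 75, 100],
              labels=['low', 'mid', 'high', 'peak'])

# Categorical bins make grouping straightforward.
measurements.groupby(bins).mean()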

Index objects in pandas: why df.columns returns an Index rather than a list

From the documentation for pandas.Index

Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects

Having a regular list as an index for a DataFrame could cause issues with unorderable or unhashable objects; since the index is backed by a hash table, the same principles apply as to why lists can't be dictionary keys in regular Python.

At the same time, the Index object being explicit permits us to use different types as an Index, as compared to the implicit integer index that NumPy has for instance, and perform fast lookups.
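A couple of things an Index offers that a plain list would not; a small sketch:

import pandas as pd

idx = pd.Index(['a', 'b', 'c'])

# Set-like operations, with order preserved:
idx.intersection(pd.Index(['b', 'c', 'd']))  # Index(['b', 'c'], dtype='object')

# Fast hash-based label lookup:
idx.get_loc('b')  # 1

# Array-style slicing:
idx[1:]  # Index(['b', 'c'], dtype='object')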

If you want to retrieve a list of column names, the Index object has a tolist method.

>>> df.columns.tolist()
['a', 'b', 'c']

Best practices for indexing with pandas

No, they are not the same. One uses direct syntax while the other relies on chained indexing.

The crucial points are:

  • pd.DataFrame.iloc is used primarily for integer position-based indexing.
  • pd.DataFrame.loc is most often used with labels or Boolean arrays.
  • Chained indexing, i.e. via df[x][y], is explicitly discouraged and is never necessary.
  • idx.values returns the NumPy array representation of the idx series. This cannot feed .iloc and is not necessary for .loc, which can take idx directly.

Below are two examples which would work. In either example, you can use similar syntax to mask a dataframe or series. For example, df['hr'].loc[mask] would work as well as df.loc[mask].
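Both snippets below assume a toy DataFrame along these lines (hypothetical column names, made-up data):

import numpy as np
import pandas as pd

df = pd.DataFrame({'timestamp': np.arange(20),
                   'hr': np.random.randint(60, 100, size=20)})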

iloc

Here we use numpy.where to extract integer indices of True elements in a Boolean series. iloc does accept Boolean arrays but, in my opinion, this is less clear; "i" stands for integer.

idx = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)  # Boolean series
mask = np.where(idx)[0]                                 # integer positions of the True values
df = df.iloc[mask]

loc

Using loc is more natural when we are already querying by specific series.

mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
df = df.loc[mask]
  • When masking only rows, you can omit the loc accessor altogether and use df[mask].
  • If masking rows and also selecting a single column, you can use df.loc[mask, 'col_name'] (both patterns are sketched below).
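To illustrate those last two points with the same toy frame:

mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)

# Rows only: the .loc accessor is optional here.
subset = df[mask]              # equivalent to df.loc[mask]

# Rows plus a single column:
hr_subset = df.loc[mask, 'hr']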

Indexing and Selecting Data is fundamental to pandas: there is no substitute for reading the official documentation.

What causes indexing past lexsort depth warning in Pandas?

TL;DR: your index is unsorted and this severely impacts performance.

Sort your DataFrame's index using df.sort_index() to address the warning and improve performance.


I've actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").

To reproduce,

import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])

df = pd.DataFrame({'col': np.arange(len(mux))}, mux)

         col
one two
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15

You'll notice that the second level is not properly sorted.

Now, try to index a specific cross section:

df.loc[pd.IndexSlice[('c', 'u')]]
PerformanceWarning: indexing past lexsort depth may impact performance.

         col
one two
c   u      9

You'll see the same behaviour with xs:

df.xs(('c', 'u'), axis=0)
PerformanceWarning: indexing past lexsort depth may impact performance.

         col
one two
c   u      9

The docs, backed by a timing test I once did, suggest that handling an unsorted index imposes a slowdown: indexing is O(n) when it could/should be O(1).

If you sort the index before slicing, you'll notice the difference:

df2 = df.sort_index()
df2.loc[pd.IndexSlice[('c', 'u')]]

col
one two
c u 9

%timeit df.loc[pd.IndexSlice[('c', 'u')]]
%timeit df2.loc[pd.IndexSlice[('c', 'u')]]

802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Finally, if you want to know whether the index is sorted, check with MultiIndex.is_lexsorted. (In newer pandas versions this method is deprecated; checking df.index.is_monotonic_increasing is the suggested replacement.)

df.index.is_lexsorted()
# False

df2.index.is_lexsorted()
# True

As for your question on how to induce this behaviour, simply permuting the index should suffice. The following works if your index is unique:

df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df.index))]

If your index is not unique, add a cumcounted level first:

df = df.set_index(
    df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True)
df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df.index))]
df2 = df2.reset_index(level=-1, drop=True)
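You can sanity-check that the permuted frame reproduces the warning:

df2.index.is_lexsorted()
# False

df2.loc[pd.IndexSlice[('c', 'u')]]
# PerformanceWarning: indexing past lexsort depth may impact performance.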

