What is the point of indexing in pandas?
Like a dict, a DataFrame's index is backed by a hash table. Looking up rows
based on index values is like looking up dict values based on a key.
In contrast, the values in a column are like values in a list.
Looking up rows based on index values is faster than looking up rows based on column values.
For example, consider
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])
Here is how you could look up any row where the df['index'] column equals 999.
Pandas has to scan every value in the column to find the ones equal to 999.
df[df['index'] == 999]
#           foo  index
# 999  0.375489    999
Here is how you could look up any row where the index equals 999. With an index, pandas uses the hash value to find the rows:
df_with_index.loc[999]
# foo 0.375489
# index 999.000000
# Name: 999, dtype: float64
Looking up rows by index is much faster than looking up rows by column value:
In [254]: %timeit df[df['index'] == 999]
1000 loops, best of 3: 368 µs per loop
In [255]: %timeit df_with_index.loc[999]
10000 loops, best of 3: 57.7 µs per loop
Note however, it takes time to build the index:
In [220]: %timeit df.set_index(['index'])
1000 loops, best of 3: 330 µs per loop
So having the index is only advantageous when you have many lookups of this type
to perform.
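The trade-off above can be checked end to end. A minimal sketch, assuming only pandas and NumPy, showing that both lookup styles find the same row:

```python
import numpy as np
import pandas as pd

# Build the same frame as above: one random column plus a lookup key.
df = pd.DataFrame({'foo': np.random.random(10000), 'index': range(10000)})
df_with_index = df.set_index(['index'])

# Column scan: pandas compares every value in the column.
by_column = df[df['index'] == 999]

# Hash lookup: the index maps 999 straight to the row.
by_index = df_with_index.loc[999]

# Both approaches recover the same 'foo' value.
assert by_column['foo'].iloc[0] == by_index['foo']
```

The column scan returns a (possibly empty) DataFrame, while the index lookup returns the row as a Series; that difference in return type is worth keeping in mind when you switch between the two.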
Sometimes the index plays a role in reshaping the DataFrame. Many functions, such as set_index, stack, unstack, pivot, pivot_table, melt, lreshape, and crosstab, all use or manipulate the index.
Sometimes we want the DataFrame in a different shape for presentation purposes, or for join, merge or groupby operations. (As you note, joining can also be done based on column values, but joining based on the index is faster.) Behind the scenes, join, merge and groupby take advantage of fast index lookups when possible.
Time series have resample, asfreq and interpolate methods whose underlying implementations take advantage of fast index lookups too.
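As a small illustration of index-driven reshaping, unstack moves an index level into the columns and stack moves it back. A minimal sketch with hypothetical toy data:

```python
import pandas as pd

# A Series with a two-level index: (store, quarter) -> sales.
idx = pd.MultiIndex.from_product([['A', 'B'], ['Q1', 'Q2']],
                                 names=['store', 'quarter'])
sales = pd.Series([10, 20, 30, 40], index=idx)

# unstack pivots the inner index level ('quarter') into columns.
wide = sales.unstack()
# quarter  Q1  Q2
# store
# A        10  20
# B        30  40

# stack is the inverse: the columns go back into the index.
assert sales.equals(wide.stack())
```

Neither operation would be possible without an index: the labels being moved between rows and columns are exactly what the index stores.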
So in the end, I think the origin of the index's usefulness, and why it shows up in so many functions, is its ability to perform fast hash lookups.
What is the purpose of floating point index in Pandas?
Float indices are generally a poor fit for label-based indexing because of general floating point restrictions. Of course, pd.Float64Index is there in the API for completeness, but that doesn't always mean you should use it. Jeff (core library contributor) has this to say on GitHub:
[...] It is rarely necessary to actually use a float index; you are often
better off served by using a column. The point of the index is to make
individual elements faster, e.g. df[1.0], but this is quite tricky;
this is the reason for having an issue about this.
The tricky part being that a value which displays as 1.0 isn't always exactly equal to another 1.0, depending on how each is represented in bits.
Floating indices are useful in a few situations (as cited in the GitHub issue), mainly for recording a temporal axis, or extremely minute/accurate measurements in, for example, astronomical data. For most other cases there's pd.cut or pd.qcut for binning your data, because working with categorical data is usually easier than working with continuous data.
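To make the binning suggestion concrete, here is a minimal sketch, assuming hypothetical toy measurements, of how pd.cut turns continuous values into labeled categories that are safe to match exactly:

```python
import pandas as pd

# Continuous measurements that would make a poor float index.
readings = pd.Series([0.12, 0.48, 0.51, 0.87, 0.93])

# Bin into three labeled categories instead of matching exact floats.
binned = pd.cut(readings, bins=[0, 0.33, 0.66, 1.0],
                labels=['low', 'mid', 'high'])

# Grouping by the bins is robust to floating point representation.
counts = readings.groupby(binned, observed=True).count()
# low     1
# mid     2
# high    2
```

Comparing the string labels 'low'/'mid'/'high' never runs into the bit-level equality problem that raw floats have.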
Index objects in pandas: why df.columns returns an Index rather than a list
From the documentation for pandas.Index:
Immutable ndarray implementing an ordered, sliceable set. The basic object storing axis labels for all pandas objects.
Having a regular list as an index for a DataFrame could cause issues with unorderable or unhashable objects: since the index is backed by a hash table, the same principles apply as to why lists can't be dictionary keys in regular Python.
At the same time, the Index object being explicit permits us to use different types as an Index, as compared to the implicit integer index that NumPy has for instance, and perform fast lookups.
If you want to retrieve a list of column names, the Index object has a tolist method.
>>> df.columns.tolist()
['a', 'b', 'c']
Best practices for indexing with pandas
No, they are not the same. One uses direct syntax while the other relies on chained indexing. The crucial points are:
- pd.DataFrame.iloc is used primarily for integer position-based indexing.
- pd.DataFrame.loc is most often used with labels or Boolean arrays.
- Chained indexing, i.e. via df[x][y], is explicitly discouraged and is never necessary.
- idx.values returns the NumPy array representation of the idx series. A Boolean Series cannot feed .iloc directly (though its underlying array can), and .values is not necessary for .loc, which can take idx directly.
Below are two examples which would work. In either example, you can use similar syntax to mask a DataFrame or Series. For example, df['hr'].loc[mask] would work as well as df.loc[mask].
iloc
Here we use numpy.where to extract the integer indices of True elements in a Boolean series. iloc does accept Boolean arrays but, in my opinion, this is less clear; the "i" stands for integer.
idx = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
mask = np.where(idx)[0]
df = df.iloc[mask]
loc
Using loc is more natural when we are already querying by a specific series.
mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)
df = df.loc[mask]
- When masking only rows, you can omit the loc accessor altogether and use df[mask].
- If masking by rows and filtering for a column, you can use df.loc[mask, 'col_name'].
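The points above can be put together in one runnable sketch, assuming a hypothetical frame with 'timestamp' and 'hr' columns like the one in the question:

```python
import numpy as np
import pandas as pd

# Toy data: hr is just timestamp doubled, for easy checking.
df = pd.DataFrame({'timestamp': np.arange(20), 'hr': np.arange(20) * 2})

# One Boolean mask, used three equivalent ways for row selection.
mask = (df['timestamp'] >= 5) & (df['timestamp'] <= 10)

rows_loc = df.loc[mask]                  # explicit label-based masking
rows_bare = df[mask]                     # shorthand when masking rows only
rows_iloc = df.iloc[np.where(mask)[0]]   # integer positions of True entries

assert rows_loc.equals(rows_bare)
assert rows_loc.equals(rows_iloc)

# Mask rows and select a single column in one step.
hr_vals = df.loc[mask, 'hr']
```

All three row selections return identical frames; the .loc[mask, 'col_name'] form is the one to reach for when you also need a column, since it avoids chained indexing.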
Indexing and Selecting Data is fundamental to pandas: there is no substitute for reading the official documentation.
What causes indexing past lexsort depth warning in Pandas?
TL;DR: your index is unsorted and this severely impacts performance.
Sort your DataFrame's index using df.sort_index()
to address the warning and improve performance.
I've actually written about this in detail in my writeup: Select rows in pandas MultiIndex DataFrame (under "Question 3").
To reproduce,
import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_arrays([
list('aaaabbbbbccddddd'),
list('tuvwtuvwtuvwtuvw')
], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, mux)
         col
one two
a   t      0
    u      1
    v      2
    w      3
b   t      4
    u      5
    v      6
    w      7
    t      8
c   u      9
    v     10
d   w     11
    t     12
    u     13
    v     14
    w     15
You'll notice that the second level is not properly sorted.
Now, try to index a specific cross section:
df.loc[pd.IndexSlice[('c', 'u')]]
PerformanceWarning: indexing past lexsort depth may impact performance.
         col
one two
c   u      9
You'll see the same behaviour with xs
:
df.xs(('c', 'u'), axis=0)
PerformanceWarning: indexing past lexsort depth may impact performance.
         col
one two
c   u      9
The docs, backed by a timing test I once did, suggest that handling an unsorted index imposes a slowdown: indexing is O(n) when it could/should be O(1).
If you sort the index before slicing, you'll notice the difference:
df2 = df.sort_index()
df2.loc[pd.IndexSlice[('c', 'u')]]
         col
one two
c   u      9
%timeit df.loc[pd.IndexSlice[('c', 'u')]]
%timeit df2.loc[pd.IndexSlice[('c', 'u')]]
802 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
648 µs ± 20.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Finally, if you want to know whether the index is sorted or not, check with MultiIndex.is_lexsorted.
df.index.is_lexsorted()
# False
df2.index.is_lexsorted()
# True
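Note that MultiIndex.is_lexsorted was deprecated in later pandas releases (and removed in pandas 2.0); checking index.is_monotonic_increasing is a forward-compatible way to test whether the index is fully sorted. A minimal sketch with a hypothetical toy frame:

```python
import pandas as pd

# Toy frame with an unsorted two-level index.
mux = pd.MultiIndex.from_arrays([list('baab'), list('ttuu')],
                                names=['one', 'two'])
unsorted_df = pd.DataFrame({'col': range(4)}, index=mux)

# is_monotonic_increasing works across pandas versions,
# unlike the deprecated MultiIndex.is_lexsorted.
assert not unsorted_df.index.is_monotonic_increasing
assert unsorted_df.sort_index().index.is_monotonic_increasing
```

The two checks are not identical in general (is_lexsorted reports partial lexsort depth), but for the usual "is my index fully sorted?" question, is_monotonic_increasing gives the answer.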
As for your question on how to induce this behaviour, simply permuting the index should suffice. This works if your index is unique:
df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df.index))]
If your index is not unique, append a cumcount level first:
df = df.set_index(
    df.groupby(level=list(range(len(df.index.levels)))).cumcount(), append=True)
df2 = df.loc[pd.MultiIndex.from_tuples(np.random.permutation(df.index))]
df2 = df2.reset_index(level=-1, drop=True)