How to query MultiIndex index columns values in pandas
To query the df by the MultiIndex values, for example where (A > 1.7) and (B < 666):
In [536]: result_df = df.loc[(df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)]
In [537]: result_df
Out[537]:
          C
A   B
3.3 222  43
    333  59
5.5 333  56
Hence, to get for example the 'A' index values, if still required:
In [538]: result_df.index.get_level_values('A')
Out[538]: Index([3.3, 3.3, 5.5], dtype=object)
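For a self-contained illustration, here is a hypothetical stand-in for df (the original construction is not shown in the answer; the index values below are chosen to reproduce the output above):

```python
import pandas as pd

# Hypothetical sample frame with a two-level index named 'A' and 'B'
df = pd.DataFrame(
    {'C': [11, 43, 59, 56, 80]},
    index=pd.MultiIndex.from_tuples(
        [(1.1, 222), (3.3, 222), (3.3, 333), (5.5, 333), (5.5, 777)],
        names=['A', 'B']),
)

# Boolean masks built from the index levels can be combined with & and |
mask = (df.index.get_level_values('A') > 1.7) & (df.index.get_level_values('B') < 666)
result_df = df[mask]
print(result_df)  # rows (3.3, 222), (3.3, 333), (5.5, 333)
```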
The problem is that, in large data frames, selection by index performs about 10% worse than selection on sorted regular columns, and in repetitive, looping work the delay accumulates. See this example:
In [558]: df = store.select(STORE_EXTENT_BURSTS_DF_KEY)
In [559]: len(df)
Out[559]: 12857
In [560]: df.sort_index(inplace=True)
In [561]: df_without_index = df.reset_index()
In [562]: %timeit df.loc[(df.index.get_level_values('END_TIME') > 358200) & (df.index.get_level_values('START_TIME') < 361680)]
1000 loops, best of 3: 562 µs per loop
In [563]: %timeit df_without_index[(df_without_index.END_TIME > 358200) & (df_without_index.START_TIME < 361680)]
1000 loops, best of 3: 507 µs per loop
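The two selection styles being timed above can be sketched on synthetic data (a sketch; the START_TIME/END_TIME frame itself is not shown in the answer, so the shape and level names here are assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with START_TIME/END_TIME index levels
idx = pd.MultiIndex.from_product(
    [np.arange(1000), np.arange(10)], names=['START_TIME', 'END_TIME'])
df = pd.DataFrame({'val': np.arange(len(idx))}, index=idx)
df = df.sort_index()  # sort once up front; repeated lookups then pay no lexsort penalty

# index-based selection
by_index = df[(df.index.get_level_values('START_TIME') > 500)
              & (df.index.get_level_values('END_TIME') < 5)]

# column-based selection on a reset frame
flat = df.reset_index()
by_cols = flat[(flat.START_TIME > 500) & (flat.END_TIME < 5)]

assert len(by_index) == len(by_cols)
```

Wrap either selection in %timeit to reproduce the comparison on your own data.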
Python/Pandas - Query a MultiIndex Column
To answer my own question, it looks like query shouldn't be used at all (regardless of using MultiIndex columns) for selecting certain columns, based on the answer(s) here:
Select columns using pandas dataframe.query()
Select rows in pandas MultiIndex DataFrame
MultiIndex / Advanced Indexing
This post will be structured in the following manner:

- The questions put forth in the OP will be addressed, one by one
- For each question, one or more methods applicable to solving this problem and getting the expected result will be demonstrated.

Note
Notes (much like this one) will be included for readers interested in learning about additional functionality, implementation details, and other info cursory to the topic at hand. These notes have been compiled through scouring the docs and uncovering various obscure features, and from my own (admittedly limited) experience.
All code samples have been created and tested on pandas v0.23.4, python 3.7. If something is not clear, or factually incorrect, or if you did not find a solution applicable to your use case, please feel free to suggest an edit, request clarification in the comments, or open a new question, ....as applicable.

Here is an introduction to some common idioms (henceforth referred to as the Four Idioms) we will be frequently re-visiting:

- DataFrame.loc - A general solution for selection by label (+ pd.IndexSlice for more complex applications involving slices)
- DataFrame.xs - Extract a particular cross section from a Series/DataFrame.
- DataFrame.query - Specify slicing and/or filtering operations dynamically (i.e., as an expression that is evaluated dynamically). Is more applicable to some scenarios than others. Also see this section of the docs for querying on MultiIndexes.
- Boolean indexing with a mask generated using MultiIndex.get_level_values (often in conjunction with Index.isin, especially when filtering with multiple values). This is also quite useful in some circumstances.
Question 1
How do I select rows having "a" in level "one"?

         col
one two
a   t      0
    u      1
    v      2
    w      3

You can use loc, as a general purpose solution applicable to most situations:

df.loc[['a']]
At this point, if you get TypeError: Expected tuple, got str, that means you're using an older version of pandas. Consider upgrading! Otherwise, use df.loc[('a', slice(None)), :].

Alternatively, you can use xs here, since we are extracting a single cross section. Note the levels and axis arguments (reasonable defaults can be assumed here).
df.xs('a', level=0, axis=0, drop_level=False)
# df.xs('a', drop_level=False)
Here, the drop_level=False argument is needed to prevent xs from dropping level "one" in the result (the level we sliced on).

Yet another option here is using query:

df.query("one == 'a'")
If the index did not have a name, you would need to change your query string to be "ilevel_0 == 'a'".

Finally, using get_level_values:
df[df.index.get_level_values('one') == 'a']
# If your levels are unnamed, or if you need to select by position (not label),
# df[df.index.get_level_values(0) == 'a']
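As a quick check that the four approaches agree, here is a small self-contained frame of the same shape (an assumed reconstruction, not the answer's own code):

```python
import numpy as np
import pandas as pd

# Small frame with two index levels named 'one' and 'two'
mux = pd.MultiIndex.from_arrays(
    [list('aaaabbbb'), list('tuvwtuvw')], names=['one', 'two'])
df = pd.DataFrame({'col': np.arange(len(mux))}, index=mux)

r1 = df.loc[['a']]
r2 = df.xs('a', level=0, axis=0, drop_level=False)
r3 = df.query("one == 'a'")
r4 = df[df.index.get_level_values('one') == 'a']

# All four return the same rows with both index levels intact
assert r1.equals(r2) and r1.equals(r3) and r1.equals(r4)
```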
Additionally, how would I be able to drop level "one" in the output?

     col
two
t      0
u      1
v      2
w      3

This can be easily done using either

df.loc['a'] # Notice the single string argument instead of the list.

Or,

df.xs('a', level=0, axis=0, drop_level=True)
# df.xs('a')
Notice that we can omit the drop_level argument (it is assumed to be True by default).

Note
You may notice that a filtered DataFrame may still have all the levels, even if they do not show when printing the DataFrame out. For example,

v = df.loc[['a']]
print(v)
         col
one two
a   t      0
    u      1
    v      2
    w      3

print(v.index)
MultiIndex(levels=[['a', 'b', 'c', 'd'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])

You can get rid of these levels using MultiIndex.remove_unused_levels:

v.index = v.index.remove_unused_levels()
print(v.index)
MultiIndex(levels=[['a'], ['t', 'u', 'v', 'w']],
           labels=[[0, 0, 0, 0], [0, 1, 2, 3]],
           names=['one', 'two'])
Question 1b
How do I slice all rows with value "t" on level "two"?

         col
one two
a   t      0
b   t      4
    t      8
d   t     12

Intuitively, you would want something involving slice():

df.loc[(slice(None), 't'), :]

It Just Works!™ But it is clunky. We can facilitate a more natural slicing syntax using the pd.IndexSlice API here.

idx = pd.IndexSlice
df.loc[idx[:, 't'], :]
This is much, much cleaner.

Note
Why is the trailing slice : across the columns required? This is because loc can be used to select and slice along both axes (axis=0 or axis=1). Without explicitly making it clear which axis the slicing is to be done on, the operation becomes ambiguous. See the big red box in the documentation on slicing.
If you want to remove any shade of ambiguity, loc accepts an axis parameter:

df.loc(axis=0)[pd.IndexSlice[:, 't']]

Without the axis parameter (i.e., just by doing df.loc[pd.IndexSlice[:, 't']]), slicing is assumed to be on the columns, and a KeyError will be raised in this circumstance.
This is documented in slicers. For the purpose of this post, however, we will explicitly specify all axes.

With xs, it is

df.xs('t', axis=0, level=1, drop_level=False)
With query, it is

df.query("two == 't'")
# Or, if the second level has no name,
# df.query("ilevel_1 == 't'")
And finally, with get_level_values, you may do

df[df.index.get_level_values('two') == 't']
# Or, to perform selection by position/integer,
# df[df.index.get_level_values(1) == 't']
All to the same effect.

Question 2
How can I select rows corresponding to items "b" and "d" in level "one"?

         col
one two
b   t      4
    u      5
    v      6
    w      7
    t      8
d   w     11
    t     12
    u     13
    v     14
    w     15

Using loc, this is done in a similar fashion by specifying a list:

df.loc[['b', 'd']]

To solve the above problem of selecting "b" and "d", you can also use query:

items = ['b', 'd']
df.query("one in @items")
# df.query("one == @items", parser='pandas')
# df.query("one in ['b', 'd']")
# df.query("one == ['b', 'd']", parser='pandas')
Note
Yes, the default parser is 'pandas', but it is important to highlight that this syntax isn't conventionally python. The pandas parser generates a slightly different parse tree from the expression. This is done to make some operations more intuitive to specify. For more information, please read my post on Dynamic Expression Evaluation in pandas using pd.eval().

And, with get_level_values + Index.isin:

df[df.index.get_level_values("one").isin(['b', 'd'])]
Question 2b
How would I get all values corresponding to "t" and "w" in level "two"?

         col
one two
a   t      0
    w      3
b   t      4
    w      7
    t      8
d   w     11
    t     12
    w     15

With loc, this is possible only in conjunction with pd.IndexSlice:

df.loc[pd.IndexSlice[:, ['t', 'w']], :]

The first colon : in pd.IndexSlice[:, ['t', 'w']] means to slice across the first level. As the depth of the level being queried increases, you will need to specify more slices, one per level being sliced across. You will not need to specify more levels beyond the one being sliced, however.

With query, this is
items = ['t', 'w']
df.query("two in @items")
# df.query("two == @items", parser='pandas')
# df.query("two in ['t', 'w']")
# df.query("two == ['t', 'w']", parser='pandas')
With get_level_values and Index.isin (similar to above):

df[df.index.get_level_values('two').isin(['t', 'w'])]
Question 3
How do I retrieve a cross section, i.e., a single row having specific values for the index from df? Specifically, how do I retrieve the cross section of ('c', 'u'), given by

         col
one two
c   u      9

Use loc by specifying a tuple of keys:

df.loc[('c', 'u'), :]

Or,

df.loc[pd.IndexSlice[('c', 'u')]]

Note
At this point, you may run into a PerformanceWarning that looks like this:

PerformanceWarning: indexing past lexsort depth may impact performance.

This just means that your index is not sorted. pandas depends on the index being sorted (in this case, lexicographically, since we are dealing with string values) for optimal search and retrieval. A quick fix would be to sort your DataFrame in advance using DataFrame.sort_index. This is especially desirable from a performance standpoint if you plan on doing multiple such queries in tandem:

df_sort = df.sort_index()
df_sort.loc[('c', 'u')]

You can also use MultiIndex.is_lexsorted() to check whether the index is sorted or not. This function returns True or False accordingly. You can call this function to determine whether an additional sorting step is required or not.

With xs, this is again simply passing a single tuple as the first argument, with all other arguments set to their appropriate defaults:

df.xs(('c', 'u'))
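A runnable sketch of the sort-then-slice pattern (note: on recent pandas versions MultiIndex.is_lexsorted is deprecated; index.is_monotonic_increasing is used below as a close stand-in, and the tiny frame is invented for illustration):

```python
import pandas as pd

# Deliberately unsorted MultiIndex
mux = pd.MultiIndex.from_arrays(
    [list('cabb'), list('uwtu')], names=['one', 'two'])
df = pd.DataFrame({'col': [9, 3, 4, 5]}, index=mux)

print(df.index.is_monotonic_increasing)  # False: unsorted index
df_sort = df.sort_index()
print(df_sort.index.is_monotonic_increasing)  # True
print(df_sort.loc[('c', 'u'), 'col'])  # 9, with no lexsort warning
```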
With query, things become a bit clunky:

df.query("one == 'c' and two == 'u'")

You can see now that this is going to be relatively difficult to generalize. But it is still OK for this particular problem.

With accesses spanning multiple levels, get_level_values can still be used, but is not recommended:

m1 = (df.index.get_level_values('one') == 'c')
m2 = (df.index.get_level_values('two') == 'u')
df[m1 & m2]
Question 4
How do I select the two rows corresponding to ('c', 'u'), and ('a', 'w')?

         col
one two
c   u      9
a   w      3

With loc, this is still as simple as:

df.loc[[('c', 'u'), ('a', 'w')]]
# df.loc[pd.IndexSlice[[('c', 'u'), ('a', 'w')]]]
With query, you will need to dynamically generate a query string by iterating over your cross sections and levels:

cses = [('c', 'u'), ('a', 'w')]
levels = ['one', 'two']

# This is a useful check to make in advance.
assert all(len(levels) == len(cs) for cs in cses)

query = '(' + ') or ('.join([
    ' and '.join([f"({l} == {repr(c)})" for l, c in zip(levels, cs)])
    for cs in cses
]) + ')'

print(query)
# ((one == 'c') and (two == 'u')) or ((one == 'a') and (two == 'w'))

df.query(query)
100% DO NOT RECOMMEND! But it is possible.

What if I have multiple levels?
One option in this scenario would be to use droplevel to drop the levels you're not checking, then use isin to test membership, and then boolean index on the final result.

df[df.index.droplevel(unused_level).isin([('c', 'u'), ('a', 'w')])]
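A runnable sketch of the droplevel + isin idea, using a hypothetical three-level frame (unused_level above is a placeholder for whichever level(s) you drop; here it is the level named 'three'):

```python
import pandas as pd

# Hypothetical three-level index
mux = pd.MultiIndex.from_tuples(
    [('c', 'u', 0), ('a', 'w', 1), ('b', 't', 2)],
    names=['one', 'two', 'three'])
df = pd.DataFrame({'col': [9, 3, 4]}, index=mux)

# Drop the level we are not matching on, then test tuple membership
mask = df.index.droplevel('three').isin([('c', 'u'), ('a', 'w')])
print(df[mask])  # rows ('c', 'u', 0) and ('a', 'w', 1)
```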
Question 5
How can I retrieve all rows corresponding to "a" in level "one" or "t" in level "two"?

         col
one two
a   t      0
    u      1
    v      2
    w      3
b   t      4
    t      8
d   t     12

This is actually very difficult to do with loc while ensuring correctness and still maintaining code clarity. df.loc[pd.IndexSlice['a', 't']] is incorrect; it is interpreted as df.loc[pd.IndexSlice[('a', 't')]] (i.e., selecting a cross section). You may think of a solution with pd.concat to handle each label separately:

pd.concat([
    df.loc[['a'], :], df.loc[pd.IndexSlice[:, 't'], :]
])
         col
one two
a   t      0
    u      1
    v      2
    w      3
    t      0 # Does this look right to you? No, it isn't!
b   t      4
    t      8
d   t     12
But you'll notice one of the rows is duplicated. This is because that row satisfied both slicing conditions, and so appeared twice. You will instead need to do

v = pd.concat([
    df.loc[['a'], :], df.loc[pd.IndexSlice[:, 't'], :]
])
v[~v.index.duplicated()]

But if your DataFrame inherently contains duplicate indices (that you want), then this will not retain them. Use with extreme caution.

With query, this is stupidly simple:

df.query("one == 'a' or two == 't'")
With get_level_values, this is still simple, but not as elegant:

m1 = (df.index.get_level_values('one') == 'a')
m2 = (df.index.get_level_values('two') == 't')
df[m1 | m2]

Question 6
This is a special case that I've added to help understand the applicability of the Four Idioms: it is one case where none of them will work effectively, since the slicing is very specific and does not follow any real pattern.

How can I slice specific cross sections? For "a" and "b", I would like to select all rows with sub-levels "u" and "v", and for "d", I would like to select rows with sub-level "w".

         col
one two
a   u      1
    v      2
b   u      5
    v      6
d   w     11
    w     15
Usually, slicing problems like this will require explicitly passing a list of keys to loc. One way of doing this is with:

keys = [('a', 'u'), ('a', 'v'), ('b', 'u'), ('b', 'v'), ('d', 'w')]
df.loc[keys, :]

If you want to save some typing, you will recognise that there is a pattern to slicing "a", "b" and their sublevels, so we can separate the slicing task into two portions and concat the result:

pd.concat([
    df.loc[(('a', 'b'), ('u', 'v')), :],
    df.loc[('d', 'w'), :]
], axis=0)

The slicing specification for "a" and "b" is slightly cleaner ((('a', 'b'), ('u', 'v'))) because the sub-levels being indexed are the same for each level.
Question 7
How do I get all rows where values in level "two" are greater than 5?

         col
one two
b   7      4
    9      5
c   7     10
d   6     11
    8     12
    8     13
    6     15

This can be done using query,

df2.query("two > 5")

And get_level_values,

df2[df2.index.get_level_values('two') > 5]
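df2 is not constructed in this excerpt. One reconstruction consistent with the output above has an integer second level; the entries not visible in the filtered output (those <= 5) are guesses here, chosen only so the example runs:

```python
import numpy as np
import pandas as pd

# Reconstructed df2: second level is integer; values <= 5 are arbitrary guesses
mux2 = pd.MultiIndex.from_arrays([
    list('aaaabbbbbccddddd'),
    [1, 2, 3, 4, 7, 9, 1, 2, 3, 4, 7, 6, 8, 8, 5, 6]
], names=['one', 'two'])
df2 = pd.DataFrame({'col': np.arange(len(mux2))}, index=mux2)

print(df2.query("two > 5"))  # the seven rows shown in the table above
```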
Note
Similar to this example, we can filter based on any arbitrary condition using these constructs. In general, it is useful to remember that loc and xs are specifically for label-based indexing, while query and get_level_values are helpful for building general conditional masks for filtering.

Bonus Question
What if I need to slice a MultiIndex column?

Actually, most solutions here are applicable to columns as well, with minor changes. Consider:

np.random.seed(0)
mux3 = pd.MultiIndex.from_product([
    list('ABCD'), list('efgh')
], names=['one', 'two'])
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)
print(df3)
one A B C D
two e f g h e f g h e f g h e f g h
0 5 0 3 3 7 9 3 5 2 4 7 6 8 8 1 6
1 7 7 8 1 5 9 8 9 4 3 0 3 5 0 2 3
2 8 1 3 3 3 7 0 1 9 9 0 4 7 3 2 7
These are the changes you will need to make to the Four Idioms to have them working with columns.

To slice with loc, use

df3.loc[:, ....] # Notice how we slice across the index with `:`.

or,

df3.loc[:, pd.IndexSlice[...]]

To use xs, just pass the argument axis=1.

You can access the column level values directly using df.columns.get_level_values. You will then need to do something like

df.loc[:, {condition}]

Where {condition} represents some condition built using columns.get_level_values.

To use query, your only option is to transpose, query on the index, and transpose again:

df3.T.query(...).T

Not recommended; use one of the other 3 options.
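Concrete versions of the column-axis variants, rebuilding df3 as above so the snippet stands alone (selecting the "B" columns is my own example, not from the answer):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
mux3 = pd.MultiIndex.from_product(
    [list('ABCD'), list('efgh')], names=['one', 'two'])
df3 = pd.DataFrame(np.random.choice(10, (3, len(mux3))), columns=mux3)

# loc with an IndexSlice on the column axis
a = df3.loc[:, pd.IndexSlice['B', :]]
# xs on axis=1
b = df3.xs('B', axis=1, level=0, drop_level=False)
# boolean mask built from the column level values
c = df3.loc[:, df3.columns.get_level_values('one') == 'B']

assert a.equals(b) and a.equals(c)
```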
The right way to query a pandas MultiIndex
I think what you did is fine, but there are alternative ways also.
>>> df = pd.DataFrame({
'stock':np.repeat( ['AAPL','GOOG','YHOO'], 3 ),
'date':np.tile( pd.date_range('5/5/2015', periods=3, freq='D'), 3 ),
'price':(np.random.randn(9).cumsum() + 10) })
>>> df = df.set_index(['stock','date'])
price
stock date
AAPL 2015-05-05 8.538459
2015-05-06 9.330140
2015-05-07 8.968898
GOOG 2015-05-05 8.964389
2015-05-06 9.828230
2015-05-07 9.992985
YHOO 2015-05-05 9.929548
2015-05-06 9.330295
2015-05-07 10.676468
A slightly more standard way than using loc twice (df.loc['AAPL'].loc['2015-05-05']) would be to do:

>>> df.loc['AAPL','2015-05-05']
price    8.538459
Name: (AAPL, 2015-05-05 00:00:00), dtype: float64
And instead of xs you could use an IndexSlice. I think for 2 levels xs is easier, but IndexSlice might be better past 2 levels.

>>> idx = pd.IndexSlice
>>> df.loc[ idx[:,'2015-05-05'], : ]
price
stock date
AAPL 2015-05-05 8.538459
GOOG 2015-05-05 8.964389
YHOO 2015-05-05 9.929548
And to be honest, I think the absolute easiest way here is to use either date or stock (or neither) as the index, and then most selections are very straightforward. For example, if you remove the index completely you can effortlessly select by date:

>>> df = df.reset_index()
>>> df[ df['date']=='2015-05-05' ]
index stock date price
0 0 AAPL 2015-05-05 8.538459
3 3 GOOG 2015-05-05 8.964389
6 6 YHOO 2015-05-05 9.929548
Doing some quickie timings with 3 stocks and 3000 dates (= 9000 rows), I found that a simple boolean selection (no index) was about 35% faster than xs, and xs was about 35% faster than using IndexSlice. But see Jeff's comment below: you should expect the boolean selection to perform relatively worse with more rows. Of course, the best thing for you to do is test on your own data and see how it comes out.
selecting from multi-index pandas
One way is to use the get_level_values
Index method:
In [11]: df
Out[11]:
0
A B
1 4 1
2 5 2
3 6 3
In [12]: df.iloc[df.index.get_level_values('A') == 1]
Out[12]:
0
A B
1 4 1
In 0.13 you'll be able to use xs with the drop_level argument:

df.xs(1, level='A', drop_level=False) # axis=1 if columns

Note: if this were a column MultiIndex rather than an index, you could use the same technique:

In [21]: df1 = df.T
In [22]: df1.iloc[:, df1.columns.get_level_values('A') == 1]
Out[22]:
A 1
B 4
0 1
pandas dataframe select columns in multiindex
There is a get_level_values
method that you can use in conjunction with boolean indexing to get the intended result.
In [13]:
df = pd.DataFrame(np.random.random((4,4)))
df.columns = pd.MultiIndex.from_product([[1,2],['A','B']])
print df
1 2
A B A B
0 0.543980 0.628078 0.756941 0.698824
1 0.633005 0.089604 0.198510 0.783556
2 0.662391 0.541182 0.544060 0.059381
3 0.841242 0.634603 0.815334 0.848120
In [14]:
print df.iloc[:, df.columns.get_level_values(1)=='A']
1 2
A A
0 0.543980 0.756941
1 0.633005 0.198510
2 0.662391 0.544060
3 0.841242 0.815334
Querying MultiIndex DataFrame in Pandas
Use merge
with parameter left_index
and right_on
:
df = FirstDF.merge(SecondDF, left_index=True, right_on=['A','B'])['C'].to_frame()
print (df)
C
0 59
1 56
2 80
Another solution with isin of MultiIndexes and filtering by boolean indexing:

mask = FirstDF.index.isin(SecondDF.set_index(['A','B']).index)
#alternative solution
#mask = FirstDF.index.isin(list(map(tuple,SecondDF[['A','B']].values.tolist())))
df = FirstDF.loc[mask, ['C']].reset_index(drop=True)
print (df)
C
0 59
1 56
2 80
Detail:

print (FirstDF.loc[mask, ['C']])
C
A B
'a' 'green' 59
'b' 'red' 56
'c' 'green' 80
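A self-contained reconstruction of the two frames, with names and values inferred from the printed outputs (the quote characters shown around the index values suggest the original data contained literal quotes; plain strings are used here for simplicity):

```python
import pandas as pd

FirstDF = pd.DataFrame(
    {'A': ['a', 'a', 'b', 'c', 'c'],
     'B': ['blue', 'green', 'red', 'green', 'orange'],
     'C': [43, 59, 56, 80, 72]}).set_index(['A', 'B'])
SecondDF = pd.DataFrame(
    {'A': ['a', 'b', 'c'], 'B': ['green', 'red', 'green']})

# merge the MultiIndex of FirstDF against the A/B columns of SecondDF
df = FirstDF.merge(SecondDF, left_index=True, right_on=['A', 'B'])['C'].to_frame()
print(df)

# equivalent isin-based mask
mask = FirstDF.index.isin(SecondDF.set_index(['A', 'B']).index)
print(FirstDF.loc[mask, 'C'].tolist())  # [59, 56, 80]
```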
EDIT: You can use merge with an outer join and the indicator=True parameter, then filter by boolean indexing:
df1=FirstDF.merge(SecondDF, left_index=True, right_on=['A','B'], indicator=True, how='outer')
print (df1)
C A B _merge
2 43 'a' 'blue' left_only
0 59 'a' 'green' both
1 56 'b' 'red' both
2 80 'c' 'green' both
2 72 'c' 'orange' left_only
mask = df1['_merge'] != 'both'
df1 = df1.loc[mask, ['C']].reset_index(drop=True)
print (df1)
C
0 43
1 72
For the second solution, invert the boolean mask with ~:

mask = FirstDF.index.isin(SecondDF.set_index(['A','B']).index)
#alternative solution
#mask = FirstDF.index.isin(list(map(tuple,SecondDF[['A','B']].values.tolist())))
df = FirstDF.loc[~mask, ['C']].reset_index(drop=True)
print (df)
C
0 43
1 72
Pandas multiindex: select by condition
Use tuples in DataFrame.loc:

a = df.loc[df[("cat2", "vals2")] == 7, ('cat1', 'vals1')]
print (a)

Finally, if you need a scalar from a one-element Series:

out = a.iat[0]

If there is possibly no match:

out = next(iter(a), 'no match')
You can also compare a whole slice of columns; the output is then a DataFrame of booleans, so to reduce it to a boolean Series you need to test whether any value per row is True with DataFrame.any, or whether all are True with DataFrame.all:

m = df.loc[:, ("cat2", slice(None))]==7
a = df.loc[m.any(axis=1), ("cat1", "vals2")]
print (a)
a d 5
Name: (cat1, vals2), dtype: int32
m = df.loc[:, ("cat2", slice(None))]==7
df2 = df.loc[m.any(axis=1), ("cat1", slice(None))]
print (df2)
cat1
vals1 vals2
a d 4 5
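The frame used in this answer is not shown; a minimal reconstruction consistent with the printed outputs (the second row is invented so the filter actually discards something) is:

```python
import pandas as pd

cols = pd.MultiIndex.from_product([['cat1', 'cat2'], ['vals1', 'vals2']])
df = pd.DataFrame([[4, 5, 6, 7], [1, 2, 3, 4]],
                  index=pd.MultiIndex.from_tuples([('a', 'd'), ('b', 'e')]),
                  columns=cols)

# select ('cat1', 'vals1') where ('cat2', 'vals2') equals 7
a = df.loc[df[("cat2", "vals2")] == 7, ('cat1', 'vals1')]
print(a.tolist())  # [4]
```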