Selection with .loc in python
pd.DataFrame.loc
can take one or two indexers. For the rest of the post, I'll represent the first indexer as i
and the second indexer as j
.
If only one indexer is provided, it applies to the index of the dataframe and the missing indexer is assumed to represent all columns. So the following two examples are equivalent.
df.loc[i]
df.loc[i, :]
Where :
is used to represent all columns.
If both indexers are present, i
references index values and j
references column values.
Now we can focus on what types of values i
and j
can assume. Let's use the following dataframe df
as our example:
df = pd.DataFrame([[1, 2], [3, 4]], index=['A', 'B'], columns=['X', 'Y'])
loc
has been written such that i
and j
can be
scalars that should be values in the respective index objects
df.loc['A', 'Y']
2arrays whose elements are also members of the respective index object (notice that the order of the array I pass to
loc
is respecteddf.loc[['B', 'A'], 'X']
B 3
A 1
Name: X, dtype: int64Notice the dimensionality of the return object when passing arrays.
i
is an array as it was above,loc
returns an object in which an index with those values is returned. In this case, becausej
was a scalar,loc
returned apd.Series
object. We could've manipulated this to return a dataframe if we passed an array fori
andj
, and the array could've have just been a single value'd array.df.loc[['B', 'A'], ['X']]
X
B 3
A 1
boolean arrays whose elements are
True
orFalse
and whose length matches the length of the respective index. In this case,loc
simply grabs the rows (or columns) in which the boolean array isTrue
.df.loc[[True, False], ['X']]
X
A 1
In addition to what indexers you can pass to loc
, it also enables you to make assignments. Now we can break down the line of code you provided.
iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
iris_data['class'] == 'versicolor'
returns a boolean array.class
is a scalar that represents a value in the columns object.iris_data.loc[iris_data['class'] == 'versicolor', 'class']
returns apd.Series
object consisting of the'class'
column for all rows where'class'
is'versicolor'
When used with an assignment operator:
iris_data.loc[iris_data['class'] == 'versicolor', 'class'] = 'Iris-versicolor'
We assign
'Iris-versicolor'
for all elements in column'class'
where'class'
was'versicolor'
Python Pandas - using .loc to select with AND and OR on multiple columns
You should use &
instead of and
as well as wrap parentheses around each condition. Formatting on new lines with everything lining up also helps prevent mistakes with parentheses:
dfWater1 = left_merged.loc[((left_merged.BVG_2M.isin(['34'])) &
(left_merged.VHC_SC.isin(['6. Nil veg'])) &
(left_merged.wetland.isin(['Estuarine wetlands (e.g. mangroves).', 'Lacustrine wetland (e.g. lake).'])))
| (left_merged.RE.isin(['water', 'reef', 'ocean', 'estuary', 'canal']))].copy()
Pandas - row selection with column operation & .loc[]
You can directly add do the conditional check on the entire column in pandas. Here is a demonstration:
df = pd.DataFrame({"Country": ['Afghanistan', 'Algeria', 'Argentina'],
"01 !": [0, 0, 1],
"02 !": [5, 2, 8],
"03 !": [18, 24, 28],
"Total": [23, 26, 38]})
print(df["01 !"] + df["02 !"] + df["03 !"] == df["Total"])
Output:
0 True
1 True
2 False
You can also add the check output as a new column in your data frame by doing this:
df["check"] = df["01 !"] + df["02 !"] + df["03 !"] == df["Total"]
Then your DataFrame will look like this:
Country 01 ! 02 ! 03 ! Total check
0 Afghanistan 0 5 18 23 True
1 Algeria 0 2 24 26 True
2 Argentina 1 8 28 38 False
Use of panda .loc function in order to select a specific data within a column
df = df_atp.loc[df_atp["Date"].between("01/01/2016", "31/12/2017"), ['Winner', 'Wsets']]
df = df[df['Winner'] == 'Federer R.']
print(df)
is the most readable way to do it. You could also do
df = df_atp.loc[df_atp["Date"].between("01/01/2016", "31/12/2017") & (df['Winner'] == 'Federer R.'), ['Winner', 'Wsets']]
to do it in one line, but I'd favor the first approach for legibility.
Select rows from a DataFrame using .loc and multiple conditions and then show the row corresponding to the min/max of one column
Use this:
for columns:
df.loc[(df['A'] == True)&(df['B'] == 'Tuesday')].apply(max, axis=0)
for rows:
df.loc[(df['A'] == True)&(df['B'] == 'Tuesday')].apply(max, axis=1)
Selecting with complex criteria from pandas.DataFrame
Sure! Setup:
>>> import pandas as pd
>>> from random import randint
>>> df = pd.DataFrame({'A': [randint(1, 9) for x in range(10)],
'B': [randint(1, 9)*10 for x in range(10)],
'C': [randint(1, 9)*100 for x in range(10)]})
>>> df
A B C
0 9 40 300
1 9 70 700
2 5 70 900
3 8 80 900
4 7 50 200
5 9 30 900
6 2 80 700
7 2 80 400
8 5 80 300
9 7 70 800
We can apply column operations and get boolean Series objects:
>>> df["B"] > 50
0 False
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
9 True
Name: B
>>> (df["B"] > 50) & (df["C"] == 900)
0 False
1 False
2 True
3 True
4 False
5 False
6 False
7 False
8 False
9 False
[Update, to switch to new-style .loc
]:
And then we can use these to index into the object. For read access, you can chain indices:
>>> df["A"][(df["B"] > 50) & (df["C"] == 900)]
2 5
3 8
Name: A, dtype: int64
but you can get yourself into trouble because of the difference between a view and a copy doing this for write access. You can use .loc
instead:
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"]
2 5
3 8
Name: A, dtype: int64
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"].values
array([5, 8], dtype=int64)
>>> df.loc[(df["B"] > 50) & (df["C"] == 900), "A"] *= 1000
>>> df
A B C
0 9 40 300
1 9 70 700
2 5000 70 900
3 8000 80 900
4 7 50 200
5 9 30 900
6 2 80 700
7 2 80 400
8 5 80 300
9 7 70 800
Note that I accidentally typed == 900
and not != 900
, or ~(df["C"] == 900)
, but I'm too lazy to fix it. Exercise for the reader. :^)
Python's `.loc` is really slow on selecting subsets of Data
One idea is use Index.isin
with itertools.product
in boolean indexing
:
from itertools import product
idx_sub = tuple(product(Y, T))
dict_sub = df.loc[df.index.isin(idx_sub),'Value'].to_dict()
print (dict_sub)
How are iloc and loc different?
Label vs. Location
The main distinction between the two methods is:
loc
gets rows (and/or columns) with particular labels.iloc
gets rows (and/or columns) at integer locations.
To demonstrate, consider a series s
of characters with a non-monotonic integer index:
>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
49 a
48 b
47 c
0 d
1 e
2 f
>>> s.loc[0] # value at index label 0
'd'
>>> s.iloc[0] # value at index location 0
'a'
>>> s.loc[0:1] # rows at index labels between 0 and 1 (inclusive)
0 d
1 e
>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49 a
Here are some of the differences/similarities between s.loc
and s.iloc
when passed various objects:
<object> | description | s.loc[<object>] | s.iloc[<object>] |
---|---|---|---|
0 | single item | Value at index label 0 (the string 'd' ) | Value at index location 0 (the string 'a' ) |
0:1 | slice | Two rows (labels 0 and 1 ) | One row (first row at location 0) |
1:47 | slice with out-of-bounds end | Zero rows (empty Series) | Five rows (location 1 onwards) |
1:47:-1 | slice with negative step | three rows (labels 1 back to 47 ) | Zero rows (empty Series) |
[2, 0] | integer list | Two rows with given labels | Two rows with given locations |
s > 'e' | Bool series (indicating which values have the property) | One row (containing 'f' ) | NotImplementedError |
(s>'e').values | Bool array | One row (containing 'f' ) | Same as loc |
999 | int object not in index | KeyError | IndexError (out of bounds) |
-1 | int object not in index | KeyError | Returns last value in s |
lambda x: x.index[3] | callable applied to series (here returning 3rd item in index) | s.loc[s.index[3]] | s.iloc[s.index[3]] |
Related Topics
Python Urllib2 Basic Auth Problem
Differencebetween Encode/Decode
Why Does My Pandas Dataframe Not Display New Order Using 'Sort_Values'
Comparing Previous Row Values in Pandas Dataframe
How to Check If Code Is Executed in the Ipython Notebook
It Is More Efficient to Use If-Return-Return or If-Else-Return
How to Automatically Fix an Invalid JSON String
Streaming Data with Python and Flask
How to Save and Load Numpy.Array() Data Properly
Add Params to Given Url in Python
Imports in _Init_.Py and 'Import As' Statement
Two Values from One Input in Python
N-Grams in Python, Four, Five, Six Grams