Python: Pandas Series - Why use loc?
Explicit is better than implicit.
df[boolean_mask] selects rows where boolean_mask is True, but there is a corner case when you might not want it to: when df has boolean-valued column labels:
In [229]: df = pd.DataFrame({True:[1,2,3],False:[3,4,5]}); df
Out[229]:
False True
0 3 1
1 4 2
2 5 3
You might want to use df[[True]] to select the True column. Instead it raises a ValueError:
In [230]: df[[True]]
ValueError: Item wrong length 1 instead of 3.
Versus using loc:
In [231]: df.loc[[True]]
Out[231]:
False True
0 3 1
In contrast, the following does not raise ValueError even though the structure of df2 is almost the same as df above:
In [258]: df2 = pd.DataFrame({'A':[1,2,3],'B':[3,4,5]}); df2
Out[258]:
A B
0 1 3
1 2 4
2 3 5
In [259]: df2[['B']]
Out[259]:
B
0 3
1 4
2 5
Thus, df[boolean_mask] does not always behave the same as df.loc[boolean_mask]. Even though this is arguably an unlikely use case, I would recommend always using df.loc[boolean_mask] instead of df[boolean_mask] because the meaning of df.loc's syntax is explicit. With df.loc[indexer] you know automatically that df.loc is selecting rows. In contrast, it is not clear if df[indexer] will select rows or columns (or raise ValueError) without knowing details about indexer and df.
df.loc[row_indexer, column_indexer] can select rows and columns. df[indexer] can only select rows or columns depending on the type of values in indexer and the type of column values df has (again, are they boolean?).
In [237]: df2.loc[[True,False,True], 'B']
Out[237]:
0 3
2 5
Name: B, dtype: int64
When a slice is passed to df.loc, the end-points are included in the range. When a slice is passed to df[...], the slice is interpreted as a half-open interval:
In [239]: df2.loc[1:2]
Out[239]:
A B
1 2 4
2 3 5
In [271]: df2[1:2]
Out[271]:
A B
1 2 4
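The slicing difference above can be checked with a small sketch. It rebuilds df2 from the example and assumes the default RangeIndex, so labels and positions coincide; the names by_label and by_position are illustrative only.

```python
import pandas as pd

# Same frame as df2 in the example above (default RangeIndex 0, 1, 2).
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]})

# .loc slices by label and includes both end-points.
by_label = df2.loc[1:2]

# Plain [] with a slice works by position and excludes the stop value.
by_position = df2[1:2]

print(len(by_label), len(by_position))
```

With a non-default index the two results can differ by more than one row, since .loc looks up labels while [] counts positions.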
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?
In the following situations, they behave the same:
- Selecting a single column (df['A'] is the same as df.loc[:, 'A'] -> selects column A)
- Selecting a list of columns (df[['A', 'B', 'C']] is the same as df.loc[:, ['A', 'B', 'C']] -> selects columns A, B and C)
- Slicing by rows (df[1:3] is the same as df.iloc[1:3] -> selects rows 1 and 2. Note, however, if you slice rows with loc, instead of iloc, you'll get rows 1, 2 and 3 assuming you have a RangeIndex. See details here.)
However, [] does not work in the following situations:
- You can select a single row with df.loc[row_label]
- You can select a list of rows with df.loc[[row_label1, row_label2]]
- You can slice columns with df.loc[:, 'A':'C']
These three cannot be done with [].
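As a rough check of the cases listed above, the following sketch compares [] with .loc and .iloc on a small made-up frame. The frame and all variable names are assumptions for illustration, and the default RangeIndex is assumed.

```python
import pandas as pd

# Hypothetical frame with a default RangeIndex (labels 0..3).
df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8], 'C': [9, 10, 11, 12]})

# Cases where [] and .loc / .iloc give the same result.
same_column = df['A'].equals(df.loc[:, 'A'])
same_columns = df[['A', 'B']].equals(df.loc[:, ['A', 'B']])
same_rows = df[1:3].equals(df.iloc[1:3])

# Label-based row slicing includes the end label, so this has one more row.
loc_rows = df.loc[1:3]

# Selections that need .loc.
single_row = df.loc[2]             # one row, returned as a Series
row_list = df.loc[[0, 3]]          # a list of row labels
column_slice = df.loc[:, 'A':'B']  # a slice of column labels
```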
More importantly, if your selection involves both rows and columns, then assignment becomes problematic.
df[1:3]['A'] = 5
This selects rows 1 and 2, then selects column 'A' of the returned object and assigns the value 5 to it. The problem is that the returned object might be a copy, so this may not change the actual DataFrame. pandas raises SettingWithCopyWarning for this. The correct way of making this assignment is (with a default RangeIndex; loc includes the end label, so the slice is 1:2):
df.loc[1:2, 'A'] = 5
With .loc, you are guaranteed to modify the original DataFrame. It also allows you to slice columns (df.loc[:, 'C':'F']), select a single row (df.loc[5]), and select a list of rows (df.loc[[1, 2, 5]]).
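A minimal sketch of the single-step assignment, assuming a made-up frame with a default RangeIndex. Because .loc includes the end label, the label slice 1:2 covers the same rows as the positional slice 1:3. The chained form is left out of the sketch because whether it modifies the original depends on the pandas version and settings.

```python
import pandas as pd

# Hypothetical frame; labels 0..3 from the default RangeIndex.
df = pd.DataFrame({'A': [0, 0, 0, 0], 'B': [1, 2, 3, 4]})

# One indexing operation that names rows and column together,
# so the assignment is applied to df itself.
df.loc[1:2, 'A'] = 5

result = df['A'].tolist()
print(result)
```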
Also note that these two were not included in the API at the same time. .loc was added much later as a more powerful and explicit indexer. See unutbu's answer for more detail.
Note: Getting columns with [] vs . is a completely different topic. Attribute access with . is only there for convenience. It only allows accessing columns whose names are valid Python identifiers (i.e. they cannot contain spaces, they cannot start with a digit...). It cannot be used when the names conflict with Series/DataFrame methods. It also cannot be used to create columns (i.e. the assignment df.a = 1 won't add a column if there is no column a). Other than that, . and [] are the same.
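The limits of attribute access can be seen in a short sketch. The frame and its column names ('my col', 'count') are made up for illustration; 'count' is chosen because DataFrame already has a method with that name.

```python
import warnings
import pandas as pd

# Hypothetical frame: one plain name, one with a space, one clashing with a method.
df = pd.DataFrame({'A': [1, 2], 'my col': [3, 4], 'count': [5, 6]})

# For a plain identifier, attribute access and [] return the same column.
same_column = df.A.equals(df['A'])

# A name with a space can only be reached through [].
spaced = df['my col'].tolist()

# df.count is the DataFrame method, not the 'count' column.
count_is_method = callable(df.count)
count_column = df['count'].tolist()

# Assigning to a new attribute sets a Python attribute, not a column.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # some pandas versions may warn here
    df.a = 1
created_column = 'a' in df.columns
```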
When to use .loc and when not to use (Pandas Dataframe)?
Why is .loc not used in the 2nd criterion?
df = pd.DataFrame({
'col_1':[0,3,0,7,1,0],
'col_2':[0,3,6,9,2,4],
'col3':list('aaabbb')
})
No, that is not the case: it works in both.
print (df.loc[df['col_1']==0])
col_1 col_2 col3
0 0 0 a
2 0 6 a
5 0 4 b
print (df.loc[(df['col_1']==0) & (df['col_2']>0)])
col_1 col_2 col3
2 0 6 a
5 0 4 b
print (df[df['col_1']==0])
col_1 col_2 col3
0 0 0 a
2 0 6 a
5 0 4 b
print (df[(df['col_1']==0) & (df['col_2']>0)])
col_1 col_2 col3
2 0 6 a
5 0 4 b
The reason to use loc is when you also need to select columns by name, e.g. col_2:
print (df.loc[df['col_1']==0, 'col_2'])
0 0
2 6
5 4
Name: col_2, dtype: int64
print (df.loc[(df['col_1']==0) & (df['col_2']>0), 'col_2'])
2 6
5 4
Name: col_2, dtype: int64
If you need to select two or more columns, use a list, e.g. for col_1 and col3:
print (df.loc[df['col_1']==0, ['col_1','col3']])
col_1 col3
0 0 a
2 0 a
5 0 b
print (df.loc[(df['col_1']==0) & (df['col_2']>0), ['col_1','col3']])
col_1 col3
2 0 a
5 0 b
If you omit loc, it fails:
df[df['col_1']==0, 'col_1']
df[(df['col_1']==0) & (df['col_2']>0), 'col_1']
TypeError
Also, why can't we use and in the second code, i.e.
df[(df['col_1']==0) and (df['col_2']>0)]
Because and works on scalars; to combine boolean Series element-wise, pandas uses the bitwise AND operator &. More info is here.
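The difference between & and and can be reproduced with the frame from this answer. This is a sketch only; the names mask, filtered and and_error are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'col_1': [0, 3, 0, 7, 1, 0],
    'col_2': [0, 3, 6, 9, 2, 4],
    'col3': list('aaabbb'),
})

# & combines the two boolean Series element by element.
mask = (df['col_1'] == 0) & (df['col_2'] > 0)
filtered = df.loc[mask, ['col_1', 'col3']]

# `and` asks for the truth value of a whole Series, which pandas refuses.
try:
    df[(df['col_1'] == 0) and (df['col_2'] > 0)]
    and_error = None
except ValueError as exc:
    and_error = type(exc).__name__

print(filtered)
print(and_error)
```

The parentheses around each comparison matter, because & binds more tightly than == and >.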
How are iloc and loc different?
Label vs. Location
The main distinction between the two methods is:
- loc gets rows (and/or columns) with particular labels.
- iloc gets rows (and/or columns) at integer locations.
To demonstrate, consider a series s of characters with a non-monotonic integer index:
>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
49 a
48 b
47 c
0 d
1 e
2 f
>>> s.loc[0] # value at index label 0
'd'
>>> s.iloc[0] # value at index location 0
'a'
>>> s.loc[0:1] # rows at index labels between 0 and 1 (inclusive)
0 d
1 e
>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49 a
Here are some of the differences/similarities between s.loc and s.iloc when passed various objects:
<object> | description | s.loc[<object>] | s.iloc[<object>] |
---|---|---|---|
0 | single item | Value at index label 0 (the string 'd' ) | Value at index location 0 (the string 'a' ) |
0:1 | slice | Two rows (labels 0 and 1 ) | One row (first row at location 0) |
1:47 | slice with out-of-bounds end | Zero rows (empty Series) | Five rows (location 1 onwards) |
1:47:-1 | slice with negative step | Three rows (labels 1 back to 47 ) | Zero rows (empty Series) |
[2, 0] | integer list | Two rows with given labels | Two rows with given locations |
s > 'e' | Bool series (indicating which values have the property) | One row (containing 'f' ) | NotImplementedError |
(s>'e').values | Bool array | One row (containing 'f' ) | Same as loc |
999 | int object not in index | KeyError | IndexError (out of bounds) |
-1 | int object not in index | KeyError | Returns last value in s |
lambda x: x.index[3] | callable applied to series (here returning the index label at position 3) | s.loc[s.index[3]] | s.iloc[s.index[3]] |
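Several rows of the table can be reproduced directly. The sketch below reuses the series s from above; the result variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])

# Single item: label 0 versus position 0.
by_label = s.loc[0]
by_position = s.iloc[0]

# Slice 0:1 is inclusive for labels and half-open for positions.
label_slice = s.loc[0:1].tolist()
position_slice = s.iloc[0:1].tolist()

# Integer list: labels 2 and 0 versus positions 2 and 0.
label_list = s.loc[[2, 0]].tolist()
position_list = s.iloc[[2, 0]].tolist()

# Boolean Series mask works with .loc.
masked = s.loc[s > 'e'].tolist()

# -1 is not a label here, but it is a valid position (the last one).
last_value = s.iloc[-1]
try:
    s.loc[-1]
    missing_label_error = None
except KeyError:
    missing_label_error = 'KeyError'
```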
pandas .at versus .loc
Update: df.get_value is deprecated as of version 0.21.0. Using df.at or df.iat is the recommended method going forward.
df.at can only access a single value at a time.
df.loc can select multiple rows and/or columns.
Note that there is also df.get_value, which may be even quicker at accessing single values:
In [25]: %timeit df.loc[('a', 'A'), ('c', 'C')]
10000 loops, best of 3: 187 µs per loop
In [26]: %timeit df.at[('a', 'A'), ('c', 'C')]
100000 loops, best of 3: 8.33 µs per loop
In [35]: %timeit df.get_value(('a', 'A'), ('c', 'C'))
100000 loops, best of 3: 3.62 µs per loop
Under the hood, df.at[...] calls df.get_value, but it also does some type checking on the keys.
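A small sketch of the scalar accessors next to .loc, using a made-up frame with string row labels (the frame and variable names are assumptions, not the MultiIndex frame timed above). It leaves out df.get_value, which is deprecated as noted in the update.

```python
import pandas as pd

# Hypothetical frame with string row labels.
df = pd.DataFrame({'A': [1, 2, 3], 'B': [3, 4, 5]}, index=['x', 'y', 'z'])

# .at takes exactly one row label and one column label.
single = df.at['y', 'B']

# .iat is the position-based counterpart of .at.
single_by_position = df.iat[1, 1]

# .loc can return the same scalar, but also whole blocks.
same_scalar = df.loc['y', 'B']
block = df.loc[['x', 'z'], ['A', 'B']]

# .at also supports assignment of a single cell.
df.at['z', 'A'] = 30
```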
Pandas, loc vs non loc for boolean indexing
As per the docs, loc accepts a boolean array for selecting rows, and in your case
>>> df['a'] >= 15
0 False
1 False
2 False
3 False
4 False
5 False
6 True
7 True
Name: a, dtype: bool
is treated as a boolean array.
The fact that you can omit loc here and issue df[df['a'] >= 15] is a special case convenience according to Wes McKinney, the author of pandas.
Quoting directly from his book, Python for Data Analysis, p. 144, df[val] is used to...
Select single column or sequence of columns from the DataFrame; special case
conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame
(set values based on some criterion)
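The row-filtering convenience can be sketched as follows. The values in column a are invented so that the mask matches the output shown above (only the last two rows are True); they are not the asker's data.

```python
import pandas as pd

# Hypothetical data: only the last two values are >= 15.
df = pd.DataFrame({'a': [1, 3, 5, 7, 9, 11, 15, 20]})

mask = df['a'] >= 15

# With and without .loc, the boolean Series filters rows.
with_loc = df.loc[mask]
without_loc = df[mask]

same_result = with_loc.equals(without_loc)
print(without_loc)
```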
What would be the syntactical classification of 'loc' and 'iloc' in pandas?
Both loc and iloc are properties of DataFrame and Series rather than methods: each returns an indexer object, which is why they are used with square brackets instead of parentheses.
Both loc and iloc are used to access values from rows and columns within a DataFrame. One can use them to filter and modify values within a DataFrame.
loc - a label-based selection indexer, which means that we have to pass the label of the row or column we want to select. It includes the last element of the range passed to it, unlike iloc.
iloc - an integer position-based selection indexer, which means that we have to pass integer positions to select a specific row/column. It does not include the last element of the range passed to it, unlike loc.
Example:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10,100, (5, 4)), columns = list("ABCD"))
df.loc[1:3, "A":"C"]
Before the comma, the slice selects rows; after the comma, the slice selects columns. Here we have to specify the labels of the rows as well as the columns.
df.iloc[1:3, 1:3]
Before the comma, the slice selects rows; after the comma, the slice selects columns. Here we have to specify the integer positions of the rows as well as the columns.
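One way to check how loc and iloc are exposed, and how the two slices above differ, is the sketch below. It swaps the random values for np.arange so the result is reproducible; that substitution and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random frame above: 5 rows, columns A..D.
df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=list("ABCD"))

# On the class, loc and iloc are properties; on an instance they
# return indexer objects that are then subscripted with [].
loc_is_property = isinstance(pd.DataFrame.loc, property)
iloc_is_property = isinstance(pd.DataFrame.iloc, property)

# Label slices include both ends: rows 1, 2, 3 and columns A, B, C.
by_label = df.loc[1:3, "A":"C"]

# Position slices exclude the stop: rows 1, 2 and columns at positions 1, 2.
by_position = df.iloc[1:3, 1:3]

print(by_label.shape, by_position.shape)
```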