Difference between Using Loc and Using Just Square Brackets to Filter for Columns in Pandas/Python

What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?

In the following situations, they behave the same:

  1. Selecting a single column (df['A'] is the same as df.loc[:, 'A'] -> selects column A)
  2. Selecting a list of columns (df[['A', 'B', 'C']] is the same as df.loc[:, ['A', 'B', 'C']] -> selects columns A, B and C)
  3. Slicing by rows (df[1:3] is the same as df.iloc[1:3] -> selects rows 1 and 2. Note, however, if you slice rows with loc, instead of iloc, you'll get rows 1, 2 and 3 assuming you have a RangeIndex. See details here.)
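
As a quick check of the three cases above, here is a minimal sketch on a made-up frame with a default RangeIndex (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical frame with a default RangeIndex (labels 0..4)
df = pd.DataFrame({"A": range(5), "B": range(5, 10), "C": range(10, 15)})

col_brackets = df["A"]           # single column via brackets
col_loc = df.loc[:, "A"]         # same column via .loc

cols_brackets = df[["A", "B"]]   # list of columns via brackets
cols_loc = df.loc[:, ["A", "B"]]

rows_brackets = df[1:3]          # positional slice: rows 1 and 2
rows_iloc = df.iloc[1:3]         # same rows
rows_loc = df.loc[1:3]           # label slice: rows 1, 2 and 3 (end label included)
```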

However, [] does not work in the following situations:

  1. You can select a single row with df.loc[row_label]
  2. You can select a list of rows with df.loc[[row_label1, row_label2]]
  3. You can slice columns with df.loc[:, 'A':'C']

These three cannot be done with [].
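
A small sketch of those three .loc-only selections, using a hypothetical frame with string row labels so that labels and positions cannot be confused:

```python
import pandas as pd

# Hypothetical data; the row labels 'x', 'y', 'z' are made up for illustration
df = pd.DataFrame(
    {"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9], "D": [10, 11, 12]},
    index=["x", "y", "z"],
)

one_row = df.loc["y"]            # Series holding row 'y'
two_rows = df.loc[["x", "z"]]    # DataFrame with rows 'x' and 'z'
col_slice = df.loc[:, "A":"C"]   # columns A through C, end label included

# Plain brackets look a single label up among the columns instead
try:
    df["y"]
    bracket_lookup_failed = False
except KeyError:
    bracket_lookup_failed = True
```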
More importantly, if your selection involves both rows and columns, then assignment becomes problematic.

df[1:3]['A'] = 5

This selects rows 1 and 2, then selects column 'A' of the returned object and assigns the value 5 to it. The problem is that the returned object might be a copy, so this may not change the actual DataFrame. pandas typically flags this with a SettingWithCopyWarning. The correct way of making this assignment is a single .loc call (note that .loc includes the end label, so with a RangeIndex 1:2 covers rows 1 and 2):

df.loc[1:2, 'A'] = 5

With .loc, you are guaranteed to modify the original DataFrame. It also allows you to slice columns (df.loc[:, 'C':'F']), select a single row (df.loc[5]), and select a list of rows (df.loc[[1, 2, 5]]).
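
A minimal sketch of the single-call assignment, on a made-up frame. Whether the chained form df[1:3]['A'] = 5 updates the original depends on the pandas version and settings, so only the .loc form is shown here:

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({"A": [0, 0, 0, 0, 0], "B": [1, 2, 3, 4, 5]})

# One indexing call on the original frame: rows labelled 1 and 2, column 'A'.
# .loc slices by label and includes the end label, so 1:2 covers the same
# rows as the positional slice df[1:3].
df.loc[1:2, "A"] = 5
```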

Also note that these two were not included in the API at the same time. .loc was added much later as a more powerful and explicit indexer. See unutbu's answer for more detail.


Note: Getting columns with [] vs . is a completely different topic. . is only there for convenience. It only allows accessing columns whose names are valid Python identifiers (i.e. they cannot contain spaces, they cannot be composed of numbers...). It cannot be used when the names conflict with Series/DataFrame methods. It also cannot be used for non-existing columns (i.e. the assignment df.a = 1 won't work if there is no column a). Other than that, . and [] are the same.
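
To make the limits of dot access concrete, here is a short sketch with hypothetical column names chosen to hit each case:

```python
import pandas as pd

# Hypothetical column names chosen to show each limitation
df = pd.DataFrame({"price": [1, 2], "unit cost": [3, 4], "count": [5, 6]})

same = df.price.equals(df["price"])   # identifier-like name: both forms work

spaced = df["unit cost"]              # a name with a space needs brackets
count_attr = df.count                 # the DataFrame.count method, not the column
count_col = df["count"]               # the column itself
```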

performance using loc vs simply using inside square brackets

The second is a bit faster, which makes sense to me: the first solution combines DataFrame.loc with boolean indexing, while the second uses boolean indexing only:

import numpy as np
import pandas as pd

np.random.seed(2021)
table = pd.DataFrame(np.random.rand(10**7, 5), columns=list('abcde'))
table['some_col'] = table.a > 0.6

In [130]: %timeit table.loc[table.some_col==True, :]
258 ms ± 2.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [131]: %timeit df = table[table.some_col==True]
241 ms ± 1.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
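
Apart from the timing difference, the two forms should return the same rows. A small sketch on a much smaller, made-up table (the name small is an assumption; == True is dropped because the column is already boolean):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2021)
small = pd.DataFrame(rng.random((1000, 5)), columns=list("abcde"))
small["some_col"] = small.a > 0.6

via_loc = small.loc[small.some_col, :]   # .loc plus boolean mask
via_brackets = small[small.some_col]     # boolean mask only
```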

accessing pandas columns with loc and square brackets comparison element wise

NaNs are not equal to themselves. See: Why is NaN not equal to NaN?

To check whether energyDF.loc[:,'Wind'] and energyDF['Wind'] hold the same data, you could fillna both sides with a value (preferably one that doesn't occur in the series) and then check that both are indeed identical.

As an example:

>>> df
    ID Col1
0  1.0   AD
1  NaN   BC
2  3.0   CE
>>> (df.loc[:, 'ID'] == df['ID']).all()
False
>>> (df.loc[:, 'ID'].fillna("Non-existent") == df['ID'].fillna("Non-existent")).all()
True
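
As an alternative to fillna, Series.equals treats NaNs in the same position as equal. A minimal sketch that rebuilds the example frame above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": [1.0, np.nan, 3.0], "Col1": ["AD", "BC", "CE"]})

# Element-wise comparison: the NaN row compares as False
elementwise = (df.loc[:, "ID"] == df["ID"]).all()

# Series.equals: NaNs in the same location count as equal
same_data = df.loc[:, "ID"].equals(df["ID"])
```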

Pandas filter vs. loc method

.loc[] is a purely label-based indexer (it also accepts boolean arrays). It raises a KeyError when a requested label isn't found and only accepts certain types of input. Unlike filter, it can select on both axes in a single call.

df.filter() subsets the rows or columns of a DataFrame according to labels in the specified index. You can filter along either axis, and you can match labels in more flexible ways than with loc: by substring (like=) or by regular expression (regex=).

filter will return the same type of object as the caller, whereas loc with a single label will return the value specified by that label (so a Series if the caller is a DataFrame, a scalar if the caller is a Series).

In short, .loc is for accessing a specific item within the caller, .filter() is for applying a filter to the caller and returning only items which match that filter.
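
A short sketch contrasting the two, on a hypothetical frame (all column and row names are made up):

```python
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame(
    {"price_usd": [1, 2], "price_eur": [3, 4], "qty": [5, 6]},
    index=["a", "b"],
)

by_like = df.filter(like="price")               # column names containing 'price'
by_regex = df.filter(regex=r"_eur$")            # column names matching a regex
by_items = df.filter(items=["qty", "missing"])  # unknown names are skipped

# .loc with a list containing an unknown label raises instead
try:
    df.loc[:, ["qty", "missing"]]
    loc_raised = False
except KeyError:
    loc_raised = True

single = df.loc["a"]   # one row label: a Series, not a DataFrame
```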

Is loc an optional attribute when searching dataframe?

From the pandas.DataFrame.loc documentation:

Access a group of rows and columns by label(s) or a boolean array.

.loc[] is primarily label based, but may also be used with a boolean
array.

When you filter with a boolean array, .loc is optional. In your example, df['MRP'] > 1500 gives a boolean Series, so it's not necessary to use .loc in that case.

df[df['MRP']>15]
   MRP cat
0   18   A
3   19   D
6   18   C

But if you want to access some other columns where this Boolean Series has True value, then you may use .loc:

df.loc[df['MRP']>15, 'cat']
0    A
3    D
6    C

Or, if you want to change the values where the condition is True:

df.loc[df['MRP']>15, 'cat'] = 'found'
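
Put together as a runnable sketch. The frame below is a hypothetical reconstruction: only rows 0, 3 and 6 come from the output above, and the remaining values are made up:

```python
import pandas as pd

# Hypothetical data; only rows 0, 3 and 6 match the output shown above
df = pd.DataFrame({"MRP": [18, 12, 10, 19, 9, 11, 18], "cat": list("ABCDEFC")})

mask = df["MRP"] > 15          # boolean Series

rows = df[mask]                # .loc is optional for plain row filtering
cats = df.loc[mask, "cat"]     # rows from the mask, one column

df.loc[mask, "cat"] = "found"  # assignment through a single .loc call
```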

Accessing Pandas column using squared brackets vs using a dot (like an attribute)

The "dot notation", i.e. df.col2, is attribute access that's exposed as a convenience.

The pandas documentation describes it like this: "You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute." (Panel has since been removed from pandas.)

df['col2'] does the same: it returns a pd.Series of the column.

A few caveats about attribute access:

  • you cannot add a column (df.new_col = x won't work; worse, it will create a new attribute on the object rather than a column - think monkey-patching here)
  • it won't work if you have spaces in the column name or if the column name is an integer.
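
Both caveats can be seen in a small sketch (the column names and the value 10 are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with one string-named and one integer-named column
df = pd.DataFrame({"col2": [1, 2, 3], 0: [4, 5, 6]})

df.new_col = 10        # sets a plain Python attribute on the object
df["real_col"] = 10    # adds an actual column

int_col = df[0]        # integer-named column: brackets only
```

After this runs, df.columns contains 'real_col' but not 'new_col'.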

I'm having a hard time understanding what pandas.DataFrame.loc does in this line of code

The first line converts the symboling column to object and replaces it in the dataframe.

On the left-hand side, data.loc[:,'symboling'] selects all rows (the : part is a slice) and the symboling column.

loc is probably being used here to avoid a SettingWithCopy warning, which might occur if the author had written:

data['symboling'] = data['symboling'].astype('object')
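
A minimal sketch of both pieces, using a made-up stand-in for the dataset in the question. Whether the .loc form of the assignment also changes the column's dtype may depend on the pandas version, so only the plain-bracket assignment is shown, and it is worth checking data.dtypes afterwards:

```python
import pandas as pd

# Hypothetical stand-in for the dataset in the question
data = pd.DataFrame({"symboling": [3, 1, -1], "make": ["audi", "bmw", "volvo"]})

selected = data.loc[:, "symboling"]   # all rows, one column: a Series

# Plain assignment replaces the column, so the new dtype is kept
data["symboling"] = data["symboling"].astype("object")
```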

See also: What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?

What would be the syntactical classification of 'loc' and 'iloc' in pandas?

Neither loc nor iloc is an ordinary method. Both are properties of DataFrame and Series that return an indexer object, which is why they are used with square brackets rather than parentheses.

Both are used to access values from rows and columns within a DataFrame, and to filter and modify those values.

loc - .loc[] is a label-based selector, which means that we have to pass the name (label) of the row or column which we want to select. A slice passed to it includes its last element, unlike .iloc[].

iloc - .iloc[] is an integer position-based selector, which means that we have to pass integer positions to select a specific row/column. A slice passed to it does not include its last element, unlike .loc[].

Example:

import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10,100, (5, 4)), columns = list("ABCD"))

df.loc[1:3, "A":"C"]

Before the comma, the slice selects rows; after the comma, it selects columns. Here we have to specify the labels of the rows as well as the columns, and both end labels are included.

df.iloc[1:3, 1:3] 

Before the comma, the slice selects rows; after the comma, it selects columns. Here we have to specify the integer positions of the rows as well as the columns, and the end positions are excluded.
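
The same example as a self-contained sketch, with the resulting shapes made explicit and a check of how loc and iloc are defined on the class (the variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, 100, (5, 4)), columns=list("ABCD"))

# On the class, loc and iloc are properties; on an instance they return indexer objects
loc_is_property = isinstance(pd.DataFrame.loc, property)
iloc_is_property = isinstance(pd.DataFrame.iloc, property)

by_label = df.loc[1:3, "A":"C"]    # rows 1, 2, 3 and columns A, B, C
by_position = df.iloc[1:3, 1:3]    # rows 1, 2 and columns B, C
```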


