Accessing a pandas column using square brackets vs. using a dot (like an attribute)
The "dot notation", i.e. df.col2, is the attribute access that's exposed as a convenience. The pandas documentation describes it like this: "You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute."
df['col2'] does the same: it returns a pd.Series of the column.
A few caveats about attribute access:
- you cannot add a column (df.new_col = x won't work; worse, it will silently create a new attribute rather than a column - think monkey-patching here)
- it won't work if you have spaces in the column name or if the column name is an integer.
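The caveats above can be checked with a small sketch. The DataFrame and the column names (col2, "my col", 0, new_col, real_col) are made up for illustration; the warning filter is there because some pandas versions warn when a list-like value is assigned to a new attribute.

```python
import warnings

import pandas as pd

# Hypothetical frame: one identifier-like name, one name with a space, one integer name.
df = pd.DataFrame({"col2": [1, 2, 3], "my col": [4, 5, 6], 0: [7, 8, 9]})

# Dot access and bracket access return the same Series for an existing column.
dot_equals_bracket = df.col2.equals(df["col2"])

# Assigning through the dot sets a plain Python attribute, not a column.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # pandas may warn about this pattern
    df.new_col = [10, 11, 12]
attribute_became_column = "new_col" in df.columns

# Bracket assignment is what actually adds a column.
df["real_col"] = [10, 11, 12]

# Names with spaces or integer names are only reachable with brackets.
spaced = df["my col"]
numbered = df[0]

print(dot_equals_bracket, attribute_became_column, list(df.columns))
```

Under these assumptions the attribute assignment leaves df.columns unchanged, while the bracket assignment extends it.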
Speed difference between bracket notation and dot notation for accessing columns in pandas
df['CID'] delegates to NDFrame.__getitem__, and it is more obvious you are performing an indexing operation.
On the other hand, df.CID delegates to NDFrame.__getattr__, which has to do some additional heavy lifting, mainly to determine whether 'CID' is an attribute, a function, or a column you're calling using the attribute access (a convenience, but not recommended for production code).
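One rough way to see whether the extra lookup work matters on your machine is to time both forms. This is only a sketch: the frame size and the repeat count are arbitrary choices, and the numbers depend on the pandas version and hardware, so no particular result is claimed here.

```python
import timeit

import pandas as pd

# Arbitrary example frame with a column named "CID".
df = pd.DataFrame({"CID": range(1000)})

repeats = 10_000  # arbitrary; raise it for more stable timings

# Time bracket access and attribute access to the same column.
bracket_time = timeit.timeit(lambda: df["CID"], number=repeats)
dot_time = timeit.timeit(lambda: df.CID, number=repeats)

print(f"df['CID']: {bracket_time:.4f} s for {repeats} lookups")
print(f"df.CID:    {dot_time:.4f} s for {repeats} lookups")
```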
Why is attribute access not recommended? Consider:
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
df.A
0    1
1    2
2    3
Name: A, dtype: int64
There are no issues referring to column "A" as df.A, because it does not conflict with any attribute or function name in pandas. However, consider the pop function (just as an example).
df.pop
# <bound method NDFrame.pop of ...>
df.pop is a bound method of df. Now, I'd like to create a column called "pop" for various reasons.
df['pop'] = [4, 5, 6]
df
   A  pop
0  1    4
1  2    5
2  3    6
Great, but,
df.pop
# <bound method NDFrame.pop of ...>
I cannot use the attribute notation to access this column. However...
df['pop']
0    4
1    5
2    6
Name: pop, dtype: int64
Bracket notation still works. It always refers to the column, no matter what the column is called, and that is why it is the better choice.
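To find out up front which columns would be shadowed like this, something along the following lines may help. The helper name columns_unsafe_for_dot_access is hypothetical (it is not part of pandas); it simply flags names that are not usable as Python identifiers or that already exist as attributes on the DataFrame class.

```python
import keyword

import pandas as pd


def columns_unsafe_for_dot_access(df):
    """Return column names that df.<name> cannot reliably reach (hypothetical helper)."""
    unsafe = []
    for name in df.columns:
        # Non-strings, names with spaces, and Python keywords cannot follow a dot.
        if not isinstance(name, str) or not name.isidentifier() or keyword.iskeyword(name):
            unsafe.append(name)
        # Names like "pop" or "sum" resolve to the DataFrame method instead of the column.
        elif hasattr(pd.DataFrame, name):
            unsafe.append(name)
    return unsafe


df = pd.DataFrame({"A": [1, 2, 3], "pop": [4, 5, 6], "my col": [7, 8, 9], 0: [1, 1, 1]})
unsafe = columns_unsafe_for_dot_access(df)
print(unsafe)
```

Every column it reports is still reachable with df[name].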
pandas dataframe where clause with dot versus brackets column selection
The dot notation is just a convenient shortcut for accessing things vs. the standard brackets. Notably, it doesn't work when the column name is something like sum that is already a DataFrame method. My bet would be that the column name in your real example is running into that issue, so it works fine with the bracket selection but with the dot you are instead testing whether a method is equal to 'blah'.
Quick example below:
In [67]: df = pd.DataFrame(np.arange(10).reshape(5,2), columns=["number", "sum"])
In [68]: df
Out[68]:
   number  sum
0       0    1
1       2    3
2       4    5
3       6    7
4       8    9
In [69]: df.number == 0
Out[69]:
0     True
1    False
2    False
3    False
4    False
Name: number, dtype: bool
In [70]: df.sum == 0
Out[70]: False
In [71]: df['sum'] == 0
Out[71]:
0    False
1    False
2    False
3    False
4    False
Name: sum, dtype: bool
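The same frame can be used to sketch the actual "where clause". The comparison value 1 is chosen only so that one row matches; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["number", "sum"])

# df.sum is the DataFrame method, so this compares a bound method with an int.
# The result is a single bool, not a boolean mask.
attr_result = df.sum == 1

# Bracket access reaches the column, so this is a proper element-wise mask.
mask = df["sum"] == 1
filtered = df[mask]

print(attr_result)
print(filtered)
```

Using the single bool as a filter would not select rows by value, which is why the bracket form is needed for any column whose name collides with a method.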
What is the difference between using loc and using just square brackets to filter for columns in Pandas/Python?
In the following situations, they behave the same:
- Selecting a single column (df['A'] is the same as df.loc[:, 'A'] -> selects column A)
- Selecting a list of columns (df[['A', 'B', 'C']] is the same as df.loc[:, ['A', 'B', 'C']] -> selects columns A, B and C)
- Slicing by rows (df[1:3] is the same as df.iloc[1:3] -> selects rows 1 and 2. Note, however, that if you slice rows with loc instead of iloc, you'll get rows 1, 2 and 3, assuming you have a RangeIndex, because label-based slices include both endpoints.)
However, [] does not work in the following situations:
- You can select a single row with df.loc[row_label]
- You can select a list of rows with df.loc[[row_label1, row_label2]]
- You can slice columns with df.loc[:, 'A':'C']
These three cannot be done with [].
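The equivalences and the loc-only selections listed above can be sketched on a small frame. The data is invented for illustration, and the frame uses the default RangeIndex so that labels and positions coincide.

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [10, 11, 12, 13, 14], "B": [20, 21, 22, 23, 24], "C": [30, 31, 32, 33, 34]}
)  # default RangeIndex 0..4

# Cases where [] and .loc / .iloc agree.
single_col_same = df["A"].equals(df.loc[:, "A"])
col_list_same = df[["A", "B"]].equals(df.loc[:, ["A", "B"]])
row_slice_same = df[1:3].equals(df.iloc[1:3])

# Positional slices exclude the end; label slices include it.
rows_by_position = df.iloc[1:3]  # rows 1 and 2
rows_by_label = df.loc[1:3]      # rows 1, 2 and 3

# Selections that need .loc.
one_row = df.loc[2]               # a single row, returned as a Series
some_rows = df.loc[[0, 4]]        # a list of row labels
col_slice = df.loc[:, "A":"B"]    # a slice of columns, both ends included

print(len(rows_by_position), len(rows_by_label))
```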
More importantly, if your selection involves both rows and columns, then assignment becomes problematic.
df[1:3]['A'] = 5
This selects rows 1 and 2, then selects column 'A' of the returned object and assigns the value 5 to it. The problem is that the returned object might be a copy, so this may not change the actual DataFrame. pandas typically flags this with a SettingWithCopyWarning. The correct way of making this assignment is (note that the label slice 1:2 includes both endpoints, so it covers the same rows 1 and 2):
df.loc[1:2, 'A'] = 5
With .loc, you are guaranteed to modify the original DataFrame. It also allows you to slice columns (df.loc[:, 'C':'F']), select a single row (df.loc[5]), and select a list of rows (df.loc[[1, 2, 5]]).
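A minimal sketch of assignment through .loc, using an invented frame with a default RangeIndex. The chained form is deliberately not executed here, since its effect depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 0, 0, 0, 0], "B": [1, 2, 3, 4, 5]})

# One indexing call selects rows and column together, so the write hits df itself.
# Label slices are inclusive: 1:2 means the rows labelled 1 and 2.
df.loc[1:2, "A"] = 5

# The same pattern works with a boolean row mask.
df.loc[df["B"] > 4, "A"] = 9

print(df)
```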
Also note that these two were not included in the API at the same time. .loc was added much later as a more powerful and explicit indexer. See unutbu's answer for more detail.
Note: Getting columns with [] vs . is a completely different topic. The dot is only there for convenience. It only allows accessing columns whose names are valid Python identifiers (i.e. they cannot contain spaces, they cannot start with a digit...). It cannot be used when the names conflict with Series/DataFrame methods. It also cannot be used for non-existing columns (i.e. the assignment df.a = 1 won't create a column if there is no column a). Other than that, . and [] are the same.
What's the difference between the square bracket and dot notations in Python?
The dot operator is used for accessing attributes of any object. For example, a complex number
>>> c = 3+4j
has (among others) the two attributes real and imag:
>>> c.real
3.0
>>> c.imag
4.0
As well as those, it has a method, conjugate(), which is also an attribute:
>>> c.conjugate
<built-in method conjugate of complex object at 0x7f4422d73050>
>>> c.conjugate()
(3-4j)
Square bracket notation is used for accessing members of a collection, whether that's by key in the case of a dictionary or other mapping:
>>> d = {'a': 1, 'b': 2}
>>> d['a']
1
... or by index in the case of a sequence like a list or string:
>>> s = ['x', 'y', 'z']
>>> s[2]
'z'
>>> t = 'Kapow!'
>>> t[3]
'o'
These collections also, separately, have attributes:
>>> d.pop
<built-in method pop of dict object at 0x7f44204068c8>
>>> s.reverse
<built-in method reverse of list object at 0x7f4420454d08>
>>> t.lower
<built-in method lower of str object at 0x7f4422ce2688>
... and again, in the above cases, these attributes happen to be methods.
While all objects have some attributes, not all objects have members. For example, if we try to use square bracket notation to access a member of our complex number c:
>>> c[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'complex' object is not subscriptable
... we get an error (which makes sense, since there's no obvious way for a complex number to have members).
It's possible to define how [] and . access work in a user-defined class, using the special methods __getitem__() and __getattr__() respectively. Explaining how to do so is beyond the scope of this question, but you can read more about it in the Python Tutorial.
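For readers who want a rough idea anyway, here is a minimal sketch. The Record class is invented for illustration; it stores its fields in a dict and exposes them through both notations, which also reproduces the kind of name clash that pandas columns can run into.

```python
class Record:
    """Toy container (hypothetical) supporting both r['name'] and r.name."""

    def __init__(self, **fields):
        self._fields = dict(fields)

    def __getitem__(self, key):
        # Called for r[key].
        return self._fields[key]

    def __getattr__(self, name):
        # Called for r.name only when normal attribute lookup has failed.
        fields = self.__dict__.get("_fields", {})
        try:
            return fields[name]
        except KeyError:
            raise AttributeError(name) from None

    def keys(self):
        return list(self._fields)


r = Record(name="x", keys=3)

print(r.name)     # falls through to __getattr__, which finds the field
print(r["keys"])  # __getitem__ always reaches the stored field
print(r.keys)     # the real method wins, so the field named "keys" is shadowed
```

Because __getattr__ is only a fallback, any field whose name matches a real attribute or method is reachable only through square brackets.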
Propagation in Python - Pandas Series TypeError
Two options come to mind.
Option 1: use numpy.sqrt:
import numpy as np
joinedDF['combined_error'] = np.sqrt(joinedDF['error1']**2 +
                                     joinedDF['error2']**2)
Option 2: if you want/need to avoid numpy for some reason, you can apply math.sqrt to the numeric column element by element. This is likely slower than Option 1, but works in my testing:
import math
joinedDF['combined_error'] = (joinedDF['error1']**2 +
                              joinedDF['error2']**2).apply(math.sqrt)
Minor style comment: it's generally recommended to refer to DataFrame columns using indexing (square brackets) rather than attribute access (dot notation), so I modified your code accordingly. More reading: What is the difference between using squared brackets or dot to access a column?
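Both options can be tried on a small made-up frame. The column names follow the answer above; the values are chosen so the combined errors come out as whole numbers, and the frame is named joined_df here only to keep the sketch self-contained.

```python
import math

import numpy as np
import pandas as pd

# Invented sample data with the column names used above.
joined_df = pd.DataFrame({"error1": [3.0, 5.0], "error2": [4.0, 12.0]})

# math.sqrt expects a single number, so passing a whole Series raises TypeError.
try:
    math.sqrt(joined_df["error1"])
    scalar_sqrt_failed = False
except TypeError:
    scalar_sqrt_failed = True

# Option 1: numpy's sqrt works element-wise on the Series.
joined_df["combined_np"] = np.sqrt(joined_df["error1"] ** 2 + joined_df["error2"] ** 2)

# Option 2: apply math.sqrt to each element in turn.
joined_df["combined_math"] = (
    joined_df["error1"] ** 2 + joined_df["error2"] ** 2
).apply(math.sqrt)

print(joined_df)
```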
The difference between double brace `[[...]]` and single brace `[..]` indexing in Pandas
Consider this:
Source DF:
In [79]: df
Out[79]:
   Brains  Bodies
0      42      34
1      32      23
Selecting one column - results in a pandas Series:
In [80]: df['Brains']
Out[80]:
0    42
1    32
Name: Brains, dtype: int64
In [81]: type(df['Brains'])
Out[81]: pandas.core.series.Series
Selecting a subset of the DataFrame - results in a DataFrame:
In [82]: df[['Brains']]
Out[82]:
   Brains
0      42
1      32
In [83]: type(df[['Brains']])
Out[83]: pandas.core.frame.DataFrame
Conclusion: the second approach takes a list of column names, so it allows us to select multiple columns from the DataFrame. The first one is only for selecting a single column.
Demo:
In [84]: df = pd.DataFrame(np.random.rand(5,6), columns=list('abcdef'))
In [85]: df
Out[85]:
          a         b         c         d         e         f
0  0.065196  0.257422  0.273534  0.831993  0.487693  0.660252
1  0.641677  0.462979  0.207757  0.597599  0.117029  0.429324
2  0.345314  0.053551  0.634602  0.143417  0.946373  0.770590
3  0.860276  0.223166  0.001615  0.212880  0.907163  0.437295
4  0.670969  0.218909  0.382810  0.275696  0.012626  0.347549
In [86]: df[['e','a','c']]
Out[86]:
          e         a         c
0  0.487693  0.065196  0.273534
1  0.117029  0.641677  0.207757
2  0.946373  0.345314  0.634602
3  0.907163  0.860276  0.001615
4  0.012626  0.670969  0.382810
And if we specify only one column in the list, we will get a DataFrame with one column:
In [87]: df[['e']]
Out[87]:
          e
0  0.487693
1  0.117029
2  0.946373
3  0.907163
4  0.012626
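The distinction can also be checked programmatically. This sketch rebuilds the small Brains/Bodies frame from above and inspects the types and shapes that each form returns.

```python
import pandas as pd

df = pd.DataFrame({"Brains": [42, 32], "Bodies": [34, 23]})

# Single brackets with one label: a one-dimensional Series.
one_column = df["Brains"]

# Double brackets (a list of labels inside the indexer): always a DataFrame,
# even when the list holds a single name.
one_column_frame = df[["Brains"]]

# The list also controls the column order of the result.
reordered = df[["Bodies", "Brains"]]

print(type(one_column).__name__, one_column.shape)
print(type(one_column_frame).__name__, one_column_frame.shape)
print(list(reordered.columns))
```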
Series to tolist() gives elements in square brackets when appended (Python)
Take the first element of the underlying array so that ratio is a single value:
ratio = df_fd.loc[(df_fd['variable'] == col) & (df_fd['Value'] == val)]['ratio'].values[0]
and remove the line
ratio = list(ratio)
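A small sketch of the idea, with invented contents for df_fd, col and val (only the column names come from the line above). It uses a single .loc call with a row mask and a column label instead of the chained form, which is a substitution made here, not part of the original answer.

```python
import pandas as pd

# Hypothetical data; only the column names are taken from the answer.
df_fd = pd.DataFrame(
    {"variable": ["a", "a", "b"], "Value": [1, 2, 1], "ratio": [0.1, 0.2, 0.3]}
)
col, val = "a", 2

mask = (df_fd["variable"] == col) & (df_fd["Value"] == val)

# .values[0] pulls out the first matching element as a scalar.
ratio = df_fd.loc[mask, "ratio"].values[0]

# Appending the scalar gives a flat list of numbers.
ratios = []
ratios.append(ratio)

# By contrast, tolist() on the selection returns a list, and appending that nests it.
nested = []
nested.append(df_fd.loc[mask, "ratio"].tolist())

print(ratios, nested)
```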