When is i += x different from i = i + x in Python?
This depends entirely on the object i.

+= calls the __iadd__ method (if it exists, falling back on __add__ if it doesn't) whereas + calls the __add__ method [1] or the __radd__ method in a few cases [2].

From an API perspective, __iadd__ is supposed to be used for modifying mutable objects in place (returning the object which was mutated), whereas __add__ should return a new instance of something. For immutable objects, both methods return a new instance, but __iadd__ will put the new instance in the current namespace with the same name that the old instance had. This is why

i = 1
i += 1

seems to increment i. In reality, you get a new integer and assign it "on top of" i, losing one reference to the old integer. In this case, i += 1 is exactly the same as i = i + 1. But with most mutable objects, it's a different story:
As a concrete example:
a = [1, 2, 3]
b = a
b += [1, 2, 3]
print(a) # [1, 2, 3, 1, 2, 3]
print(b) # [1, 2, 3, 1, 2, 3]
compared to:
a = [1, 2, 3]
b = a
b = b + [1, 2, 3]
print(a) # [1, 2, 3]
print(b) # [1, 2, 3, 1, 2, 3]
Notice how in the first example, since b and a reference the same object, when I use += on b, it actually changes b (and a sees that change too; after all, it's referencing the same list). In the second case, however, b = b + [1, 2, 3] takes the list that b is referencing and concatenates it with a new list [1, 2, 3]. It then stores the concatenated list in the current namespace as b, with no regard for what b was the line before.
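To make the contract concrete, here is a minimal sketch (a hypothetical Adder class, not from the original post) implementing both methods the way the API intends: __iadd__ mutates in place and returns self, __add__ returns a fresh instance.

```python
class Adder:
    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Return a brand-new instance; neither operand is modified.
        return Adder(self.values + other.values)

    def __iadd__(self, other):
        # Mutate in place, then return the mutated object.
        self.values.extend(other.values)
        return self

a = Adder([1, 2])
b = a
b += Adder([3])        # __iadd__: a and b still refer to one object
print(a.values)        # [1, 2, 3]
b = b + Adder([4])     # __add__: b is rebound to a new instance
print(a.values)        # [1, 2, 3]  (unchanged)
print(b.values)        # [1, 2, 3, 4]
```

This mirrors the list example above: += rippled through every name bound to the object, while + only rebound the name on the left-hand side.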
[1] In the expression x + y, if x.__add__ isn't implemented, or if x.__add__(y) returns NotImplemented and x and y have different types, then x + y tries to call y.__radd__(x). So, in the case where you have

foo_instance += bar_instance

if Foo doesn't implement __add__ or __iadd__, then the result here is the same as

foo_instance = bar_instance.__radd__(foo_instance)
[2] In the expression foo_instance + bar_instance, bar_instance.__radd__ will be tried before foo_instance.__add__ if the type of bar_instance is a subclass of the type of foo_instance (e.g. issubclass(Bar, Foo)). The rationale for this is that Bar is in some sense a "higher-level" object than Foo, so Bar should get the option of overriding Foo's behavior.
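Footnote 2 can be demonstrated with a pair of hypothetical classes (Foo and Bar here are stand-ins, not from any library):

```python
class Foo:
    def __add__(self, other):
        return "Foo.__add__"

class Bar(Foo):
    # Bar is a subclass of Foo and overrides the reflected method,
    # so its __radd__ is tried before Foo's __add__.
    def __radd__(self, other):
        return "Bar.__radd__"

print(Foo() + Bar())   # Bar.__radd__  (subclass wins)
print(Foo() + Foo())   # Foo.__add__   (the plain case)
```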
Is .ix() always better than .loc() and .iloc() since it is faster and supports integer and label access?
Please refer to the doc Different Choices for Indexing; it states clearly when and why you should use .loc or .iloc over .ix. It's about the explicit use case:

.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. .ix is the most general and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes. However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.
Update 22 Mar 2017

Thanks to a comment from @Alexander: pandas is going to deprecate ix in 0.20, details in here. One of the strong reasons behind this is that mixing indexes, positional and label (which is effectively what ix does), has been a significant source of problems for users. You are expected to migrate to iloc and loc instead; here is a link on how to convert code.
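The migration itself is usually mechanical. A small sketch (the DataFrame and its labels are hypothetical) of the explicit equivalents:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30]}, index=["a", "b", "c"])

# df.ix["b", "x"]  ->  purely label-based access:
print(df.loc["b", "x"])    # 20

# df.ix[1, 0]      ->  purely position-based access:
print(df.iloc[1, 0])       # 20
```

Each ambiguous ix call splits into one of these two unambiguous forms.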
Subsetting DataFrame using ix in Python
You could use X['var2'].iloc[[0,1]]
:
In [280]: X['var2'].iloc[[0,1]]
Out[280]:
0 NaN
4 9
Name: var2, dtype: float64
Since X['var2'] is a view of X, X['var2'].iloc[[0,1]] is safe for both access and assignment. But be careful if you use this "chained indexing" pattern (such as the index-by-column-then-index-by-iloc pattern used here) for assignments, since it does not generalize to assignments with multiple columns. For example, X[['var2', 'var3']].iloc[[0,1]] = ... generates a copy of a sub-DataFrame of X, so assignment to this sub-DataFrame does not modify X. See the docs on "Why does assignment fail when using chained indexing?" for more explanation.
To be concrete, and to show why this view-vs-copy distinction is important: if you have this warning turned on:

pd.options.mode.chained_assignment = 'warn'

then this assignment raises a SettingWithCopyWarning:
In [252]: X[['var2', 'var3']].iloc[[0,1]] = 100
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a
DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
and the assignment fails to modify X. Eek!
In [281]: X
Out[281]:
var1 var2 var3
0 3 NaN 11
4 3 9 13
3 2 NaN 14
2 5 9 12
1 2 7 13
To get around this problem, when you want an assignment to affect X, you must assign through a single indexer (e.g. X.iloc[...] = ... or X.loc[...] = ... or X.ix[...] = ...), that is, without chained indexing. In this case, you could use
In [265]: X.iloc[[0,1], X.columns.get_indexer_for(['var2', 'var3'])] = 100
In [266]: X
Out[266]:
var1 var2 var3
0 3 100 100
4 3 100 100
3 2 NaN 14
2 5 9 12
1 2 7 13
But I wonder if there is a better way, since this is not terribly pretty.
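One arguably tidier alternative is to translate the positions to index labels once, then use a single .loc call. A sketch (X here is rebuilt to match the answer's data, so the exact values are assumptions):

```python
import numpy as np
import pandas as pd

X = pd.DataFrame({"var1": [3, 3, 2, 5, 2],
                  "var2": [np.nan, 9, np.nan, 9, 7],
                  "var3": [11, 13, 14, 12, 13]},
                 index=[0, 4, 3, 2, 1])

# Positions [0, 1] -> labels via X.index, then one non-chained .loc call:
X.loc[X.index[[0, 1]], ["var2", "var3"]] = 100
print(X.loc[X.index[0], "var2"])   # 100.0
```

Because only a single indexer is involved, the assignment modifies X directly and no SettingWithCopyWarning is triggered.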
pandas df.ix[number, column] accesses different scalar type than df[column].ix[number]
This happens because a pandas Series can have only one dtype. When you do df1.ix[0,'y'], you access the first row at column y; the row contains a float variable, so everything in it is converted to np.float64. When you call df1['y'].ix[0], you access the first element of column y, which has dtype np.int32, and everything works as expected. So, for your question about the most robust way: I think the second method is preferable, because you always know the dtype of the column (or can easily check it), whereas a row's values can be converted automatically.
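A short sketch of this upcasting (the column names x and y are assumptions mirroring the question's df1, and the integer width may differ by platform):

```python
import pandas as pd

df1 = pd.DataFrame({"x": [1.5, 2.5], "y": [10, 20]})

print(df1["y"].dtype)        # an integer dtype: the column keeps its own type
print(df1["y"].iloc[0])      # 10 (an integer)

row = df1.iloc[0]            # a whole row is a Series with ONE dtype,
print(row.dtype)             # so the int column is upcast to float64
print(row["y"])              # 10.0 (now a float)
```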
By the way, if you are accessing an element by position (as you are when using the column), it's preferable to use iloc. From the docs for ix:
.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type. .ix is the most general and will support any of the inputs in .loc and .iloc. .ix also supports floating point label schemes. .ix is exceptionally useful when dealing with mixed positional and label based hierarchical indexes. However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc.
If you need to access only a scalar, you should also consider the iat method. From the docs:

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you're asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.
Benchmarking:
In [129]: %timeit df1.y.ix[0]
10000 loops, best of 3: 30.2 us per loop
In [130]: %timeit df1.y.iloc[0]
10000 loops, best of 3: 24.6 us per loop
In [131]: %timeit df1.y.iat[0]
100000 loops, best of 3: 18.8 us per loop
How are iloc and loc different?
Label vs. Location
The main distinction between the two methods is:

- loc gets rows (and/or columns) with particular labels.
- iloc gets rows (and/or columns) at integer locations.
To demonstrate, consider a series s
of characters with a non-monotonic integer index:
>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
49 a
48 b
47 c
0 d
1 e
2 f
>>> s.loc[0] # value at index label 0
'd'
>>> s.iloc[0] # value at index location 0
'a'
>>> s.loc[0:1] # rows at index labels between 0 and 1 (inclusive)
0 d
1 e
>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49 a
Here are some of the differences/similarities between s.loc
and s.iloc
when passed various objects:
<object> | description | s.loc[<object>] | s.iloc[<object>]
---|---|---|---
0 | single item | Value at index label 0 (the string 'd') | Value at index location 0 (the string 'a')
0:1 | slice | Two rows (labels 0 and 1) | One row (first row at location 0)
1:47 | slice with out-of-bounds end | Zero rows (empty Series) | Five rows (location 1 onwards)
1:47:-1 | slice with negative step | Three rows (labels 1 back to 47) | Zero rows (empty Series)
[2, 0] | integer list | Two rows with given labels | Two rows with given locations
s > 'e' | Bool series (indicating which values have the property) | One row (containing 'f') | NotImplementedError
(s>'e').values | Bool array | One row (containing 'f') | Same as loc
999 | int object not in index | KeyError | IndexError (out of bounds)
-1 | int object not in index | KeyError | Returns last value in s
lambda x: x.index[3] | callable applied to series (here returning 3rd item in index) | s.loc[s.index[3]] | s.iloc[s.index[3]]
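A few of the table's rows can be checked directly against the same series:

```python
import pandas as pd

s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])

print(list(s.loc[0:1]))       # ['d', 'e']       labels 0 to 1, inclusive
print(list(s.iloc[0:1]))      # ['a']            location 0 only, end exclusive
print(list(s.loc[1:47:-1]))   # ['e', 'd', 'c']  labels 1 back to 47
print(list(s.iloc[-1:]))      # ['f']            iloc accepts negative positions
print(list(s.loc[s > 'e']))   # ['f']            boolean Series works with loc
```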
speed of .loc, .iloc, and the deprecated .ix. Why not use .ix?
ix has to make assumptions as to what the labels mean. This is not intuitive behaviour, and may lead to serious breakage on corner cases (such as when your column labels are integers themselves). With loc, you're only passing labels. With iloc, you're only passing integer position indexes. The input is obvious, and so is the output.
Now, the speed differences mentioned are of the order of milliseconds or microseconds which is a "seriously, don't worry about it™" kind of difference. I consider that a worthy tradeoff for a more consistent, robust API. 'Nuff said.
Is x%(1e9 + 7) and x%(10**9 + 7) different in Python? If yes, why?
But I still don't understand how a zero after decimal point can cause difference.
Where the decimal point isn't important. It's floating!
Why is this difference arising?
Because floating point numbers in Python are the usual hardware ones which means they have limited storage and precision:
>>> int(123123123123123123123.0)
123123123123123126272
# ^^^^ different!
But integer numbers in Python have infinite storage and precision ("bignum"):
>>> int(123123123123123123123)
123123123123123123123
So:
>>> 123123123123123123123 % 10**9
123123123
>>> 123123123123123123123 % 1e9
123126272.0
In the second case both sides are converted to floating point because one of them is.
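The practical takeaway is to keep the modulus as a Python int, so the whole expression stays in exact integer arithmetic. A short sketch (x is just a large sample value):

```python
x = 123123123123123123123

MOD = 10**9 + 7              # an exact Python int
print(x % MOD)               # exact result, an int

bad = x % (1e9 + 7)          # x is first converted to float: precision loss
print(bad)                   # a float, and generally a wrong value
print(x % MOD == bad)        # False
```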
Force .ix to return a DataFrame in pandas
Yes, index it with a list:
> x = df.ix[[0]]
> y = df.ix[[1]]
> type(x)
pandas.core.frame.DataFrame
> type(y)
pandas.core.frame.DataFrame
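Note that .ix is deprecated (and removed in modern pandas), but the same list-indexing trick works with .loc and .iloc. A sketch with a hypothetical DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

print(type(df.loc[0]))      # Series    (scalar label collapses a dimension)
print(type(df.loc[[0]]))    # DataFrame (a list of labels keeps it 2-D)
print(type(df.iloc[[1]]))   # DataFrame (same with a list of positions)
```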