When Is "I += X" Different from "I = I + X" in Python

This depends entirely on the object i.

+= calls the __iadd__ method (if it exists, falling back on __add__ if it doesn't) whereas + calls the __add__ method [1] or the __radd__ method in a few cases [2].

From an API perspective, __iadd__ is supposed to be used for modifying mutable objects in place (returning the object which was mutated) whereas __add__ should return a new instance of something. For immutable objects, both methods return a new instance, but __iadd__ will put the new instance in the current namespace with the same name that the old instance had. This is why

i = 1
i += 1

seems to increment i. In reality, you get a new integer and assign it "on top of" i -- losing one reference to the old integer. In this case, i += 1 is exactly the same as i = i + 1. But, with most mutable objects, it's a different story:

As a concrete example:

a = [1, 2, 3]
b = a
b += [1, 2, 3]
print(a) # [1, 2, 3, 1, 2, 3]
print(b) # [1, 2, 3, 1, 2, 3]

compared to:

a = [1, 2, 3]
b = a
b = b + [1, 2, 3]
print(a) # [1, 2, 3]
print(b) # [1, 2, 3, 1, 2, 3]

Notice how in the first example, since b and a reference the same object, when I use += on b, it actually changes b (and a sees that change too; after all, it's referencing the same list). In the second case, however, when I do b = b + [1, 2, 3], this takes the list that b is referencing and concatenates it with a new list [1, 2, 3]. It then stores the concatenated list in the current namespace as b, with no regard for what b was the line before.
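
To see the dispatch directly, here is a minimal sketch (the Adder class and its prints are purely illustrative):

class Adder:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        print("__add__ called")
        return Adder(self.value + other)   # returns a new instance

    def __iadd__(self, other):
        print("__iadd__ called")
        self.value += other                # mutates in place...
        return self                        # ...and returns the same object

a = Adder(1)
a = a + 1   # prints "__add__ called": a now names a brand-new Adder
a += 1      # prints "__iadd__ called": the existing Adder was mutated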


[1] In the expression x + y, if x.__add__ isn't implemented, or if x.__add__(y) returns NotImplemented and x and y have different types, then x + y tries to call y.__radd__(x). So, in the case where you have

foo_instance += bar_instance

if Foo doesn't implement __add__ or __iadd__ then the result here is the same as

foo_instance = bar_instance.__radd__(foo_instance)

[2] In the expression foo_instance + bar_instance, bar_instance.__radd__ will be tried before foo_instance.__add__ if the type of bar_instance is a subclass of the type of foo_instance (e.g. issubclass(Bar, Foo)). The rationale for this is that Bar is in some sense a "higher-level" object than Foo, so Bar should get the option of overriding Foo's behavior.
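
A minimal sketch of that precedence rule (Foo and Bar here are illustrative):

class Foo:
    def __add__(self, other):
        return "Foo.__add__"

class Bar(Foo):
    def __radd__(self, other):
        return "Bar.__radd__"

print(Foo() + Bar())   # "Bar.__radd__" -- the subclass's reflected method wins
print(Foo() + Foo())   # "Foo.__add__"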

Is .ix() always better than .loc() and .iloc() since it is faster and supports integer and label access?

Please refer to the doc Different Choices for Indexing; it states clearly when and why you should use .loc or .iloc over .ix. It comes down to being explicit about your use case:

.ix supports mixed integer and label based access. It is primarily
label based, but will fall back to integer positional access unless
the corresponding axis is of integer type. .ix is the most general and
will support any of the inputs in .loc and .iloc. .ix also supports
floating point label schemes. .ix is exceptionally useful when dealing
with mixed positional and label based hierarchical indexes.

However, when an axis is integer based, ONLY label based access and
not positional access is supported. Thus, in such cases, it’s usually
better to be explicit and use .iloc or .loc.

Update 22 Mar 2017

Thanks to a comment from @Alexander: pandas will deprecate ix in 0.20, details here.

One of the strong reasons behind the deprecation is that mixing positional and label indexing (which is effectively what .ix does) has been a significant source of problems for users.

Users are expected to migrate to iloc and loc instead; here is a link on how to convert code.
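
As a rough sketch of what that migration looks like (df and 'col' are illustrative names here):

# deprecated: .ix guesses whether 0 and 'col' are labels or positions
value = df.ix[0, 'col']

# explicit label-based equivalent
value = df.loc[df.index[0], 'col']

# explicit position-based equivalent
value = df.iloc[0, df.columns.get_loc('col')]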

Subsetting DataFrame using ix in Python

You could use X['var2'].iloc[[0,1]]:

In [280]: X['var2'].iloc[[0,1]]
Out[280]:
0   NaN
4     9
Name: var2, dtype: float64

Since X['var2'] is a view of X, X['var2'].iloc[[0,1]] is safe for both
access and assignments. But be careful if you use this "chained indexing"
pattern (such as the index-by-column-then-index-by-iloc pattern used here) for assignments, since it does not
generalize to the case of assignments with multiple columns.

For example, X[['var2', 'var3']].iloc[[0,1]] = ... generates a copy of a
sub-DataFrame of X, so assignment to this sub-DataFrame does not modify X.
See the docs section "Why does assignment fail when using chained indexing?"
for more explanation.

To be concrete and to show why this view-vs-copy distinction is important: If you have this warning turned on:

pd.options.mode.chained_assignment = 'warn'

then this assignment raises a SettingWithCopyWarning:

In [252]: X[['var2', 'var3']].iloc[[0,1]] = 100
SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a
DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)

and the assignment fails to modify X. Eek!

In [281]: X
Out[281]:
   var1  var2  var3
0     3   NaN    11
4     3     9    13
3     2   NaN    14
2     5     9    12
1     2     7    13

To get around this problem, when you want an assignment to affect X, you must
assign through a single indexer (e.g. X.iloc[...] = ... or X.loc[...] = ... or X.ix[...] = ...) -- that is, without chained indexing.

In this case, you could use

In [265]: X.iloc[[0,1], X.columns.get_indexer_for(['var2', 'var3'])] = 100

In [266]: X
Out[266]:
   var1  var2  var3
0     3   100   100
4     3   100   100
3     2   NaN    14
2     5     9    12
1     2     7    13

but I wonder if there is a better way, since this is not terribly pretty.
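
One slightly tidier alternative is to translate the positions into labels and stay within a single .loc call (a sketch, using the same X as above):

X.loc[X.index[[0, 1]], ['var2', 'var3']] = 100

Here X.index[[0, 1]] picks out the labels at positions 0 and 1, so the whole statement goes through one indexer and avoids the chained-indexing trap.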

pandas df.ix[number, column] accesses different scalar type than df[column].ix[number]

This happens because a pandas Series can only hold one dtype. When you do df1.ix[0,'y'], you first access the first row, which is itself a Series; that row contains a float, so every value in it is converted to np.float64. When you call df1['y'].ix[0], you first access the column y, which has dtype np.int32, and then take its first element, so everything works as expected. As for your question about the most robust way: I think the second method is preferable, because with a column you always know (or can easily check) its dtype, whereas a row's values can be converted automatically.
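
A small sketch of that upcasting (df1 here is an illustrative frame with an int32 column y; .loc is used in place of the now-deprecated .ix):

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'x': [1.5, 2.5],
                    'y': np.array([1, 2], dtype=np.int32)})

print(df1.loc[0].dtype)       # float64: the row mixes float and int
print(type(df1.loc[0]['y']))  # numpy.float64 (upcast via the row)
print(type(df1['y'].loc[0]))  # numpy.int32  (column dtype preserved)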

By the way, if you are accessing the element by position (as when you index into the column), it's preferable to use .iloc. From the docs for .ix:

.ix supports mixed integer and label based access. It is primarily
label based, but will fall back to integer positional access unless
the corresponding axis is of integer type. .ix is the most general and
will support any of the inputs in .loc and .iloc. .ix also supports
floating point label schemes. .ix is exceptionally useful when dealing
with mixed positional and label based hierarchical indexes.

However, when an axis is integer based, ONLY label based access and
not positional access is supported. Thus, in such cases, it’s usually
better to be explicit and use .iloc or .loc.

If you only need to access a scalar, you should also consider the .iat method. From the docs:

Since indexing with [] must handle a lot of cases (single-label
access, slicing, boolean indexing, etc.), it has a bit of overhead in
order to figure out what you’re asking for. If you only want to access
a scalar value, the fastest way is to use the at and iat methods,
which are implemented on all of the data structures.

Benchmarking:

In [129]: %timeit df1.y.ix[0]
10000 loops, best of 3: 30.2 us per loop

In [130]: %timeit df1.y.iloc[0]
10000 loops, best of 3: 24.6 us per loop

In [131]: %timeit df1.y.iat[0]
100000 loops, best of 3: 18.8 us per loop

How are iloc and loc different?

Label vs. Location

The main distinction between the two methods is:

  • loc gets rows (and/or columns) with particular labels.

  • iloc gets rows (and/or columns) at integer locations.

To demonstrate, consider a series s of characters with a non-monotonic integer index:

>>> s = pd.Series(list("abcdef"), index=[49, 48, 47, 0, 1, 2])
>>> s
49    a
48    b
47    c
0     d
1     e
2     f
dtype: object

>>> s.loc[0] # value at index label 0
'd'

>>> s.iloc[0] # value at index location 0
'a'

>>> s.loc[0:1] # rows at index labels between 0 and 1 (inclusive)
0    d
1    e

>>> s.iloc[0:1] # rows at index location between 0 and 1 (exclusive)
49    a

Here are some of the differences/similarities between s.loc and s.iloc when passed various objects:

<object>             | description                                             | s.loc[<object>]                          | s.iloc[<object>]
-------------------- | ------------------------------------------------------- | ---------------------------------------- | ------------------------------------------
0                    | single item                                             | Value at index label 0 (the string 'd')  | Value at index location 0 (the string 'a')
0:1                  | slice                                                   | Two rows (labels 0 and 1)                 | One row (first row at location 0)
1:47                 | slice with out-of-bounds end                            | Zero rows (empty Series)                  | Five rows (location 1 onwards)
1:47:-1              | slice with negative step                                | Three rows (labels 1 back to 47)          | Zero rows (empty Series)
[2, 0]               | integer list                                            | Two rows with given labels                | Two rows with given locations
s > 'e'              | Bool series (indicating which values have the property) | One row (containing 'f')                  | NotImplementedError
(s > 'e').values     | Bool array                                              | One row (containing 'f')                  | Same as loc
999                  | int object not in index                                 | KeyError                                  | IndexError (out of bounds)
-1                   | int object not in index                                 | KeyError                                  | Returns last value in s
lambda x: x.index[3] | callable applied to series (here returning 3rd item in index) | s.loc[s.index[3]]                   | s.iloc[s.index[3]]
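
A few of those rows checked interactively (same s as above; exact exception messages vary across pandas versions):

>>> s.loc[s > 'e']            # boolean Series works with loc
2    f
dtype: object

>>> s.iloc[(s > 'e').values]  # iloc wants a plain boolean array
2    f
dtype: object

>>> s.iloc[-1]                # -1 is a valid position for iloc...
'f'

>>> s.loc[-1]                 # ...but not a label, so loc raises
KeyError: -1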

Speed of .loc, .iloc, and the deprecated .ix. Why not use .ix?

ix has to make assumptions as to what the labels mean. This is not intuitive behaviour, and may lead to serious breakage on corner cases (such as when your column labels are integers themselves). With loc, you're only passing labels. With iloc, you're only passing integer position indexes. The input is obvious and the output is as well.

Now, the speed differences mentioned are of the order of milliseconds or microseconds which is a "seriously, don't worry about it™" kind of difference. I consider that a worthy tradeoff for a more consistent, robust API. 'Nuff said.

Are x % (1e9 + 7) and x % (10**9 + 7) different in Python? If yes, why?

But I still don't understand how a zero after the decimal point can cause a difference.

Where the decimal point is isn't important. It's floating!

Why is this difference arising?

Because floating point numbers in Python are the usual hardware ones, which means they have limited storage and precision:

>>> int(123123123123123123123.0)
123123123123123126272
# ^^^^ different!

But integers in Python have arbitrary precision ("bignums"):

>>> int(123123123123123123123)
123123123123123123123

So:

>>> 123123123123123123123 % 10**9
123123123

>>> 123123123123123123123 % 1e9
123126272.0

In the second case both sides are converted to floating point because one of them is.
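
Note that 1e9 + 7 itself is exactly representable as a float; the damage comes from converting the huge left-hand side (a quick check):

>>> 1e9 + 7 == 10**9 + 7               # the modulus survives the conversion
True
>>> int(float(123123123123123123123))  # the dividend does not
123123123123123126272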

Force .ix to return a DataFrame in pandas

Yes, index it with a list:

> x = df.ix[[0]]
> y = df.ix[[1]]
> type(x)
pandas.core.frame.DataFrame
> type(y)
pandas.core.frame.DataFrame
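
Since .ix is deprecated in modern pandas, the same list trick works with .loc and .iloc (a sketch, assuming a default integer index):

> x = df.iloc[[0]]   # list of positions -> DataFrame, not Series
> y = df.loc[[1]]    # list of labels -> DataFrame
> type(x)
pandas.core.frame.DataFrame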

