Handling of Duplicate Indices in NumPy Assignments

Handling of duplicate indices in NumPy assignments

In NumPy 1.9 and later this will in general not be well defined.

The current implementation iterates over all (broadcast) fancy indexes (and the assignment array) at the same time using separate iterators, and these iterators all use C-order. In other words: currently, yes, you can rely on it. To be more exact: if you look at mapping.c in NumPy, which handles these things, you will see that it uses PyArray_ITER_NEXT, which is documented to iterate in C-order.
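
A minimal sketch of what that means in practice (with current NumPy, the value from the last occurrence of a duplicated index wins, because of the C-order iteration):

import numpy as np

a = np.zeros(3)
# Index 0 appears twice; with C-order iteration the later value (20)
# overwrites the earlier one (10).
a[[0, 0, 1]] = [10, 20, 30]
print(a)  # [20. 30.  0.]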

For the future I would paint the picture differently. I think it would be good to iterate over all indices plus the assignment array together using the newer iterator. If that is done, the order could be left open for the iterator to decide the fastest way. And if it is left to the iterator, it is hard to say what would happen, but you could not be certain that your example still works (the 1-d case probably would, but...).

So, as far as I can tell it works currently, but it is undocumented (for all I know), so if you think this behaviour should be guaranteed, you would need to lobby for it and, ideally, write some tests to make sure it stays guaranteed. Because I, at least, am tempted to say: if it makes things faster, there is no reason to ensure C-order; but of course maybe there is a good reason hidden somewhere...

The real question here is: Why do you want that anyway? ;)

Will NumPy keep the order of assignment when there are duplicated indexes?

A simpler equivalent to Divakar's solution.

import numpy as np

def assign_last(a, index, b):
    """a[index] = b, keeping the value from the last occurrence
    of each duplicated index."""
    # Reverse both arrays so the last occurrence of each index
    # becomes the first occurrence.
    index = index[::-1]
    b = b[::-1]

    # np.unique returns the index of the first occurrence,
    # i.e. ix_unique == index[ix_first].
    ix_unique, ix_first = np.unique(index, return_index=True)

    a[ix_unique] = b[ix_first]
    return a

a = np.array([0, 1, 2, 3, 4])
index = np.array([1, 2, 3, 1, 2, 1, 2, 3, 4, 2])
b = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
assign_last(a, index, b)

Output

array([0, 5, 9, 7, 8])

numpy.array.__iadd__ and repeated indices

For this, NumPy 1.8 added the at ufunc method:

at(a, indices, b=None)

Performs unbuffered in place operation on operand 'a' for elements
specified by 'indices'. For addition ufunc, this method is equivalent
to a[indices] += b, except that results are accumulated for elements
that are indexed more than once. For example, a[[0,0]] += 1 will
only increment the first element once because of buffering, whereas
add.at(a, [0,0], 1) will increment the first element twice.

.. versionadded:: 1.8.0

In [1]: A = np.array([0, 0, 0])
In [2]: B = np.array([1, 1, 1, 1, 1, 1])
In [3]: idx = [0, 0, 1, 1, 2, 2]
In [4]: np.add.at(A, idx, B)
In [5]: A
Out[5]: array([2, 2, 2])
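
For contrast, continuing the session above: the buffered form A[idx] += B gathers A[idx], adds B to the temporary, and scatters it back, so each duplicated position is incremented only once:

In [6]: A = np.array([0, 0, 0])
In [7]: A[idx] += B
In [8]: A
Out[8]: array([1, 1, 1])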

How to use an additive assignment with list based indexing in Numpy

You are going to have to figure out the repeated items and add them together before updating your array. The following code shows a way of doing that for your first update:

rows, cols = 100, 100
items = 1000

rho = np.zeros((rows, cols))
rho_coeff, dt, i_frac, j_frac = np.random.rand(4, items)
pi = np.random.randint(1, rows-1, size=(items,))
pj = np.random.randint(1, cols-1, size=(items,))

# The following code assumes pi and pj have the same dtype
# Pack each (pi, pj) pair into a single void item so that
# np.unique treats coordinate pairs as atoms.
pij = np.column_stack((pi, pj)).view((np.void,
                                      2*pi.dtype.itemsize)).ravel()

unique_coords, indices = np.unique(pij, return_inverse=True)
unique_coords = unique_coords.view(pi.dtype).reshape(-1, 2)
data = rho_coeff*dt*i_frac*j_frac
binned_data = np.bincount(indices, weights=data)
rho[tuple(unique_coords.T)] += binned_data
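
As an aside, the same accumulation can be written with the unbuffered np.add.at from the previous answer; it is usually slower than the bincount route for large inputs, but it makes a useful cross-check (rho_alt is just an illustrative name):

# Unbuffered equivalent of the bincount accumulation above:
rho_alt = np.zeros((rows, cols))
np.add.at(rho_alt, (pi, pj), data)
# np.allclose(rho_alt, rho)  -> True, since rho started from zeros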

I think you can reuse all of the unique coordinate finding above for the other updates, so the following would work:

ip1_frac, jp1_frac = np.random.rand(2, items)

unique_coords[:, 0] += 1
data = rho_coeff*dt*ip1_frac*j_frac
binned_data = np.bincount(indices, weights=data)
rho[tuple(unique_coords.T)] += binned_data

unique_coords[:, 1] += 1
data = rho_coeff*dt*ip1_frac*jp1_frac
binned_data = np.bincount(indices, weights=data)
rho[tuple(unique_coords.T)] += binned_data

unique_coords[:, 0] -= 1
data = rho_coeff*dt*i_frac*jp1_frac
binned_data = np.bincount(indices, weights=data)
rho[tuple(unique_coords.T)] += binned_data

How do numpy's in-place operations (e.g. `+=`) work?

The first thing you need to realise is that a += x doesn't map exactly to a.__iadd__(x); instead, it maps to a = a.__iadd__(x). Notice that the documentation specifically says that in-place operators return their result, and this doesn't have to be self (although in practice, it usually is). This means a[i] += x trivially maps to:

a.__setitem__(i, a.__getitem__(i).__iadd__(x))

So, the addition technically happens in-place, but only on a temporary object. There is still potentially one less temporary object created than if it called __add__, though.
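
The duplicated-index behaviour from the previous answers follows directly from this expansion; here is a small sketch of the gather-modify-scatter sequence:

import numpy as np

a = np.zeros(3)
i = [0, 0, 2]

# a[i] += 1 expands to a.__setitem__(i, a.__getitem__(i).__iadd__(1)):
tmp = a[i]   # gather into a fresh temporary array [0., 0., 0.]
tmp += 1     # the "in-place" add happens on the temporary
a[i] = tmp   # scatter back; the duplicated index 0 is simply written twice
print(a)     # [1. 0. 1.]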

Unexpected behaviour numpy array indexing

The reason you might find the result in the last two cases unexpected is that the indexing of the array follows the rules of advanced indexing, even though you're also indexing with slices.

For an extensive explanation of this behaviour, check combining advanced and basic indexing in the docs, which covers these last cases in which you're getting unexpected resulting shapes. There you'll see that one of the mentioned scenarios in which we might obtain unexpected results is when:

  • The advanced indexes are separated by a slice, Ellipsis or newaxis. For example x[arr1, :, arr2].

In your case, although you're only using an integer to index the first axis, it is broadcast together with the other advanced index, and both are iterated as one. The dimensions resulting from the advanced indexing operation then come first in the result array, with the sliced dimensions after them.

The key here is to understand that as mentioned in the docs, it is like concatenating the indexing result for each advanced index element.

So in essence it is doing the same as:

z = np.random.random((1,9,10,2))
a = np.concatenate([z[0,:,:,[1]], z[0,:,:,[0]]], axis=0)

Which is the same as the last indexing operation:

b = z[0,:,:,[1,0]]
np.allclose(a,b)
# True

What is the reason behind this behaviour?

A general rule to keep in mind is that:

The resulting axes introduced by the array indexes are at the front, unless they are consecutive.

So since the indexing arrays here are not consecutive, the axes resulting from them will come at the front, and the sliced dimension will go at the back.
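
A quick shape check of that rule (a minimal sketch, using an array shaped like the one in the question):

import numpy as np

z = np.random.random((1, 9, 10, 2))
i = np.array([1, 0])

# Consecutive (trailing) advanced index: its axis stays in place.
print(z[:, :, :, i].shape)  # (1, 9, 10, 2)

# Advanced index separated from the integer index by slices:
# the broadcast index axis moves to the front.
print(z[0, :, :, i].shape)  # (2, 9, 10)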

While it might seem very weird when indexing with 1-dimensional arrays, take into account that it is also possible to index with arrays of an arbitrary number of dimensions. Say we are indexing the same example array on both the first and last axes with 3-d arrays, say both with shape (3, 4, 2). We know that the final array will somewhere also have the shape (3, 4, 2), since both indexing arrays broadcast to the same shape. Now the question is: where should the full slices taken between the indexing arrays be placed?

Given that it is no longer as clear that it should go in the middle, the convention in these cases is that sliced dimensions go at the end. So in such cases it is our task to rearrange the dimensions of the array to match the expected output; in the example above we can do that with swapaxes (or a similar axis-moving function) to get the dimensions arranged as expected.
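
A concrete sketch of both cases (shapes only; np.moveaxis is used for the 1-d example, where the advanced axis has to travel past two sliced axes):

import numpy as np

z = np.random.random((1, 9, 10, 2))

# 1-d advanced indices separated by slices: the index axis lands at the front.
b = z[0, :, :, [1, 0]]
print(b.shape)  # (2, 9, 10)

# Move it back to the end to recover the "intuitive" layout:
c = np.moveaxis(b, 0, -1)
print(np.allclose(c, z[0][:, :, [1, 0]]))  # True

# 3-d advanced indices (illustrative shapes): the broadcast shape (3, 4, 2)
# goes to the front, the sliced dimensions (9, 10) to the back.
i1 = np.random.randint(0, 1, size=(3, 4, 2))
i2 = np.random.randint(0, 2, size=(3, 4, 2))
print(z[i1, :, :, i2].shape)  # (3, 4, 2, 9, 10)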

Finding consecutive duplicates and listing their indexes of where they occur in python

You could use itertools.groupby, it will identify the contiguous groups in the list:

from itertools import groupby

lst = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

groups = [(k, sum(1 for _ in g)) for k, g in groupby(lst)]

cursor = 0
result = []
for k, l in groups:
    if not k and l >= 5:
        result.append([cursor, cursor + l - 1])
    cursor += l

print(result)

Output

[[17, 21], [30, 35]]

How to handle ValueError: Index contains duplicate entries using df.pivot or pd.pivot_table?

You can do this also using pd.crosstab:

dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='first').ffill(axis=1)
print(df_out)

Output:

val        1      3      4      5      6
ID
10     Task1  Task1  Task2  Task2  Task4
11     Task1  Task2  Task3  Task4  Task4
12     Task1  Task2  Task3  Task3  Task4

Or changing the aggfunc to 'last':

dfm = df.melt('ID', value_name='val')
df_out = pd.crosstab(dfm['ID'],dfm['val'],dfm['variable'],aggfunc='last').ffill(axis=1)
df_out

Output:

val        1      3      4      5      6
ID
10     Task1  Task1  Task3  Task3  Task4
11     Task1  Task2  Task3  Task4  Task4
12     Task1  Task2  Task3  Task3  Task4

Efficiently handling duplicates in a Python list

You could use numpy.unique:

In [13]: x = np.array([1, 0, 0, 3, 3, 0])

In [14]: values, cluster_id = np.unique(x, return_inverse=True)

In [15]: values
Out[15]: array([0, 1, 3])

In [16]: cluster_id
Out[16]: array([1, 0, 0, 2, 2, 0])

(The cluster IDs are assigned in the order of the sorted unique values, not in the order of a value's first appearance in the input.)

Locations of the items in cluster 0:

In [22]: cid = 0

In [23]: values[cid]
Out[23]: 0

In [24]: (cluster_id == cid).nonzero()[0]
Out[24]: array([1, 2, 5])
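
Building on that, a small sketch (not from the original answer) that collects the locations of every cluster at once:

import numpy as np

x = np.array([1, 0, 0, 3, 3, 0])
values, cluster_id = np.unique(x, return_inverse=True)

# Map each unique value to the positions where it occurs.
locations = {int(v): np.flatnonzero(cluster_id == cid)
             for cid, v in enumerate(values)}
print(locations)
# {0: array([1, 2, 5]), 1: array([0]), 3: array([3, 4])}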

