How to "Zip Sort" Parallel Numpy Arrays

How can I zip sort parallel numpy arrays?

b[a.argsort()] should do the trick.

Here's how it works. First you need to find a permutation that sorts a. argsort is a method that computes this:

>>> a = numpy.array([2, 3, 1])
>>> p = a.argsort()
>>> p
[2, 0, 1]

You can easily check that this is right:

>>> a[p]
array([1, 2, 3])

Now apply the same permutation to b.

>>> b = numpy.array([4, 6, 7])
>>> b[p]
array([7, 4, 6])

Sort two numpy matrices in parallel, row by row

You can use advanced indexing -

idxx = np.arange(a.shape[0])[:,None],a.argsort(1)
a_out = a[idxx]
b_out = b[idxx]

Sample run -

In [75]: a
Out[75]:
array([['b', 'c', 'd', 'e'],
['a', 'b', 'd', 'e']],
dtype='|S1')

In [76]: b
Out[76]:
array([['2', '1', '4', '3'],
['2', '4', '1', '3']],
dtype='|S1')

In [77]: a_out
Out[77]:
array([['b', 'c', 'd', 'e'],
['a', 'b', 'd', 'e']],
dtype='|S1')

In [78]: b_out
Out[78]:
array([['2', '1', '4', '3'],
['2', '4', '1', '3']],
dtype='|S1')

Sort a numpy array by another array, along a particular axis, using less memory

Would a record array serve your purposes?

>>> a = numpy.zeros((3, 3, 3))
>>> a += numpy.array((1, 3, 2)).reshape((3, 1, 1))
>>> b = numpy.arange(3*3*3).reshape((3, 3, 3))
>>> c = numpy.array(zip(a.flatten(), b.flatten()), dtype=[('f', float), ('i', int)]).reshape(3, 3, 3)
>>> c.sort(axis=0)
>>> c['i']
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],

[[18, 19, 20],
[21, 22, 23],
[24, 25, 26]],

[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]]])

A cleaner way to generate the coupled array:

>>> c = numpy.rec.fromarrays([a, b], dtype=[('f', float), ('i', int)])

or

>>> c = numpy.rec.fromarrays([a, b], names='f, i')

Sort a numpy array by another array, along a particular axis

You still have to supply indices for the other two dimensions for this to work correctly.

>>> a = numpy.zeros((3, 3, 3))
>>> a += numpy.array((1, 3, 2)).reshape((3, 1, 1))
>>> b = numpy.arange(3*3*3).reshape((3, 3, 3))
>>> sort_indices = numpy.argsort(a, axis=0)
>>> static_indices = numpy.indices((3, 3, 3))
>>> b[sort_indices, static_indices[1], static_indices[2]]
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],

[[18, 19, 20],
[21, 22, 23],
[24, 25, 26]],

[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]]])

numpy.indices calculates the indices of each axis of the array when "flattened" through the other two axes (or n - 1 axes where n = total number of axes). In other words, this (apologies for the long post):

>>> static_indices
array([[[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]],

[[1, 1, 1],
[1, 1, 1],
[1, 1, 1]],

[[2, 2, 2],
[2, 2, 2],
[2, 2, 2]]],

[[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],

[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]],

[[0, 0, 0],
[1, 1, 1],
[2, 2, 2]]],

[[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]],

[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]],

[[0, 1, 2],
[0, 1, 2],
[0, 1, 2]]]])

These are the identity indices for each axis; when used to index b, they recreate b.

>>> b[static_indices[0], static_indices[1], static_indices[2]]
array([[[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8]],

[[ 9, 10, 11],
[12, 13, 14],
[15, 16, 17]],

[[18, 19, 20],
[21, 22, 23],
[24, 25, 26]]])

As an alternative to numpy.indices, you could use numpy.ogrid, as unutbu suggests. Since the object generated by ogrid is smaller, I'll create all three axes, just for consistency sake, but note unutbu's comment for a way to do this by generating only two.

>>> static_indices = numpy.ogrid[0:a.shape[0], 0:a.shape[1], 0:a.shape[2]]
>>> a[sort_indices, static_indices[1], static_indices[2]]
array([[[ 1., 1., 1.],
[ 1., 1., 1.],
[ 1., 1., 1.]],

[[ 2., 2., 2.],
[ 2., 2., 2.],
[ 2., 2., 2.]],

[[ 3., 3., 3.],
[ 3., 3., 3.],
[ 3., 3., 3.]]])

How to sort two lists (which reference each other) in the exact same way

One classic approach to this problem is to use the "decorate, sort, undecorate" idiom, which is especially simple using python's built-in zip function:

>>> list1 = [3,2,4,1, 1]
>>> list2 = ['three', 'two', 'four', 'one', 'one2']
>>> list1, list2 = zip(*sorted(zip(list1, list2)))
>>> list1
(1, 1, 2, 3, 4)
>>> list2
('one', 'one2', 'two', 'three', 'four')

These of course are no longer lists, but that's easily remedied, if it matters:

>>> list1, list2 = (list(t) for t in zip(*sorted(zip(list1, list2))))
>>> list1
[1, 1, 2, 3, 4]
>>> list2
['one', 'one2', 'two', 'three', 'four']

It's worth noting that the above may sacrifice speed for terseness; the in-place version, which takes up 3 lines, is a tad faster on my machine for small lists:

>>> %timeit zip(*sorted(zip(list1, list2)))
100000 loops, best of 3: 3.3 us per loop
>>> %timeit tups = zip(list1, list2); tups.sort(); zip(*tups)
100000 loops, best of 3: 2.84 us per loop

On the other hand, for larger lists, the one-line version could be faster:

>>> %timeit zip(*sorted(zip(list1, list2)))
100 loops, best of 3: 8.09 ms per loop
>>> %timeit tups = zip(list1, list2); tups.sort(); zip(*tups)
100 loops, best of 3: 8.51 ms per loop

As Quantum7 points out, JSF's suggestion is a bit faster still, but it will probably only ever be a little bit faster, because Python uses the very same DSU idiom internally for all key-based sorts. It's just happening a little closer to the bare metal. (This shows just how well optimized the zip routines are!)

I think the zip-based approach is more flexible and is a little more readable, so I prefer it.


Note that when elements of list1 are equal, this approach will end up comparing elements of list2. If elements of list2 don't support comparison, or don't produce a boolean when compared (for example, if list2 is a list of NumPy arrays), this will fail, and if elements of list2 are very expensive to compare, it might be better to avoid comparison anyway.

In that case, you can sort indices as suggested in jfs's answer, or you can give the sort a key function that avoids comparing elements of list2:

result1, result2 = zip(*sorted(zip(list1, list2), key=lambda x: x[0]))

Also, the use of zip(*...) as a transpose fails when the input is empty. If your inputs might be empty, you will have to handle that case separately.

Sort randomly generated Numpy array according to a different array

One way is to first calculate an ordering via argsort, then use this to index your input arrays::

import numpy as np

np.random.seed(0)

ages = np.random.randint(18, 40, size=10) # [30 33 39 18 21 21 25 27 37 39]
marks = np.random.randint(0, 100, size=10) # [36 87 70 88 88 12 58 65 39 87]

order = ages.argsort() # [3 4 5 6 7 0 1 8 2 9]

print(ages[order]) # [18 21 21 25 27 30 33 37 39 39]
print(marks[order]) # [88 88 12 58 65 36 87 39 70 87]

Better way to shuffle two numpy arrays in unison

Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.

If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let's assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
[ 3., 4., 5.]],

[[ 6., 7., 8.],
[ 9., 10., 11.]],

[[ 12., 13., 14.],
[ 15., 16., 17.]]])

b = numpy.array([[ 0., 1.],
[ 2., 3.],
[ 4., 5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[ 0., 1., 2., 3., 4., 5., 0., 1.],
# [ 6., 7., 8., 9., 10., 11., 2., 3.],
# [ 12., 13., 14., 15., 16., 17., 4., 5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).

In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.

This solution could be adapted to the case that a and b have different dtypes.

Sorting multiple lists together in place

I think "without creating temporary objects" is impossible, especially since "everything is an object" in Python.

You could get O(1) space / number of objects if you implement some sorting algorithm yourself, though if you want O(n log n) time and stability, it's difficult. If you don't care about stability (seems likely, since you say you want to sort by a but then actually sort by a, b and c), heapsort is reasonably easy:

def sort_together_heapsort(a, b, c):
n = len(a)
def swap(i, j):
a[i], a[j] = a[j], a[i]
b[i], b[j] = b[j], b[i]
c[i], c[j] = c[j], c[i]
def siftdown(i):
while (kid := 2*i+1) < n:
imax = kid if a[kid] > a[i] else i
kid += 1
if kid < n and a[kid] > a[imax]:
imax = kid
if imax == i:
return
swap(i, imax)
i = imax
for i in range(n // 2)[::-1]:
siftdown(i)
while n := n - 1:
swap(0, n)
siftdown(0)

Anyway, if someone's interested in just saving some amount of memory, that can be done by decorating in-place (building tuples and storing them in a):

def sort_together_decorate_in_a(a, b, c):
for i, a[i] in enumerate(zip(a, b, c)):
pass
a.sort()
for i, [a[i], b[i], c[i]] in enumerate(a):
pass

Or if you trust that list.sort will ask for keys for the elements in order (at least in CPython it does, already did so when the key parameter was introduced 18 years ago, and I suspect will keep doing so):

def sort_together_iter_key(a, b, c):
it = iter(a)
b.sort(key=lambda _: next(it))
it = iter(a)
c.sort(key=lambda _: next(it))
a.sort()

Testing memory and time with three lists of 100,000 elements:

15,072,520 bytes   152 ms  sort_together_sorted_zip
15,072,320 bytes 166 ms sort_together_sorted_zip_2
14,272,576 bytes 152 ms sort_together_sorted_zip_X
6,670,708 bytes 126 ms sort_together_decorate_in_a
6,670,772 bytes 177 ms sort_together_decorate_in_first_X
5,190,212 bytes 342 ms sort_multi_by_a_guest_X
1,597,400 bytes 100 ms sort_together_iter_key
1,597,448 bytes 102 ms sort_together_iter_key_X
744 bytes 1584 ms sort_together_heapsort
704 bytes 1663 ms sort_together_heapsort_X
168 bytes 1326 ms sort_together_heapsort_opti
188 bytes 1512 ms sort_together_heapsort_opti_X

Note:

  • The second solution is a shortened/improved version of yours, no need for temporary variables and conversions to lists.
  • The solutions with _X suffix are versions that take arbitrarily many lists as parameters.
  • The @a_guest is from their answer. Runtime-wise it currently benefits from my data being random, as that doesn't expose that solution's worst case complexity O(m * n²), where m is the number of lists and n is the length of each list.

Testing memory and time with ten lists of 100,000 elements:

19,760,808 bytes   388 ms  sort_together_sorted_zip_X
12,159,100 bytes 425 ms sort_together_decorate_in_first_X
5,190,292 bytes 1249 ms sort_multi_by_a_guest_X
1,597,528 bytes 393 ms sort_together_iter_key_X
704 bytes 4186 ms sort_together_heapsort_X
188 bytes 4032 ms sort_together_heapsort_opti_X

The whole code (Try it online!):

import tracemalloc as tm
from random import random
from timeit import timeit

def sort_together_sorted_zip(a, b, c):
a_sorted, b_sorted, c_sorted = map(list, zip(*sorted(zip(a, b, c))))
a[:] = a_sorted
b[:] = b_sorted
c[:] = c_sorted

def sort_together_sorted_zip_2(a, b, c):
a[:], b[:], c[:] = zip(*sorted(zip(a, b, c)))

def sort_together_sorted_zip_X(*lists):
sorteds = zip(*sorted(zip(*lists)))
for lst, lst[:] in zip(lists, sorteds):
pass

def sort_together_decorate_in_a(a, b, c):
for i, a[i] in enumerate(zip(a, b, c)):
pass
a.sort()
for i, [a[i], b[i], c[i]] in enumerate(a):
pass

def sort_together_decorate_in_first_X(*lists):
first = lists[0]
for i, first[i] in enumerate(zip(*lists)):
pass
first.sort()
for i, values in enumerate(first):
for lst, lst[i] in zip(lists, values):
pass

def sort_together_iter_key(a, b, c):
it = iter(a)
b.sort(key=lambda _: next(it))
it = iter(a)
c.sort(key=lambda _: next(it))
a.sort()

def sort_together_iter_key_X(*lists):
for lst in lists[1:]:
it = iter(lists[0])
lst.sort(key=lambda _: next(it))
lists[0].sort()

def sort_together_heapsort(a, b, c):
n = len(a)
def swap(i, j):
a[i], a[j] = a[j], a[i]
b[i], b[j] = b[j], b[i]
c[i], c[j] = c[j], c[i]
def siftdown(i):
while (kid := 2*i+1) < n:
imax = kid if a[kid] > a[i] else i
kid += 1
if kid < n and a[kid] > a[imax]:
imax = kid
if imax == i:
return
swap(i, imax)
i = imax
for i in range(n // 2)[::-1]:
siftdown(i)
while n := n - 1:
swap(0, n)
siftdown(0)

def sort_together_heapsort_X(*lists):
a = lists[0]
n = len(a)
def swap(i, j):
for lst in lists:
lst[i], lst[j] = lst[j], lst[i]
def siftdown(i):
while (kid := 2*i+1) < n:
imax = kid if a[kid] > a[i] else i
kid += 1
if kid < n and a[kid] > a[imax]:
imax = kid
if imax == i:
return
swap(i, imax)
i = imax
for i in range(n // 2)[::-1]:
siftdown(i)
while n := n - 1:
swap(0, n)
siftdown(0)

def sort_together_heapsort_opti(a, b, c):
# Avoid inner functions and range-loop to minimize memory.
# Makes it faster, too. But duplicates code. Not recommended.
n = len(a)
i0 = n // 2 - 1
while i0 >= 0:
i = i0
while (kid := 2*i+1) < n:
imax = kid if a[kid] > a[i] else i
kid += 1
if kid < n and a[kid] > a[imax]:
imax = kid
if imax == i:
break
a[i], a[imax] = a[imax], a[i]
b[i], b[imax] = b[imax], b[i]
c[i], c[imax] = c[imax], c[i]
i = imax
i0 -= 1
while n := n - 1:
a[0], a[n] = a[n], a[0]
b[0], b[n] = b[n], b[0]
c[0], c[n] = c[n], c[0]
i = 0
while (kid := 2*i+1) < n:
imax = kid if a[kid] > a[i] else i
kid += 1
if kid < n and a[kid] > a[imax]:
imax = kid
if imax == i:
break
a[i], a[imax] = a[imax], a[i]
b[i], b[imax] = b[imax], b[i]
c[i], c[imax] = c[imax], c[i]
i = imax

def sort_together_heapsort_opti_X(*lists):
# Avoid inner functions and range-loop to minimize memory.
# Makes it faster, too. But duplicates code. Not recommended.
a = lists[0]
n = len(a)
i0 = n // 2 - 1
while i0 >= 0:
i = i0
while (kid := 2*i+1) < n:
imax = kid if a[kid] > a[i] else i
kid += 1
if kid < n and a[kid] > a[imax]:
imax = kid
if imax == i:
break
for lst in lists:
lst[i], lst[imax] = lst[imax], lst[i]
i = imax
i0 -= 1
while n := n - 1:
for lst in lists:
lst[0], lst[n] = lst[n], lst[0]
i = 0
while (kid := 2*i+1) < n:
imax = kid if a[kid] > a[i] else i
kid += 1
if kid < n and a[kid] > a[imax]:
imax = kid
if imax == i:
break
for lst in lists:
lst[i], lst[imax] = lst[imax], lst[i]
i = imax

def sort_multi_by_a_guest_X(a, *lists):
indices = list(range(len(a)))
indices.sort(key=lambda i: a[i])
a.sort()
for lst in lists:
for i, j in enumerate(indices):
while j < i:
j = indices[j]
lst[i], lst[j] = lst[j], lst[i]

funcs = [
sort_together_sorted_zip,
sort_together_sorted_zip_2,
sort_together_sorted_zip_X,
sort_together_decorate_in_a,
sort_together_decorate_in_first_X,
sort_multi_by_a_guest_X,
sort_together_iter_key,
sort_together_iter_key_X,
sort_together_heapsort,
sort_together_heapsort_X,
sort_together_heapsort_opti,
sort_together_heapsort_opti_X,
]

n = 100000
a0 = [random() for _ in range(n)]
b0 = [x + 1 for x in a0]
c0 = [x + 2 for x in a0]

for _ in range(3):
for func in funcs:

a, b, c = a0[:], b0[:], c0[:]
time = timeit(lambda: func(a, b, c), number=1)
assert a == sorted(a0)
assert b == sorted(b0)
assert c == sorted(c0)

a, b, c = a0[:], b0[:], c0[:]
tm.start()
func(a, b, c)
memory = tm.get_traced_memory()[1]
tm.stop()

print(f'{memory:10,} bytes {int(time * 1e3):4} ms {func.__name__}')
print()


Related Topics



Leave a reply



Submit