How to "Zip Sort" Parallel Numpy Arrays

How can I zip sort parallel numpy arrays?

b[a.argsort()] should do the trick.

Here's how it works. First you need to find a permutation that sorts a. argsort is a method that computes this:

>>> import numpy
>>> a = numpy.array([2, 3, 1])
>>> p = a.argsort()
>>> p
array([2, 0, 1])

You can easily check that this is right:

>>> a[p]
array([1, 2, 3])

Now apply the same permutation to b.

>>> b = numpy.array([4, 6, 7])
>>> b[p]
array([7, 4, 6])
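Putting the steps together, here is a minimal self-contained sketch. The helper name zip_sort is only for illustration (it is not a NumPy function), and kind="stable" is my own choice so that equal keys keep their original relative order:

```python
import numpy as np

def zip_sort(keys, *others):
    """Sort `keys` and reorder every array in `others` the same way."""
    # One permutation, computed once from the keys...
    p = np.argsort(keys, kind="stable")
    # ...then applied to the keys and to each parallel array.
    return (keys[p],) + tuple(arr[p] for arr in others)

a = np.array([2, 3, 1])
b = np.array([4, 6, 7])
a_sorted, b_sorted = zip_sort(a, b)
print(a_sorted)  # [1 2 3]
print(b_sorted)  # [7 4 6]
```

Any number of parallel arrays can be passed after the keys, as long as they all have the same length as the key array.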

Sort two numpy matrices in parallel, row by row

You can use advanced indexing -

idxx = np.arange(a.shape[0])[:,None],a.argsort(1)
a_out = a[idxx]
b_out = b[idxx]

Sample run -

In [75]: a
Out[75]: 
array([['b', 'c', 'd', 'e'],
       ['a', 'b', 'd', 'e']], 
      dtype='|S1')

In [76]: b
Out[76]: 
array([['2', '1', '4', '3'],
       ['2', '4', '1', '3']], 
      dtype='|S1')

In [77]: a_out
Out[77]: 
array([['b', 'c', 'd', 'e'],
       ['a', 'b', 'd', 'e']], 
      dtype='|S1')

In [78]: b_out
Out[78]: 
array([['2', '1', '4', '3'],
       ['2', '4', '1', '3']], 
      dtype='|S1')
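The rows of a in the sample run above are already sorted, so the outputs equal the inputs. A small sketch with unsorted rows may show the effect more clearly. The integer data here is made up for illustration, and np.take_along_axis (available in NumPy 1.15 and later) is shown only as an alternative spelling of the same gather:

```python
import numpy as np

a = np.array([[3, 1, 2],
              [9, 7, 8]])
b = np.array([[30, 10, 20],
              [90, 70, 80]])

order = a.argsort(axis=1)              # per-row sorting permutation
rows = np.arange(a.shape[0])[:, None]  # column vector of row numbers

# Advanced indexing: row i is reordered by order[i].
a_out = a[rows, order]
b_out = b[rows, order]

# Alternative spelling of the same operation.
b_alt = np.take_along_axis(b, order, axis=1)

print(a_out)  # [[1 2 3], [7 8 9]]
print(b_out)  # [[10 20 30], [70 80 90]]
```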

Sort a numpy array by another array, along a particular axis, using less memory

Would a record array serve your purposes?

>>> a = numpy.zeros((3, 3, 3))
>>> a += numpy.array((1, 3, 2)).reshape((3, 1, 1))
>>> b = numpy.arange(3*3*3).reshape((3, 3, 3))
>>> c = numpy.array(list(zip(a.flatten(), b.flatten())), dtype=[('f', float), ('i', int)]).reshape(3, 3, 3)
>>> c.sort(axis=0)
>>> c['i']
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]]])

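A cleaner way to generate the coupled array, either with an explicit dtype or with field names only (the field types are then inferred from the arrays):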

>>> c = numpy.rec.fromarrays([a, b], dtype=[('f', float), ('i', int)])

>>> c = numpy.rec.fromarrays([a, b], names='f, i')
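For reference, the record-array approach as one runnable script. This is a sketch under a couple of assumptions of mine: I pass order="f" to make the sort key explicit (without it, NumPy compares the fields in dtype order, which gives the same result here), and I write the names without a space:

```python
import numpy as np

a = np.zeros((3, 3, 3))
a += np.array((1, 3, 2)).reshape((3, 1, 1))
b = np.arange(3 * 3 * 3).reshape((3, 3, 3))

# Couple the two arrays element by element into one record array.
c = np.rec.fromarrays([a, b], names="f,i")

# Sort the records in place along axis 0, using field "f" as the key.
c.sort(axis=0, order="f")

sorted_a = c["f"]
sorted_b = c["i"]
print(sorted_b[:, 0, 0])  # [ 0 18  9]
```

The keys along axis 0 are 1, 3, 2, so the slices of b end up in the order 0, 2, 1.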

Sort a numpy array by another array, along a particular axis

You still have to supply indices for the other two dimensions for this to work correctly.

>>> a = numpy.zeros((3, 3, 3))
>>> a += numpy.array((1, 3, 2)).reshape((3, 1, 1))
>>> b = numpy.arange(3*3*3).reshape((3, 3, 3))
>>> sort_indices = numpy.argsort(a, axis=0)
>>> static_indices = numpy.indices((3, 3, 3))
>>> b[sort_indices, static_indices[1], static_indices[2]]
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]]])

numpy.indices returns one array per axis, each with the full shape of the grid, holding at every position the index along that axis (so for n axes you get n such arrays). In other words, this (apologies for the long post):

>>> static_indices
array([[[[0, 0, 0],
         [0, 0, 0],
         [0, 0, 0]],

        [[1, 1, 1],
         [1, 1, 1],
         [1, 1, 1]],

        [[2, 2, 2],
         [2, 2, 2],
         [2, 2, 2]]],

       [[[0, 0, 0],
         [1, 1, 1],
         [2, 2, 2]],

        [[0, 0, 0],
         [1, 1, 1],
         [2, 2, 2]],

        [[0, 0, 0],
         [1, 1, 1],
         [2, 2, 2]]],

       [[[0, 1, 2],
         [0, 1, 2],
         [0, 1, 2]],

        [[0, 1, 2],
         [0, 1, 2],
         [0, 1, 2]],

        [[0, 1, 2],
         [0, 1, 2],
         [0, 1, 2]]]])

These are the identity indices for each axis; when used to index b, they recreate b.

>>> b[static_indices[0], static_indices[1], static_indices[2]]
array([[[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8]],

       [[ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]],

       [[18, 19, 20],
        [21, 22, 23],
        [24, 25, 26]]])

As an alternative to numpy.indices, you could use numpy.ogrid, as unutbu suggests. Since the object generated by ogrid is smaller, I'll create all three axes, just for consistency's sake, but note unutbu's comment for a way to do this by generating only two.

>>> static_indices = numpy.ogrid[0:a.shape[0], 0:a.shape[1], 0:a.shape[2]]
>>> a[sort_indices, static_indices[1], static_indices[2]]
array([[[ 1.,  1.,  1.],
        [ 1.,  1.,  1.],
        [ 1.,  1.,  1.]],

       [[ 2.,  2.,  2.],
        [ 2.,  2.,  2.],
        [ 2.,  2.,  2.]],

       [[ 3.,  3.,  3.],
        [ 3.,  3.,  3.],
        [ 3.,  3.,  3.]]])
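On NumPy 1.15 or later, numpy.take_along_axis should give the same result without building the static indices by hand. It is not part of the answer above, so treat this as a sketch that compares the two on the same data:

```python
import numpy as np

a = np.zeros((3, 3, 3))
a += np.array((1, 3, 2)).reshape((3, 1, 1))
b = np.arange(3 * 3 * 3).reshape((3, 3, 3))

sort_indices = np.argsort(a, axis=0)

# Explicit version: open grids supply the indices for the two untouched axes.
_, j, k = np.ogrid[0:a.shape[0], 0:a.shape[1], 0:a.shape[2]]
b_explicit = b[sort_indices, j, k]

# take_along_axis performs the equivalent gather along axis 0.
b_sorted = np.take_along_axis(b, sort_indices, axis=0)

print(np.array_equal(b_explicit, b_sorted))  # True
```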

How to sort two lists (which reference each other) in the exact same way

One classic approach to this problem is to use the "decorate, sort, undecorate" idiom, which is especially simple using Python's built-in zip function:

>>> list1 = [3,2,4,1, 1]
>>> list2 = ['three', 'two', 'four', 'one', 'one2']
>>> list1, list2 = zip(*sorted(zip(list1, list2)))
>>> list1
(1, 1, 2, 3, 4)
>>> list2 
('one', 'one2', 'two', 'three', 'four')

These of course are no longer lists, but that's easily remedied, if it matters:

>>> list1, list2 = (list(t) for t in zip(*sorted(zip(list1, list2))))
>>> list1
[1, 1, 2, 3, 4]
>>> list2
['one', 'one2', 'two', 'three', 'four']

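It's worth noting that the above may sacrifice speed for terseness; the in-place version, which takes up 3 lines, is a tad faster on my machine for small lists (timed under Python 2, where zip returns a list; in Python 3 the in-place version needs tups = list(zip(list1, list2)) before calling tups.sort()):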

>>> %timeit zip(*sorted(zip(list1, list2)))
100000 loops, best of 3: 3.3 us per loop
>>> %timeit tups = zip(list1, list2); tups.sort(); zip(*tups)
100000 loops, best of 3: 2.84 us per loop

On the other hand, for larger lists, the one-line version could be faster:

>>> %timeit zip(*sorted(zip(list1, list2)))
100 loops, best of 3: 8.09 ms per loop
>>> %timeit tups = zip(list1, list2); tups.sort(); zip(*tups)
100 loops, best of 3: 8.51 ms per loop

As Quantum7 points out, jfs's suggestion is a bit faster still, but it will probably only ever be a little bit faster, because Python uses the very same DSU idiom internally for all key-based sorts. It's just happening a little closer to the bare metal. (This shows just how well optimized the zip routines are!)

I think the zip-based approach is more flexible and is a little more readable, so I prefer it.

Note that when elements of list1 are equal, this approach will end up comparing elements of list2. If elements of list2 don't support comparison, or don't produce a boolean when compared (for example, if list2 is a list of NumPy arrays), this will fail, and if elements of list2 are very expensive to compare, it might be better to avoid comparison anyway.

In that case, you can sort indices as suggested in jfs's answer, or you can give the sort a key function that avoids comparing elements of list2:

result1, result2 = zip(*sorted(zip(list1, list2), key=lambda x: x[0]))

Also, the use of zip(*...) as a transpose fails when the input is empty. If your inputs might be empty, you will have to handle that case separately.
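A sketch that combines the last two points: a key function so that list2 elements are never compared, and a guard for empty input. The function names are hypothetical, and the second function is my reading of the index-sorting idea rather than a quote of jfs's answer:

```python
def sort_together(list1, list2):
    """Return sorted copies of both lists, ordered by list1 only."""
    if not list1:
        # zip(*[]) yields nothing, so unpacking into two names would fail.
        return [], []
    pairs = sorted(zip(list1, list2), key=lambda pair: pair[0])
    result1, result2 = zip(*pairs)
    return list(result1), list(result2)

def sort_by_indices(list1, list2):
    """Same result, obtained by sorting index positions instead of pairs."""
    order = sorted(range(len(list1)), key=list1.__getitem__)
    return [list1[i] for i in order], [list2[i] for i in order]

# Dicts cannot be ordered with <, but that no longer matters here.
keys = [3, 1, 1, 2]
values = [{"n": 3}, {"n": "first 1"}, {"n": "second 1"}, {"n": 2}]
print(sort_together(keys, values))
print(sort_by_indices(keys, values))
print(sort_together([], []))
```

Because sorted is stable, elements with equal keys keep their original relative order in both versions. The index-based version handles empty input without a special case.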

Sort randomly generated Numpy array according to a different array

One way is to first calculate an ordering via argsort, then use this to index your input arrays:

import numpy as np

np.random.seed(0)

ages = np.random.randint(18, 40, size=10)   # [30 33 39 18 21 21 25 27 37 39]
marks = np.random.randint(0, 100, size=10)  # [36 87 70 88 88 12 58 65 39 87]

order = ages.argsort()                      # [3 4 5 6 7 0 1 8 2 9]

print(ages[order])                          # [18 21 21 25 27 30 33 37 39 39]
print(marks[order])                         # [88 88 12 58 65 36 87 39 70 87]

Better way to shuffle two numpy arrays in unison

Your "scary" solution does not appear scary to me. Calling shuffle() for two sequences of the same length results in the same number of calls to the random number generator, and these are the only "random" elements in the shuffle algorithm. By resetting the state, you ensure that the calls to the random number generator will give the same results in the second call to shuffle(), so the whole algorithm will generate the same permutation.
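For readers who have not seen the question, the state-resetting idea can be sketched as follows. This assumes the legacy global numpy.random interface (get_state / set_state), and the arrays are made-up illustration data:

```python
import numpy as np

a = np.arange(10)
b = np.arange(10) * 10   # b[i] "belongs to" a[i]

state = np.random.get_state()  # remember the generator state
np.random.shuffle(a)
np.random.set_state(state)     # rewind, so the same random draws are replayed
np.random.shuffle(b)

# Both arrays received the same permutation, so the pairing is intact.
print(np.array_equal(b, a * 10))  # True
```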

If you don't like this, a different solution would be to store your data in one array instead of two right from the beginning, and create two views into this single array simulating the two arrays you have now. You can use the single array for shuffling and the views for all other purposes.

Example: Let's assume the arrays a and b look like this:

a = numpy.array([[[  0.,   1.,   2.],
                  [  3.,   4.,   5.]],

                 [[  6.,   7.,   8.],
                  [  9.,  10.,  11.]],

                 [[ 12.,  13.,  14.],
                  [ 15.,  16.,  17.]]])

b = numpy.array([[ 0.,  1.],
                 [ 2.,  3.],
                 [ 4.,  5.]])

We can now construct a single array containing all the data:

c = numpy.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
# array([[  0.,   1.,   2.,   3.,   4.,   5.,   0.,   1.],
#        [  6.,   7.,   8.,   9.,  10.,  11.,   2.,   3.],
#        [ 12.,  13.,  14.,  15.,  16.,  17.,   4.,   5.]])

Now we create views simulating the original a and b:

a2 = c[:, :a.size//len(a)].reshape(a.shape)
b2 = c[:, a.size//len(a):].reshape(b.shape)

The data of a2 and b2 is shared with c. To shuffle both arrays simultaneously, use numpy.random.shuffle(c).
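A quick way to convince yourself that a2 and b2 really are views and stay aligned after the shuffle. This is only a sketch: the np.shares_memory check and the final consistency check are my additions, and the arrays are built with arange to match the example values above:

```python
import numpy as np

a = np.arange(18.0).reshape(3, 2, 3)
b = np.arange(6.0).reshape(3, 2)

c = np.c_[a.reshape(len(a), -1), b.reshape(len(b), -1)]
split = a.size // len(a)              # number of columns that belong to a
a2 = c[:, :split].reshape(a.shape)
b2 = c[:, split:].reshape(b.shape)

# Both views share their data with c, so no copies were made.
print(np.shares_memory(a2, c), np.shares_memory(b2, c))  # True True

np.random.shuffle(c)                  # shuffles the rows of c in place

# In this data a[i, 0, 0] == 3 * b[i, 0]; that relation must survive.
print(np.array_equal(a2[:, 0, 0], 3 * b2[:, 0]))  # True
```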

In production code, you would of course try to avoid creating the original a and b at all and right away create c, a2 and b2.

This solution could be adapted to the case that a and b have different dtypes.

Sorting multiple lists together in place

I think "without creating temporary objects" is impossible, especially since "everything is an object" in Python.

You could get O(1) space / number of objects if you implement some sorting algorithm yourself, though if you want O(n log n) time and stability, it's difficult. If you don't care about stability (seems likely, since you say you want to sort by a but then actually sort by a, b and c), heapsort is reasonably easy:

def sort_together_heapsort(a, b, c):
    n = len(a)
    def swap(i, j):
        a[i], a[j] = a[j], a[i]
        b[i], b[j] = b[j], b[i]
        c[i], c[j] = c[j], c[i]
    def siftdown(i):
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                return
            swap(i, imax)
            i = imax
    for i in range(n // 2)[::-1]:
        siftdown(i)
    while n := n - 1:
        swap(0, n)
        siftdown(0)

Anyway, if someone's interested in just saving some amount of memory, that can be done by decorating in-place (building tuples and storing them in a):

def sort_together_decorate_in_a(a, b, c):
    for i, a[i] in enumerate(zip(a, b, c)):
        pass
    a.sort()
    for i, [a[i], b[i], c[i]] in enumerate(a):
        pass

Or if you trust that list.sort will ask for keys for the elements in order (at least in CPython it does, already did so when the key parameter was introduced 18 years ago, and I suspect will keep doing so):

def sort_together_iter_key(a, b, c):
    it = iter(a)
    b.sort(key=lambda _: next(it))
    it = iter(a)
    c.sort(key=lambda _: next(it))
    a.sort()

Testing memory and time with three lists of 100,000 elements:

15,072,520 bytes   152 ms  sort_together_sorted_zip
15,072,320 bytes   166 ms  sort_together_sorted_zip_2
14,272,576 bytes   152 ms  sort_together_sorted_zip_X
 6,670,708 bytes   126 ms  sort_together_decorate_in_a
 6,670,772 bytes   177 ms  sort_together_decorate_in_first_X
 5,190,212 bytes   342 ms  sort_multi_by_a_guest_X
 1,597,400 bytes   100 ms  sort_together_iter_key
 1,597,448 bytes   102 ms  sort_together_iter_key_X
       744 bytes  1584 ms  sort_together_heapsort
       704 bytes  1663 ms  sort_together_heapsort_X
       168 bytes  1326 ms  sort_together_heapsort_opti
       188 bytes  1512 ms  sort_together_heapsort_opti_X

Note:

The second solution (sort_together_sorted_zip_2) is a shortened/improved version of yours; it needs no temporary variables or conversions to lists.
The solutions with the _X suffix are versions that take arbitrarily many lists as parameters.
The sort_multi_by_a_guest_X solution is from @a_guest's answer. Runtime-wise, it currently benefits from my data being random, as that doesn't expose that solution's worst-case complexity of O(m * n²), where m is the number of lists and n is the length of each list.

Testing memory and time with ten lists of 100,000 elements:

19,760,808 bytes   388 ms  sort_together_sorted_zip_X
12,159,100 bytes   425 ms  sort_together_decorate_in_first_X
 5,190,292 bytes  1249 ms  sort_multi_by_a_guest_X
 1,597,528 bytes   393 ms  sort_together_iter_key_X
       704 bytes  4186 ms  sort_together_heapsort_X
       188 bytes  4032 ms  sort_together_heapsort_opti_X

The whole code:

import tracemalloc as tm
from random import random
from timeit import timeit

def sort_together_sorted_zip(a, b, c):
    a_sorted, b_sorted, c_sorted = map(list, zip(*sorted(zip(a, b, c))))
    a[:] = a_sorted
    b[:] = b_sorted
    c[:] = c_sorted

def sort_together_sorted_zip_2(a, b, c):
    a[:], b[:], c[:] = zip(*sorted(zip(a, b, c)))

def sort_together_sorted_zip_X(*lists):
    sorteds = zip(*sorted(zip(*lists)))
    for lst, lst[:] in zip(lists, sorteds):
        pass

def sort_together_decorate_in_a(a, b, c):
    for i, a[i] in enumerate(zip(a, b, c)):
        pass
    a.sort()
    for i, [a[i], b[i], c[i]] in enumerate(a):
        pass

def sort_together_decorate_in_first_X(*lists):
    first = lists[0]
    for i, first[i] in enumerate(zip(*lists)):
        pass
    first.sort()
    for i, values in enumerate(first):
        for lst, lst[i] in zip(lists, values):
            pass

def sort_together_iter_key(a, b, c):
    it = iter(a)
    b.sort(key=lambda _: next(it))
    it = iter(a)
    c.sort(key=lambda _: next(it))
    a.sort()

def sort_together_iter_key_X(*lists):
    for lst in lists[1:]:
        it = iter(lists[0])
        lst.sort(key=lambda _: next(it))
    lists[0].sort()

def sort_together_heapsort(a, b, c):
    n = len(a)
    def swap(i, j):
        a[i], a[j] = a[j], a[i]
        b[i], b[j] = b[j], b[i]
        c[i], c[j] = c[j], c[i]
    def siftdown(i):
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                return
            swap(i, imax)
            i = imax
    for i in range(n // 2)[::-1]:
        siftdown(i)
    while n := n - 1:
        swap(0, n)
        siftdown(0)

def sort_together_heapsort_X(*lists):
    a = lists[0]
    n = len(a)
    def swap(i, j):
        for lst in lists:
            lst[i], lst[j] = lst[j], lst[i]
    def siftdown(i):
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                return
            swap(i, imax)
            i = imax
    for i in range(n // 2)[::-1]:
        siftdown(i)
    while n := n - 1:
        swap(0, n)
        siftdown(0)

def sort_together_heapsort_opti(a, b, c):
    # Avoid inner functions and range-loop to minimize memory.
    # Makes it faster, too. But duplicates code. Not recommended.
    n = len(a)
    i0 = n // 2 - 1
    while i0 >= 0:
        i = i0
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                break
            a[i], a[imax] = a[imax], a[i]
            b[i], b[imax] = b[imax], b[i]
            c[i], c[imax] = c[imax], c[i]
            i = imax
        i0 -= 1
    while n := n - 1:
        a[0], a[n] = a[n], a[0]
        b[0], b[n] = b[n], b[0]
        c[0], c[n] = c[n], c[0]
        i = 0
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                break
            a[i], a[imax] = a[imax], a[i]
            b[i], b[imax] = b[imax], b[i]
            c[i], c[imax] = c[imax], c[i]
            i = imax

def sort_together_heapsort_opti_X(*lists):
    # Avoid inner functions and range-loop to minimize memory.
    # Makes it faster, too. But duplicates code. Not recommended.
    a = lists[0]
    n = len(a)
    i0 = n // 2 - 1
    while i0 >= 0:
        i = i0
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                break
            for lst in lists:
                lst[i], lst[imax] = lst[imax], lst[i]
            i = imax
        i0 -= 1
    while n := n - 1:
        for lst in lists:
            lst[0], lst[n] = lst[n], lst[0]
        i = 0
        while (kid := 2*i+1) < n:
            imax = kid if a[kid] > a[i] else i
            kid += 1
            if kid < n and a[kid] > a[imax]:
                imax = kid
            if imax == i:
                break
            for lst in lists:
                lst[i], lst[imax] = lst[imax], lst[i]
            i = imax

def sort_multi_by_a_guest_X(a, *lists):
    indices = list(range(len(a)))
    indices.sort(key=lambda i: a[i])
    a.sort()
    for lst in lists:
        for i, j in enumerate(indices):
            while j < i:
                j = indices[j]
            lst[i], lst[j] = lst[j], lst[i]

funcs = [
    sort_together_sorted_zip,
    sort_together_sorted_zip_2,
    sort_together_sorted_zip_X,
    sort_together_decorate_in_a,
    sort_together_decorate_in_first_X,
    sort_multi_by_a_guest_X,
    sort_together_iter_key,
    sort_together_iter_key_X,
    sort_together_heapsort,
    sort_together_heapsort_X,
    sort_together_heapsort_opti,
    sort_together_heapsort_opti_X,
]

n = 100000
a0 = [random() for _ in range(n)]
b0 = [x + 1 for x in a0]
c0 = [x + 2 for x in a0]

for _ in range(3):
    for func in funcs:

        a, b, c = a0[:], b0[:], c0[:]
        time = timeit(lambda: func(a, b, c), number=1)
        assert a == sorted(a0)
        assert b == sorted(b0)
        assert c == sorted(c0)

        a, b, c = a0[:], b0[:], c0[:]
        tm.start()
        func(a, b, c)
        memory = tm.get_traced_memory()[1] 
        tm.stop()

        print(f'{memory:10,} bytes  {int(time * 1e3):4} ms  {func.__name__}')
    print()