How Does Condensed Distance Matrix Work? (Pdist)

How does condensed distance matrix work? (pdist)

You can look at it this way: Suppose x is m by n. The possible pairs of m rows, chosen two at a time, is itertools.combinations(range(m), 2), e.g, for m=3:

>>> import itertools
>>> list(combinations(range(3),2))
[(0, 1), (0, 2), (1, 2)]

So if d = pdist(x), the kth tuple in combinations(range(m), 2)) gives the indices of the rows of x associated with d[k].

Example:

>>> x = array([[0,10],[10,10],[20,20]])
>>> pdist(x)
array([ 10.        ,  22.36067977,  14.14213562])

The first element is dist(x[0], x[1]), the second is dist(x[0], x[2]) and the third is dist(x[1], x[2]).

Or you can view it as the elements in the upper triangular part of the square distance matrix, strung together into a 1D array.

E.g.

>>> squareform(pdist(x)) 
array([[  0.   ,  10.   ,  22.361],
       [ 10.   ,   0.   ,  14.142],
       [ 22.361,  14.142,   0.   ]])

>>> y = array([[0,10],[10,10],[20,20],[10,0]])
>>> squareform(pdist(y)) 
array([[  0.   ,  10.   ,  22.361,  14.142],
       [ 10.   ,   0.   ,  14.142,  10.   ],
       [ 22.361,  14.142,   0.   ,  22.361],
       [ 14.142,  10.   ,  22.361,   0.   ]])
>>> pdist(y)
array([ 10.   ,  22.361,  14.142,  14.142,  10.   ,  22.361])

What is the difference between condensed and redundant distance matrices?

If you have a nxn matrix then each pairwise combination from the set N exists twice, once in each order, ab and ba. So if you create a distance matrix from a set of N points you can condense the data by only storing each point once, and neglecting any comparisons between points and themselves.

for example if we have the points a, b, and c we would have the distance matrix

    a    b    c
a   0    ab   ac
b   ba   0    bc
c   ca   cb   0

and the condensed distance matrix,

    a    b    c
         ab   ac
              bc

Because distance masers are unsigned the condensed table retains all the information.

Is the format/structure of SciPy's condensed distance matrix stable?

Honestly, this is a better question for the scipy users or dev list, as it's about future plans for scipy.

However, the structure is fairly rigorously documented in the docstrings for both scipy.spatial.pdist and in scipy.spatial.squareform.

E.g. for pdist:

Returns a condensed distance matrix Y.  For
    each :math:`i` and :math:`j` (where :math:`i<j<n`), the
    metric ``dist(u=X[i], v=X[j])`` is computed and stored in the
    :math:`ij`th entry.

    See ``squareform`` for information on how to calculate the index of
    this entry or to convert the condensed distance matrix to a
    redundant square matrix.

Becuase of this, and the fact that so many other functions in scipy.spatial expect a distance matrix in this form, I'd seriously doubt it's going to change without a number of depreciation warnings and announcements.

Modules in scipy itself (as opposed to scipy's scikits) are fairly stable, and there's a great deal of consideration put into backwards compatibility when changes are made (and because of this, there's quite a bit of legacy "cruft" in scipy: e.g. the fact that the core scipy module is just numpy with different defaults on a couple of functions.).

Calculate condensed distance matrix with varying length data points

If I understand correctly, you want to compute the distances using awarp, but that distance function takes signals of varying length. So you need to avoid creating an array, because NumPy doesn't allow 'ragged' arrays. Then I think you can do this:

from itertools import combinations
from scipy.spatial.distance import squareform

# Example distance function.
def dfun(u, v):
    return u.sum() + v.sum()

dat0 = np.array([-1, 1,-4, 1])
dat1 = np.array([-1, 1,-3, 1, 1])
dat2 = np.array([ 1,-6])
data = [dat0, dat1, dat2]

dists = [dfun(a, b) for a, b in combinations(data, r=2)]
squareform(dists)

For your example, this yields:

array([[ 0, -4, -8],
       [-4,  0, -6],
       [-8, -6,  0]])

And if dfun = awarp then you get this output for those signals:

array([[ 0.        ,  0.        ,  2.23606798],
       [ 0.        ,  0.        ,  2.44948974],
       [ 2.23606798,  2.44948974,  0.        ]])

I guess this approach only works if dfun is commutative, which I think awarp is.

Convert scipy condensed distance matrix to lower matrix read by rows

Here's a quick implementation--but it creates the square redundant distance matrix as an intermediate step:

In [128]: import numpy as np

In [129]: from scipy.spatial.distance import squareform

c is the condensed form of the distance matrix:

In [130]: c = np.array([1, 2, 3, 4, 5, 6])

d is the redundant square distance matrix:

In [131]: d = squareform(c)

Here's your condensed lower triangle distances:

In [132]: d[np.tril_indices(d.shape[0], -1)]
Out[132]: array([1, 2, 4, 3, 5, 6])

Here's a method that avoids forming the redundant distance matrix. The function condensed_index(i, j, n) takes the row i and column j of the redundant distance matrix, with j > i, and returns the corresponding index in the condensed distance array.

In [169]: def condensed_index(i, j, n):
     ...:     return n*i - i*(i+1)//2 + j - i - 1
     ...:

As above, c is the condensed distance array.

In [170]: c
Out[170]: array([1, 2, 3, 4, 5, 6])

In [171]: n = 4

In [172]: i, j = np.tril_indices(n, -1)

Note that the arguments are reversed in the following call:

In [173]: indices = condensed_index(j, i, n)

indices gives the desired permutation of the condensed distance array.

In [174]: c[indices]
Out[174]: array([1, 2, 4, 3, 5, 6])

(Basically the same function as condensed_index(i, j, n) was given in several answers to this question.)

python how to get proper distance value out of scipy condensed distance matrix

This vector is in condensed form. It enumerates all pairs of indices in a natural order (in your example 0,1 0,2 0,3 0,4 1,2 1,3 1,4 2,3 2,4 ) and yields the distance between the elements at these array entries.

There is also the squareform function, which transforms the condensed form into a square matrix form (and vice versa). The square matrix form is exactly what you expect, i.e. at entry ij (row i, column j), it stores the distance between the i-th and j-th entry. For example, if you add print squareform(d) at the end of you code, the output will be:

array([[ 0.,  3.,  1.,  4.],
       [ 3.,  0.,  4.,  1.],
       [ 1.,  4.,  0.,  5.],
       [ 4.,  1.,  5.,  0.]])

How Does Condensed Distance Matrix Work? (Pdist)