How does condensed distance matrix work? (pdist)
You can look at it this way: Suppose x
is m by n. The possible pairs of m
rows, chosen two at a time, is itertools.combinations(range(m), 2)
, e.g, for m=3
:
>>> import itertools
>>> list(combinations(range(3),2))
[(0, 1), (0, 2), (1, 2)]
So if d = pdist(x)
, the k
th tuple in combinations(range(m), 2))
gives the indices of the rows of x
associated with d[k]
.
Example:
>>> x = array([[0,10],[10,10],[20,20]])
>>> pdist(x)
array([ 10. , 22.36067977, 14.14213562])
The first element is dist(x[0], x[1])
, the second is dist(x[0], x[2])
and the third is dist(x[1], x[2])
.
Or you can view it as the elements in the upper triangular part of the square distance matrix, strung together into a 1D array.
E.g.
>>> squareform(pdist(x))
array([[ 0. , 10. , 22.361],
[ 10. , 0. , 14.142],
[ 22.361, 14.142, 0. ]])
>>> y = array([[0,10],[10,10],[20,20],[10,0]])
>>> squareform(pdist(y))
array([[ 0. , 10. , 22.361, 14.142],
[ 10. , 0. , 14.142, 10. ],
[ 22.361, 14.142, 0. , 22.361],
[ 14.142, 10. , 22.361, 0. ]])
>>> pdist(y)
array([ 10. , 22.361, 14.142, 14.142, 10. , 22.361])
What is the difference between condensed and redundant distance matrices?
If you have a nxn matrix then each pairwise combination from the set N exists twice, once in each order, ab and ba. So if you create a distance matrix from a set of N points you can condense the data by only storing each point once, and neglecting any comparisons between points and themselves.
for example if we have the points a, b, and c we would have the distance matrix
a b c
a 0 ab ac
b ba 0 bc
c ca cb 0
and the condensed distance matrix,
a b c
ab ac
bc
Because distance masers are unsigned the condensed table retains all the information.
Is the format/structure of SciPy's condensed distance matrix stable?
Honestly, this is a better question for the scipy users or dev list, as it's about future plans for scipy.
However, the structure is fairly rigorously documented in the docstrings for both scipy.spatial.pdist
and in scipy.spatial.squareform
.
E.g. for pdist
:
Returns a condensed distance matrix Y. For
each :math:`i` and :math:`j` (where :math:`i<j<n`), the
metric ``dist(u=X[i], v=X[j])`` is computed and stored in the
:math:`ij`th entry.
See ``squareform`` for information on how to calculate the index of
this entry or to convert the condensed distance matrix to a
redundant square matrix.
Becuase of this, and the fact that so many other functions in scipy.spatial
expect a distance matrix in this form, I'd seriously doubt it's going to change without a number of depreciation warnings and announcements.
Modules in scipy
itself (as opposed to scipy's scikits
) are fairly stable, and there's a great deal of consideration put into backwards compatibility when changes are made (and because of this, there's quite a bit of legacy "cruft" in scipy
: e.g. the fact that the core scipy
module is just numpy with different defaults on a couple of functions.).
Calculate condensed distance matrix with varying length data points
If I understand correctly, you want to compute the distances using awarp
, but that distance function takes signals of varying length. So you need to avoid creating an array, because NumPy doesn't allow 'ragged' arrays. Then I think you can do this:
from itertools import combinations
from scipy.spatial.distance import squareform
# Example distance function.
def dfun(u, v):
return u.sum() + v.sum()
dat0 = np.array([-1, 1,-4, 1])
dat1 = np.array([-1, 1,-3, 1, 1])
dat2 = np.array([ 1,-6])
data = [dat0, dat1, dat2]
dists = [dfun(a, b) for a, b in combinations(data, r=2)]
squareform(dists)
For your example, this yields:
array([[ 0, -4, -8],
[-4, 0, -6],
[-8, -6, 0]])
And if dfun = awarp
then you get this output for those signals:
array([[ 0. , 0. , 2.23606798],
[ 0. , 0. , 2.44948974],
[ 2.23606798, 2.44948974, 0. ]])
I guess this approach only works if dfun
is commutative, which I think awarp
is.
Convert scipy condensed distance matrix to lower matrix read by rows
Here's a quick implementation--but it creates the square redundant distance matrix as an intermediate step:
In [128]: import numpy as np
In [129]: from scipy.spatial.distance import squareform
c
is the condensed form of the distance matrix:
In [130]: c = np.array([1, 2, 3, 4, 5, 6])
d
is the redundant square distance matrix:
In [131]: d = squareform(c)
Here's your condensed lower triangle distances:
In [132]: d[np.tril_indices(d.shape[0], -1)]
Out[132]: array([1, 2, 4, 3, 5, 6])
Here's a method that avoids forming the redundant distance matrix. The function condensed_index(i, j, n)
takes the row i
and column j
of the redundant distance matrix, with j
> i
, and returns the corresponding index in the condensed distance array.
In [169]: def condensed_index(i, j, n):
...: return n*i - i*(i+1)//2 + j - i - 1
...:
As above, c
is the condensed distance array.
In [170]: c
Out[170]: array([1, 2, 3, 4, 5, 6])
In [171]: n = 4
In [172]: i, j = np.tril_indices(n, -1)
Note that the arguments are reversed in the following call:
In [173]: indices = condensed_index(j, i, n)
indices
gives the desired permutation of the condensed distance array.
In [174]: c[indices]
Out[174]: array([1, 2, 4, 3, 5, 6])
(Basically the same function as condensed_index(i, j, n)
was given in several answers to this question.)
python how to get proper distance value out of scipy condensed distance matrix
This vector is in condensed form. It enumerates all pairs of indices in a natural order (in your example 0,1
0,2
0,3
0,4
1,2
1,3
1,4
2,3
2,4
) and yields the distance between the elements at these array entries.
There is also the squareform function, which transforms the condensed form into a square matrix form (and vice versa). The square matrix form is exactly what you expect, i.e. at entry ij (row i, column j), it stores the distance between the i-th and j-th entry. For example, if you add print squareform(d)
at the end of you code, the output will be:
array([[ 0., 3., 1., 4.],
[ 3., 0., 4., 1.],
[ 1., 4., 0., 5.],
[ 4., 1., 5., 0.]])
Related Topics
Index of Duplicates Items in a Python List
How to Create a "View" on a Python List
Paramiko Ssh Die/Hang with Big Output
Multiple Variables in a 'With' Statement
How to Apply Piecewise Linear Fit in Python
Getting Computer's Utc Offset in Python
Adding a Background Image to a Plot
How to Convert a Time.Struct_Time Object into a Datetime Object
How to Convert CSV File to Multiline JSON
Remove Trailing Newline from the Elements of a String List
How to Make File Creation an Atomic Operation
Getting a Callback When a Tkinter Listbox Selection Is Changed
Filling in Login Forms in Instagram Using Selenium and Webdriver (Chrome) Python Osx
How to Set Class Attributes from Variable Arguments (Kwargs) in Python
Could Not Find a Version That Satisfies the Requirement Tensorflow
Inserting Line at Specified Position of a Text File
What Is the Most Pythonic Way to Pop a Random Element from a List
How to Set Up a Virtual Environment for Python in Visual Studio Code