Fast Haversine Approximation (Python/Pandas)

Haversine Distance between consecutive rows for each Customer

I'll reuse the vectorized haversine_np function from derricw's answer:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees).

    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

def distance(x):
    y = x.shift()
    # note the argument order: haversine_np expects (lon, lat) pairs
    return haversine_np(x['Lon'], x['Lat'], y['Lon'], y['Lat']).fillna(0)

df['Distance'] = df.groupby('Customer').apply(distance).reset_index(level=0, drop=True)

Result:

  Customer  Lat  Lon    Distance
0        A    1    2    0.000000
1        A    1    2    0.000000
2        B    3    2    0.000000
3        B    4    2  111.125113
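As a quick sanity check: one degree of latitude at constant longitude corresponds to roughly 111 km on a sphere of this radius, which lines up with the last row of the result above. A minimal standalone snippet (the function is repeated here so it runs on its own):

```python
import numpy as np

# haversine_np as defined above, repeated so this snippet runs standalone
def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

# (Lat, Lon) = (3, 2) -> (4, 2): one degree of latitude at constant longitude
d = haversine_np(2.0, 3.0, 2.0, 4.0)
# about 111.1 km (6367 km * 1 degree expressed in radians)
```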

Pandas: calculate haversine distance within each group of rows

Try this approach:

import pandas as pd
import numpy as np

# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians).

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

Now we can calculate distances between coordinates belonging to the same id (group):

df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift()))
                     .values)

Result:

In [105]: df
Out[105]:
    id        lat        lon       dist
0    1  19.111841  72.910729        NaN
1    1  19.111342  72.908387   0.252243
2    2  19.111542  72.907387        NaN
3    2  19.137815  72.914085   3.004976
4    2  19.119677  72.905081   2.227658
5    2  19.129677  72.905081   1.111949
6    3  19.319677  72.905081        NaN
7    3  19.120217  72.907121  22.179974
8    4  19.420217  72.807121        NaN
9    4  19.520217  73.307121  53.584504
10   5  19.319677  72.905081        NaN
11   5  19.419677  72.805081  15.286775
12   5  19.629677  72.705081  25.594890
13   5  19.111860  72.911347  61.509917
14   5  19.111860  72.931346   2.101215
15   5  19.219677  72.605081  36.304756
16   6  19.319677  72.805082        NaN
17   6  19.419677  72.905086  15.287063
Faster way to calculate distance from coordinates on DataFrame

Using distance() from GeoPy:

from geopy import distance

pd.merge(df2.rename(columns={'Latitude': 'C-Lat', 'Longitude': 'C-Lon'}), df1, how='cross') \
  .assign(Distance=lambda d: d.apply(
      lambda x: distance.distance((x['C-Lat'], x['C-Lon']),
                                  (x['Latitude'], x['Longitude'])).miles, axis=1)) \
  .drop(columns=['C-Lat', 'C-Lon'])

               City                Building   Latitude  Longitude     Distance
0  Santa Barbra, CA  One World Trade Center  40.713005 -74.013190  2464.602573
1  Santa Barbra, CA      Central Park Tower  40.765957 -73.980844  2466.054087
2  Santa Barbra, CA            Willis Tower  41.878872 -87.635908  1759.257288
3  Santa Barbra, CA    111 West 57th Street  40.764760 -73.977581  2466.230348
4  Santa Barbra, CA          One Vanderbilt  40.752971 -73.978541  2466.233832
5  Washington, D.C.  One World Trade Center  40.713005 -74.013190   203.461336
6  Washington, D.C.      Central Park Tower  40.765957 -73.980844   207.017141
7  Washington, D.C.            Willis Tower  41.878872 -87.635908   595.065660
8  Washington, D.C.    111 West 57th Street  40.764760 -73.977581   207.103384
9  Washington, D.C.          One Vanderbilt  40.752971 -73.978541   206.571970
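GeoPy's `distance.distance` computes geodesic distances, which is accurate but slow when applied row by row over a large cross product. If haversine accuracy (within a few tenths of a percent) is acceptable, the merged columns can be handled in one vectorized call instead. A sketch with made-up stand-ins for `df1`/`df2`; the Washington, D.C. coordinate here is approximate and purely for illustration:

```python
import numpy as np
import pandas as pd

def haversine_miles(lat1, lon1, lat2, lon2, earth_radius_mi=3958.8):
    """Vectorized haversine in miles; all inputs in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0)**2
    return earth_radius_mi * 2 * np.arcsin(np.sqrt(a))

# hypothetical stand-ins for the df1/df2 used above
df1 = pd.DataFrame({'Building':  ['One World Trade Center', 'Willis Tower'],
                    'Latitude':  [40.713005, 41.878872],
                    'Longitude': [-74.013190, -87.635908]})
df2 = pd.DataFrame({'City':      ['Washington, D.C.'],
                    'Latitude':  [38.907192],     # approximate D.C. coordinate
                    'Longitude': [-77.036873]})

out = pd.merge(df2.rename(columns={'Latitude': 'C-Lat', 'Longitude': 'C-Lon'}),
               df1, how='cross')
# one vectorized call instead of a per-row apply
out['Distance'] = haversine_miles(out['C-Lat'], out['C-Lon'],
                                  out['Latitude'], out['Longitude'])
out = out.drop(columns=['C-Lat', 'C-Lon'])
```

The haversine figures come out close to, but not identical with, the geodesic values in the table above.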

What is the fastest way to generate a matrix for distances between location with lat and lon?

First of all, do you need the haversine metric at all, and which implementation do you use? A Euclidean metric would be faster to compute, but I assume you have good reasons for choosing haversine.

If so, it may be better to use a more optimized implementation of haversine (though I do not know which one you currently use). Check e.g. this SO question.

I guess you are using pdist and squareform from scipy.spatial.distance. If you look at the implementation behind them (here), you will find that they use a Python for loop when a callable metric is passed. In that case you can instead use a vectorized implementation (e.g. the one from the linked question above).

import numpy as np
import itertools
from scipy.spatial.distance import pdist, squareform
from haversine import haversine # pip install haversine

# original approach
place_coordinates = [(x, y) for x in range(10) for y in range(10)]
d = pdist(place_coordinates, metric=haversine)

# approach using combinations
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
d2 = [haversine(x, y) for (x, y) in place_coordinates_comb]

# just ensure that using combinations gives the same results as using pdist
np.testing.assert_array_equal(d, d2)

# vectorized version (taken from the link above)
# 1) create combinations (each flattened tuple is (lat1, lon1, lat2, lon2),
#    since the haversine package treats each point as a (lat, lon) pair)
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
place_coordinates_comb_flatten = [(*x, *y) for (x, y) in place_coordinates_comb]
# 2) bring the data into the format required by haversine_np,
#    which takes (lon1, lat1, lon2, lat2) as arguments
lat1, lon1, lat2, lon2 = np.array(place_coordinates_comb_flatten).T
# 3) vectorized computation, using haversine_np from the first answer above
d_vect = haversine_np(lon1, lat1, lon2, lat2)

# the results differ slightly from the haversine package because of the earth
# radius used (6367 km here vs. ~6371 km there), so compare with a tolerance
# (use the same radius to get identical results)
np.testing.assert_allclose(d, d_vect, rtol=1e-3)

When you compare times (absolute numbers will differ based on used machine):

%timeit pdist(place_coordinates, metric=haversine)
# 15.7 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit haversine_np(lon1, lat1, lon2, lat2)
# 241 µs ± 7.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

That's quite a lot (~60x faster). When you have a really long array (how many coordinates are you using?), this can help a lot.

Finally, you can combine it using your code:

place_correlation = pd.DataFrame(squareform(d_vect), index=place_coordinates, columns=place_coordinates)

A further improvement could be to use a cheaper metric (e.g. Euclidean) to quickly rule out pairs that are clearly farther apart than 10 km, and then calculate haversine only for the rest.
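That pre-filter idea can be sketched as follows. This is a hypothetical outline, assuming only distances within a 10 km cutoff are needed; it uses a cheap planar (equirectangular) approximation to discard far-apart pairs before running haversine on the survivors:

```python
import numpy as np

# same vectorized haversine as in the first answer, repeated so this runs standalone
def haversine_np(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon, dlat = lon2 - lon1, lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    return 6367 * 2 * np.arcsin(np.sqrt(a))

rng = np.random.default_rng(0)
lat = 19 + rng.random(1000)                    # made-up coordinates
lon = 72 + rng.random(1000)

i, j = np.triu_indices(len(lat), k=1)          # all unique pairs

# cheap planar pre-filter: scale degree differences to km and apply a small
# safety margin on the 10 km cutoff, since this is only an approximation
km_per_deg = 111.32
dy = (lat[j] - lat[i]) * km_per_deg
dx = (lon[j] - lon[i]) * km_per_deg * np.cos(np.radians((lat[i] + lat[j]) / 2))
candidate = np.hypot(dx, dy) <= 10 * 1.01

# exact haversine only for the surviving pairs
d = haversine_np(lon[i][candidate], lat[i][candidate],
                 lon[j][candidate], lat[j][candidate])
```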


