Vectorised Haversine Formula with a Pandas Dataframe

I can't confirm if the calculations are correct but the following worked:

In [11]:

from numpy import cos, sin, arcsin, sqrt
from math import radians

def haversine(row):
    lon1 = -56.7213600
    lat1 = 37.2175900
    lon2 = row['LON']
    lat2 = row['LAT']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a)) 
    km = 6367 * c
    return km

df['distance'] = df.apply(haversine, axis=1)
df
Out[11]:
         SEAZ        LAT        LON     distance
index                                           
1      296.40  58.731221  28.377411  6275.791920
2      274.72  56.814832  31.292324  6509.727368
3      192.25  52.064988  35.801864  6990.144378
4       34.34  68.818875  67.193367  7357.221846
5      271.05  56.669988  31.688062  6538.047542
6      131.88  48.554622  49.782773  8036.968198
7      350.71  64.774272  31.395378  6229.733699
8      214.44  53.519292  33.845856  6801.670843
9        1.46  67.943374  38.484252  6418.754323
10     273.55  53.343731   4.471666  4935.394528
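
One way to gain some confidence in the formula is to test it on cases with a known answer. The sketch below is not part of the original answer: haversine_km is a hypothetical helper that uses the same formula and the same 6367 km radius. It checks two analytic cases: one degree of latitude along a meridian should be R*pi/180, and two antipodal points on the equator should be R*pi apart.

```python
import math

EARTH_RADIUS_KM = 6367  # same radius as the answer above


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))


# One degree of latitude along a meridian is an arc of pi/180 radians.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
# Antipodal points on the equator are half a great circle apart.
half_circle = haversine_km(0.0, 0.0, 180.0, 0.0)

print(one_degree, EARTH_RADIUS_KM * math.radians(1))
print(half_circle, EARTH_RADIUS_KM * math.pi)
```

If both pairs of printed values agree, the implementation is at least consistent with the spherical model it assumes.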

The following code is actually slower on such a small dataframe, but I applied it to a 100,000-row df (it assumes numpy has been imported as np and that math has been imported):

In [35]:

%%timeit
df['LAT_rad'], df['LON_rad'] = np.radians(df['LAT']), np.radians(df['LON'])
df['dLON'] = df['LON_rad'] - math.radians(-56.7213600)
df['dLAT'] = df['LAT_rad'] - math.radians(37.2175900)
df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin(df['dLAT']/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(df['LAT_rad']) * np.sin(df['dLON']/2)**2))

1 loops, best of 3: 17.2 ms per loop

Compared with the apply version, which took 4.3 s, this is nearly 250 times quicker, something to note for the future.

If we compress all of the above into a one-liner:

In [39]:

%timeit df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin((np.radians(df['LAT']) - math.radians(37.2175900))/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(np.radians(df['LAT'])) * np.sin((np.radians(df['LON']) - math.radians(-56.7213600))/2)**2))
100 loops, best of 3: 12.6 ms per loop

We observe a further speed-up, now a factor of ~341 over the apply version.
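
Before relying on the faster version, it may be worth confirming that it returns the same numbers as the row-wise one. The following is a minimal sketch under stated assumptions: haversine_scalar and haversine_vectorised are hypothetical names, the coordinates are randomly generated rather than the original data, and the reference point and 6367 km radius are copied from the answer.

```python
import math

import numpy as np

REF_LON, REF_LAT = -56.7213600, 37.2175900  # reference point from the answer
EARTH_RADIUS_KM = 6367


def haversine_scalar(lon, lat):
    """Distance in km from the reference point to one (lon, lat) pair in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, [REF_LON, REF_LAT, lon, lat])
    a = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return EARTH_RADIUS_KM * 2 * math.asin(math.sqrt(a))


def haversine_vectorised(lon, lat):
    """Same formula applied to whole NumPy arrays at once."""
    lon_rad, lat_rad = np.radians(lon), np.radians(lat)
    a = (np.sin((lat_rad - math.radians(REF_LAT)) / 2) ** 2
         + math.cos(math.radians(REF_LAT)) * np.cos(lat_rad)
         * np.sin((lon_rad - math.radians(REF_LON)) / 2) ** 2)
    return EARTH_RADIUS_KM * 2 * np.arcsin(np.sqrt(a))


rng = np.random.default_rng(0)  # synthetic coordinates, not the original data
lon = rng.uniform(-180, 180, 1_000)
lat = rng.uniform(-90, 90, 1_000)

# Row-by-row loop (what apply does) versus one array expression.
row_wise = np.array([haversine_scalar(x, y) for x, y in zip(lon, lat)])
vectorised = haversine_vectorised(lon, lat)
print(np.allclose(row_wise, vectorised))
```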

How to use haversine distance using haversine library on pandas dataframe

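You can use itertools.product to generate every pairing of rows from the two dataframes, then call haversine on each pair, as below. The haversine library expects each point as a (lat, lon) tuple in decimal degrees and returns kilometres by default: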

import haversine as hs
import pandas as pd
import numpy as np
import itertools

res = []
for a, b in itertools.product(df1.values, df2.values):
    res.append(hs.haversine(a, b))

# One row per point in df1, one column per point in df2.
df = pd.DataFrame(np.asarray(res).reshape(len(df1), len(df2)))
print(df)

Output:

            0           1           2
0  587.500555   12.058061   29.557005
1  212.580742  365.487782  405.718803
2   46.333180  537.684789  578.072579
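
The loop above calls haversine once per pair in Python, which can become slow for large inputs. As a hedged alternative, the same kind of distance matrix can be sketched with NumPy broadcasting. Everything here is an assumption for illustration: pairwise_haversine_km is a hypothetical helper, rows are taken to be (lat, lon) in degrees, the coordinates are made up, and the 6371 km radius may differ slightly from the one the haversine library uses.

```python
import numpy as np


def pairwise_haversine_km(points_a, points_b, earth_radius_km=6371.0):
    """Distance matrix in km between two arrays of (lat, lon) rows in degrees.

    Returns an array of shape (len(points_a), len(points_b)).
    """
    a = np.radians(np.asarray(points_a, dtype=float))
    b = np.radians(np.asarray(points_b, dtype=float))
    # Add axes so that every row of `a` is paired with every row of `b`.
    lat_a, lon_a = a[:, 0][:, None], a[:, 1][:, None]
    lat_b, lon_b = b[:, 0][None, :], b[:, 1][None, :]
    h = (np.sin((lat_b - lat_a) / 2) ** 2
         + np.cos(lat_a) * np.cos(lat_b) * np.sin((lon_b - lon_a) / 2) ** 2)
    return earth_radius_km * 2 * np.arcsin(np.sqrt(h))


# Made-up coordinates purely for illustration: 3 points against 2 points.
points_a = np.array([[48.8566, 2.3522], [51.5074, -0.1278], [52.5200, 13.4050]])
points_b = np.array([[40.7128, -74.0060], [48.8566, 2.3522]])

matrix = pairwise_haversine_km(points_a, points_b)
print(matrix.shape)
print(matrix)
```

Because the result is built directly with shape (len(points_a), len(points_b)), the two inputs do not need to have the same number of rows.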

Vectorizing Haversine distance calculation in Python

Pass your function to np.vectorize(), then call the vectorised version inside groupby(...).apply, as illustrated below:

haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))

For instance, with sample data as follows:

length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})

Comparing timings for 500 points (haver_loop is the loop-based version and is not shown here):

def haver_vect(data):
    distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
    return distance

%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop

%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
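
Two details of np.vectorize are easy to miss. The NumPy documentation notes that it is provided primarily for convenience rather than performance, and that its implementation is essentially a for loop. Also, otypes=[np.int16] means the results are stored as whole kilometres, so the fractional part is lost. The sketch below illustrates the second point; haversine_pair is a hypothetical scalar helper standing in for the haversine function used above, and the points are made up.

```python
import math

import numpy as np


def haversine_pair(p1, p2, earth_radius_km=6371.0):
    """Hypothetical scalar helper: km between two (lat, lon) tuples in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, [p1[0], p1[1], p2[0], p2[1]])
    a = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return earth_radius_km * 2 * math.asin(math.sqrt(a))


# Object arrays keep each (lat, lon) tuple as a single element,
# like a pandas column of tuples.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 90.0)]
coords = np.empty(len(points), dtype=object)
for i, p in enumerate(points):
    coords[i] = p

origin = np.empty(1, dtype=object)
origin[0] = (0.0, 0.0)

# The length-1 `origin` array is broadcast against every element of `coords`.
as_float = np.vectorize(haversine_pair, otypes=[np.float64])(coords, origin)
as_int16 = np.vectorize(haversine_pair, otypes=[np.int16])(coords, origin)

print(as_float)  # fractional kilometres
print(as_int16)  # whole kilometres only
```

If sub-kilometre precision matters, a float output type is probably the safer choice.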

How to call data from a dataframe into Haversine function

Try this solution:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.    

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

Demo:

In [17]: df
Out[17]:
        lat         lon
0 -6.081689  145.391881
1 -5.207083  145.788700
2 -5.826789  144.295861
3 -6.569828  146.726242
4 -9.443383  147.220050

In [18]: df['dist'] = \
    ...:     haversine_np(df.lon.shift(), df.lat.shift(), df.loc[1:, 'lon'], df.loc[1:, 'lat'])

In [19]: df
Out[19]:
        lat         lon        dist
0 -6.081689  145.391881         NaN
1 -5.207083  145.788700  106.638117
2 -5.826789  144.295861  178.907364
3 -6.569828  146.726242  280.904983
4 -9.443383  147.220050  323.913612
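
shift() moves each value down one row, so row i is paired with row i-1 and the first row has no predecessor, hence the NaN. The sketch below reproduces the same consecutive-point distances with plain NumPy slicing. consecutive_distances_km is a hypothetical helper; the five points and the 6367 km radius are taken from the demo above.

```python
import numpy as np


def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; inputs in decimal degrees (scalars or arrays)."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return 6367 * 2 * np.arcsin(np.sqrt(a))


def consecutive_distances_km(lon, lat):
    """Hypothetical helper: distance from each point to the previous one, NaN first."""
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    out = np.full(lon.shape, np.nan)
    # Pair points 0..n-2 with points 1..n-1.
    out[1:] = haversine_np(lon[:-1], lat[:-1], lon[1:], lat[1:])
    return out


lat = [-6.081689, -5.207083, -5.826789, -6.569828, -9.443383]
lon = [145.391881, 145.788700, 144.295861, 146.726242, 147.220050]
dist = consecutive_distances_km(lon, lat)
print(dist)
```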

Pandas: calculate haversine distance within each group of rows

Try this approach:

import pandas as pd
import numpy as np

# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.

    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

Now we can calculate distances between coordinates belonging to the same id (group):

df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift())).values)

Result:

In [105]: df
Out[105]:
    id        lat        lon       dist
0    1  19.111841  72.910729        NaN
1    1  19.111342  72.908387   0.252243
2    2  19.111542  72.907387        NaN
3    2  19.137815  72.914085   3.004976
4    2  19.119677  72.905081   2.227658
5    2  19.129677  72.905081   1.111949
6    3  19.319677  72.905081        NaN
7    3  19.120217  72.907121  22.179974
8    4  19.420217  72.807121        NaN
9    4  19.520217  73.307121  53.584504
10   5  19.319677  72.905081        NaN
11   5  19.419677  72.805081  15.286775
12   5  19.629677  72.705081  25.594890
13   5  19.111860  72.911347  61.509917
14   5  19.111860  72.931346   2.101215
15   5  19.219677  72.605081  36.304756
16   6  19.319677  72.805082        NaN
17   6  19.419677  72.905086  15.287063

Haversine Function using Pandas Data Frame

The haversine function you are using expects a single float for each argument, whereas iskeleler['lon'] and iskeleler['lat'] are Series, so you need to pass it one value at a time.

This should work to calculate the distance between coordinates in the same row:

iskeleler['density'] = iskeleler.apply(lambda row: haversine(
    row['lon'], row['lat'],
    row['LONGITUDE'], row['LATITUDE']
),axis=1)

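But you are looking for pairwise distances, which would need a for loop with this approach, and that is not efficient. Try sklearn.metrics.pairwise.haversine_distances. It expects [latitude, longitude] pairs in radians and returns distances on the unit sphere, so convert the inputs and multiply by the Earth's radius:

import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# Inputs must be in radians; scale the result by the Earth's radius (~6371 km).
distance_matrix = 6371 * haversine_distances(
    np.radians(iskeleler[['lat', 'lon']]),
    np.radians(iskeleler[['LATITUDE', 'LONGITUDE']])
)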

If you prefer the table structure, then:

distance_table = pd.DataFrame(
    distance_matrix,
    index=pd.MultiIndex.from_frames(iskeleler[['lat', 'lon']]),
    columns=pd.MultiIndex.from_frames(iskeleler[['LATITUDE', 'LONGITUDE']]),
).stack([0, 1]).reset_index(name='distance')

This is one example; there are many ways to create the dataframe from the matrix.

Python: simplifying code by writing it in a more Pandas specific way

The main idea is to use shift to compare consecutive rows. I'm also writing a get_dist function that just wraps your existing distance function, which keeps things readable when I use apply to compute the distances.

def get_dist(row):
    lat1 = row['EQP_GPS_SPEC_LAT_CORD']
    long1 = row['EQP_GPS_SPEC_LONG_CORD']
    lat2 = row['EQP_GPS_SPEC_LAT_CORD_2']
    long2 = row['EQP_GPS_SPEC_LONG_CORD_2']
    return haversine(lat1, long1, lat2, long2)

# Find consecutive rows with matching ser_no, and get coordinates.
coord_cols = ['EQP_GPS_SPEC_LAT_CORD', 'EQP_GPS_SPEC_LONG_CORD']
matching_ser = mttt_pings['ser_no'] == mttt_pings['ser_no'].shift(1)
shift_coords = mttt_pings.shift(1).loc[matching_ser, coord_cols]

# Join shifted coordinates and compute distances.
mttt_pings_shift = mttt_pings.join(shift_coords, how='inner', rsuffix='_2')
mttt_pings['hdist'] = mttt_pings_shift.apply(get_dist, axis=1)

In the above code, I've added the distances to your dataframe. If you want to get the result as a numpy array, you can do:

hdist = mttt_pings['hdist'].values

As a side note, you may want to consider using geopy.distance.vincenty to compute distances between lat/long coordinates. In general, vincenty is more accurate than haversine, although it may take longer to compute. Only very minor modifications to the get_dist function are required to use it. Note that vincenty was removed in geopy 2.0; on current versions, geopy.distance.geodesic takes the same arguments and serves as its replacement.

from geopy.distance import vincenty

def get_dist(row):
    lat1 = row['EQP_GPS_SPEC_LAT_CORD']
    long1 = row['EQP_GPS_SPEC_LONG_CORD']
    lat2 = row['EQP_GPS_SPEC_LAT_CORD_2']
    long2 = row['EQP_GPS_SPEC_LONG_CORD_2']
    return vincenty((lat1, long1), (lat2, long2)).km
