Vectorised Haversine Formula with a Pandas Dataframe

I can't confirm if the calculations are correct but the following worked:

In [11]:

from numpy import cos, sin, arcsin, sqrt
from math import radians

def haversine(row):
    # fixed reference point, in decimal degrees
    lon1 = -56.7213600
    lat1 = 37.2175900
    lon2 = row['LON']
    lat2 = row['LAT']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a))
    km = 6367 * c
    return km

df['distance'] = df.apply(lambda row: haversine(row), axis=1)
df
Out[11]:
         SEAZ        LAT        LON     distance
index
1      296.40  58.731221  28.377411  6275.791920
2      274.72  56.814832  31.292324  6509.727368
3      192.25  52.064988  35.801864  6990.144378
4       34.34  68.818875  67.193367  7357.221846
5      271.05  56.669988  31.688062  6538.047542
6      131.88  48.554622  49.782773  8036.968198
7      350.71  64.774272  31.395378  6229.733699
8      214.44  53.519292  33.845856  6801.670843
9        1.46  67.943374  38.484252  6418.754323
10     273.55  53.343731   4.471666  4935.394528
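
One way to gain some confidence in the numbers is to test the formula on a case with a known answer. The sketch below is an added check, not part of the original answer: `haversine_km` is a hypothetical standalone version of the same formula with the same 6367 km radius. It compares one degree of latitude along a meridian with the arc length R * pi / 180, and checks that a point is at distance zero from itself.

```python
import numpy as np

EARTH_RADIUS_KM = 6367  # same radius as the answer above

def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between points given in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return EARTH_RADIUS_KM * 2 * np.arcsin(np.sqrt(a))

# One degree of latitude along a meridian should equal R * pi / 180.
one_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
expected = EARTH_RADIUS_KM * np.pi / 180
print(one_degree, expected)

# A point compared with itself should give zero.
same_point = haversine_km(-56.72136, 37.21759, -56.72136, 37.21759)
print(same_point)
```

This only shows that the formula behaves sensibly on simple inputs; it says nothing about whether 6367 km is the right radius for a given application.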

The following vectorised code is actually slower on such a small dataframe, but I timed it on a 100,000-row df:

In [35]:

%%timeit
# assumes `import math` and `import numpy as np` were run in an earlier cell
df['LAT_rad'], df['LON_rad'] = np.radians(df['LAT']), np.radians(df['LON'])
df['dLON'] = df['LON_rad'] - math.radians(-56.7213600)
df['dLAT'] = df['LAT_rad'] - math.radians(37.2175900)
df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin(df['dLAT']/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(df['LAT_rad']) * np.sin(df['dLON']/2)**2))

1 loops, best of 3: 17.2 ms per loop

The apply version took 4.3 s on the same data, so the vectorised code is nearly 250 times quicker, which is worth keeping in mind for the future.

If we compress all of the above into a one-liner:

In [39]:

%timeit df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin((np.radians(df['LAT']) - math.radians(37.2175900))/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(np.radians(df['LAT'])) * np.sin((np.radians(df['LON']) - math.radians(-56.7213600))/2)**2))
100 loops, best of 3: 12.6 ms per loop

This gives a further speed-up: the one-liner is about 341 times quicker than the apply version.
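
The one-liner is hard to read, so it may be worth wrapping the same arithmetic in a function. The following is a minimal sketch: the name `distance_from_point` and its parameters are my own, and the three rows are copied from the sample output above. It keeps the 6367 km radius used in this answer.

```python
import numpy as np
import pandas as pd

def distance_from_point(df, ref_lat, ref_lon, radius_km=6367):
    """Vectorised haversine distance (km) from every row of df to one point."""
    lat = np.radians(df['LAT'])
    lon = np.radians(df['LON'])
    ref_lat, ref_lon = np.radians(ref_lat), np.radians(ref_lon)
    a = (np.sin((lat - ref_lat) / 2) ** 2
         + np.cos(ref_lat) * np.cos(lat) * np.sin((lon - ref_lon) / 2) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))

# First three rows of the sample data shown earlier.
df = pd.DataFrame({'LAT': [58.731221, 56.814832, 52.064988],
                   'LON': [28.377411, 31.292324, 35.801864]})
df['distance'] = distance_from_point(df, 37.2175900, -56.7213600)
print(df)
```

Because every step operates on whole columns, no temporary `LAT_rad` or `dLON` columns are left behind in the dataframe.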

How to use haversine distance using haversine library on pandas dataframe

You can use itertools.product to create every pair of points, then call haversine on each pair, as below:

import haversine as hs
import pandas as pd
import numpy as np
import itertools

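# hs.haversine expects each point as a (lat, lon) pair in decimal degrees
res = []
for a, b in itertools.product(df1.values, df2.values):
    res.append(hs.haversine(a, b))

# rows follow df1, columns follow df2 (also works when the lengths differ)
df = pd.DataFrame(np.asarray(res).reshape(len(df1), len(df2)))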
print(df)

Output:

            0           1           2
0 587.500555 12.058061 29.557005
1 212.580742 365.487782 405.718803
2 46.333180 537.684789 578.072579
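
For larger inputs the Python-level loop over all pairs can become slow. As an alternative, here is a sketch that builds the same kind of distance matrix with plain NumPy broadcasting instead of the haversine library. The function name `pairwise_haversine`, the 6371 km radius, the `lat`/`lon` column order and the sample coordinates are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def pairwise_haversine(df1, df2, radius_km=6371.0):
    """Distance matrix (km): rows follow df1, columns follow df2.

    Assumes both frames hold (lat, lon) in decimal degrees, in that order.
    """
    lat1, lon1 = np.radians(df1.to_numpy(dtype=float)).T
    lat2, lon2 = np.radians(df2.to_numpy(dtype=float)).T
    # Add an axis so every df1 point is paired with every df2 point.
    dlat = lat2[np.newaxis, :] - lat1[:, np.newaxis]
    dlon = lon2[np.newaxis, :] - lon1[:, np.newaxis]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat1)[:, np.newaxis] * np.cos(lat2)[np.newaxis, :]
         * np.sin(dlon / 2) ** 2)
    return pd.DataFrame(radius_km * 2 * np.arcsin(np.sqrt(a)),
                        index=df1.index, columns=df2.index)

# Hypothetical sample data: three points against two points.
df1 = pd.DataFrame({'lat': [48.8566, 51.5074, 40.7128],
                    'lon': [2.3522, -0.1278, -74.0060]})
df2 = pd.DataFrame({'lat': [48.8566, 52.5200],
                    'lon': [2.3522, 13.4050]})
dist = pairwise_haversine(df1, df2)
print(dist)
```

The result has one row per point in df1 and one column per point in df2, so the two frames do not need the same length.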

Vectorizing Haversine distance calculation in Python

You would provide your function as an argument to np.vectorize(), and could then use it as an argument to pandas.groupby.apply as illustrated below:

haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))

For instance, with sample data as follows:

length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})

compare for 500 points:

def haver_vect(data):
    distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
    return distance

%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop

%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
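
To make the mechanism more concrete: `np.vectorize` wraps a scalar function so that it accepts arrays and broadcasts them, although the NumPy documentation describes it as a convenience that is essentially a for loop. The sketch below is an illustration under my own assumptions, not the code from the answer: `haversine_scalar` is a hypothetical function taking four floats rather than coordinate tuples, and the points are random.

```python
import math
import numpy as np

def haversine_scalar(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Scalar haversine distance in km; inputs are floats in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return radius_km * 2 * math.asin(min(1.0, math.sqrt(a)))

# otypes fixes the dtype of the returned array.
haver_vec = np.vectorize(haversine_scalar, otypes=[float])

rng = np.random.default_rng(0)
lat = rng.uniform(-90, 90, 5)
lon = rng.uniform(-180, 180, 5)

# A column vector against a row vector broadcasts to a 5 x 5 distance matrix.
matrix = haver_vec(lat[:, None], lon[:, None], lat[None, :], lon[None, :])
print(matrix.shape)
```

Since the scalar function is still called once per pair, writing the formula directly with NumPy array operations is likely to be faster again.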

How to call data from a dataframe into Haversine function

Try this solution:

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)

    All args must be of equal length.

    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2

    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km

Demo:

In [17]: df
Out[17]:
lat lon
0 -6.081689 145.391881
1 -5.207083 145.788700
2 -5.826789 144.295861
3 -6.569828 146.726242
4 -9.443383 147.220050

In [18]: df['dist'] = \
    ...:     haversine_np(df.lon.shift(), df.lat.shift(), df.loc[1:, 'lon'], df.loc[1:, 'lat'])

In [19]: df
Out[19]:
lat lon dist
0 -6.081689 145.391881 NaN
1 -5.207083 145.788700 106.638117
2 -5.826789 144.295861 178.907364
3 -6.569828 146.726242 280.904983
4 -9.443383 147.220050 323.913612

Pandas: calculate haversine distance within each group of rows

Try this approach:

import pandas as pd
import numpy as np

# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    slightly modified version of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)

    All (lat, lon) coordinates must have numeric dtypes and be of equal length.

    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))

Now we can calculate distances between coordinates belonging to the same id (group):

df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift())).values)

Result:

In [105]: df
Out[105]:
id lat lon dist
0 1 19.111841 72.910729 NaN
1 1 19.111342 72.908387 0.252243
2 2 19.111542 72.907387 NaN
3 2 19.137815 72.914085 3.004976
4 2 19.119677 72.905081 2.227658
5 2 19.129677 72.905081 1.111949
6 3 19.319677 72.905081 NaN
7 3 19.120217 72.907121 22.179974
8 4 19.420217 72.807121 NaN
9 4 19.520217 73.307121 53.584504
10 5 19.319677 72.905081 NaN
11 5 19.419677 72.805081 15.286775
12 5 19.629677 72.705081 25.594890
13 5 19.111860 72.911347 61.509917
14 5 19.111860 72.931346 2.101215
15 5 19.219677 72.605081 36.304756
16 6 19.319677 72.805082 NaN
17 6 19.419677 72.905086 15.287063
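
A possible variation, sketched here under my own assumptions rather than taken from the answer, is to shift the coordinates within each group first and then call the vectorized function once on whole columns. That avoids both `apply` and `np.concatenate`. The `haversine` below is a simplified version without the `to_radians` switch, and the data are the first five rows of the result table.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Vectorised haversine distance in km; inputs in decimal degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(v) for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# First five rows of the result table above.
df = pd.DataFrame({'id':  [1, 1, 2, 2, 2],
                   'lat': [19.111841, 19.111342, 19.111542, 19.137815, 19.119677],
                   'lon': [72.910729, 72.908387, 72.907387, 72.914085, 72.905081]})

# Shift within each id, so the first row of every group has no predecessor (NaN).
prev = df.groupby('id')[['lat', 'lon']].shift()
df['dist'] = haversine(df['lat'], df['lon'], prev['lat'], prev['lon'])
print(df)
```

The shifted frame keeps the original index, so the result lines up with df even when the rows of one id are not contiguous.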

Haversine Function using Pandas Data Frame

The code you are using to calculate haversine distance receives one float in each argument, so indeed you need to pass floats for each argument. In this case iskeleler['lon'] and iskeleler['lat'] are Series.

This should work to calculate the distance between coordinates in the same row:

# assign a new column; .loc['density'] would try to set a row labelled 'density'
iskeleler['density'] = iskeleler.apply(lambda row: haversine(
    row['lon'], row['lat'],
    row['LONGITUDE'], row['LATITUDE']
), axis=1)

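But you are looking for pair-wise distances, which would otherwise need an inefficient for loop. Try sklearn.metrics.pairwise.haversine_distances. It expects [latitude, longitude] pairs in radians and returns distances on a unit sphere, so, assuming your columns hold decimal degrees, convert them first and multiply the result by the Earth's radius:

import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# 6371 km is the mean Earth radius, so the result is in kilometres
distance_matrix = 6371 * haversine_distances(
    np.radians(iskeleler[['lat', 'lon']]),
    np.radians(iskeleler[['LATITUDE', 'LONGITUDE']])
)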

If you prefer the table structure, then:

distance_table = pd.DataFrame(
    distance_matrix,
    index=pd.MultiIndex.from_frames(iskeleler[['lat', 'lon']]),
    columns=pd.MultiIndex.from_frames(iskeleler[['LATITUDE', 'LONGITUDE']]),
).stack([0, 1]).reset_index(name='distance')

This is an example, there are many ways to create the dataframe from the matrix.

Python: simplifying code by writing it in a more Pandas specific way

The main idea is to use shift to compare consecutive rows. I'm also writing a get_dist function that just wraps your existing distance function, to make things more readable when I use apply to compute distances.

def get_dist(row):
    lat1 = row['EQP_GPS_SPEC_LAT_CORD']
    long1 = row['EQP_GPS_SPEC_LONG_CORD']
    lat2 = row['EQP_GPS_SPEC_LAT_CORD_2']
    long2 = row['EQP_GPS_SPEC_LONG_CORD_2']
    return haversine(lat1, long1, lat2, long2)

# Find consecutive rows with matching ser_no, and get coordinates.
coord_cols = ['EQP_GPS_SPEC_LAT_CORD', 'EQP_GPS_SPEC_LONG_CORD']
matching_ser = mttt_pings['ser_no'] == mttt_pings['ser_no'].shift(1)
shift_coords = mttt_pings.shift(1).loc[matching_ser, coord_cols]

# Join shifted coordinates and compute distances.
mttt_pings_shift = mttt_pings.join(shift_coords, how='inner', rsuffix='_2')
mttt_pings['hdist'] = mttt_pings_shift.apply(get_dist, axis=1)

In the above code, I've added the distances to your dataframe. If you want to get the result as a numpy array, you can do:

hdist = mttt_pings['hdist'].values

As a side note, you may want to consider using an ellipsoidal distance from geopy to compute distances between lat/long coordinates. This answer originally suggested geopy.distance.vincenty, but that function was removed in geopy 2.0; geopy.distance.geodesic is its replacement. In general, an ellipsoidal distance is more accurate than haversine, although it may take longer to compute. Very minor modifications to the get_dist function are required to use it.

from geopy.distance import geodesic

def get_dist(row):
    lat1 = row['EQP_GPS_SPEC_LAT_CORD']
    long1 = row['EQP_GPS_SPEC_LONG_CORD']
    lat2 = row['EQP_GPS_SPEC_LAT_CORD_2']
    long2 = row['EQP_GPS_SPEC_LONG_CORD_2']
    return geodesic((lat1, long1), (lat2, long2)).km

