Vectorised Haversine formula with a pandas dataframe
I can't confirm if the calculations are correct but the following worked:
In [11]:
from numpy import cos, sin, arcsin, sqrt
from math import radians
def haversine(row):
    # Fixed reference point (degrees)
    lon1 = -56.7213600
    lat1 = 37.2175900
    lon2 = row['LON']
    lat2 = row['LAT']
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * arcsin(sqrt(a))
    km = 6367 * c  # Earth radius in km
    return km
df['distance'] = df.apply(lambda row: haversine(row), axis=1)
df
Out[11]:
SEAZ LAT LON distance
index
1 296.40 58.731221 28.377411 6275.791920
2 274.72 56.814832 31.292324 6509.727368
3 192.25 52.064988 35.801864 6990.144378
4 34.34 68.818875 67.193367 7357.221846
5 271.05 56.669988 31.688062 6538.047542
6 131.88 48.554622 49.782773 8036.968198
7 350.71 64.774272 31.395378 6229.733699
8 214.44 53.519292 33.845856 6801.670843
9 1.46 67.943374 38.484252 6418.754323
10 273.55 53.343731 4.471666 4935.394528
The following code is actually slower on such a small dataframe but I applied it to a 100,000 row df:
In [35]:
%%timeit
df['LAT_rad'], df['LON_rad'] = np.radians(df['LAT']), np.radians(df['LON'])
df['dLON'] = df['LON_rad'] - math.radians(-56.7213600)
df['dLAT'] = df['LAT_rad'] - math.radians(37.2175900)
df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin(df['dLAT']/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(df['LAT_rad']) * np.sin(df['dLON']/2)**2))
1 loops, best of 3: 17.2 ms per loop
Compared with the apply version, which took 4.3 s, this is roughly 250 times quicker, which is worth noting for the future.
If we compress all of the above into a one-liner:
In [39]:
%timeit df['distance'] = 6367 * 2 * np.arcsin(np.sqrt(np.sin((np.radians(df['LAT']) - math.radians(37.2175900))/2)**2 + math.cos(math.radians(37.2175900)) * np.cos(np.radians(df['LAT'])) * np.sin((np.radians(df['LON']) - math.radians(-56.7213600))/2)**2))
100 loops, best of 3: 12.6 ms per loop
We observe a further speed-up: this is now roughly 341 times quicker than the apply version.
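A minimal, self-contained sketch of the vectorised approach above, wrapped in a function. The name haversine_to_point and the three-row sample frame are assumptions for illustration; the reference point and the 6367 km radius are taken from the snippets above.

```python
import numpy as np
import pandas as pd

def haversine_to_point(lat, lon, ref_lat, ref_lon, radius_km=6367):
    """Distance in km from each (lat, lon) to one reference point; inputs in degrees."""
    lat, lon = np.radians(lat), np.radians(lon)
    ref_lat, ref_lon = np.radians(ref_lat), np.radians(ref_lon)
    a = (np.sin((lat - ref_lat) / 2) ** 2
         + np.cos(ref_lat) * np.cos(lat) * np.sin((lon - ref_lon) / 2) ** 2)
    return radius_km * 2 * np.arcsin(np.sqrt(a))

# Hypothetical sample: the first three rows of the frame shown above
sample = pd.DataFrame({'LAT': [58.731221, 56.814832, 52.064988],
                       'LON': [28.377411, 31.292324, 35.801864]})

# Whole columns go in at once, so no Python-level loop over rows is needed
sample['distance'] = haversine_to_point(sample['LAT'], sample['LON'],
                                        37.2175900, -56.7213600)
```

Because the arithmetic runs on whole columns, the cost per row is paid inside NumPy rather than in a Python function call, which is where the speed-up over apply comes from.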
How to use haversine distance using haversine library on pandas dataframe
You can use itertools.product to create all pairs, then use haversine to get results like the below:
import haversine as hs
import pandas as pd
import numpy as np
import itertools
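res = []
# Every (row of df1, row of df2) pair; each row is expected to be (lat, lon)
for a, b in itertools.product(df1.values, df2.values):
    res.append(hs.haversine(a, b))
# One row per df1 point, one column per df2 point
df = pd.DataFrame(np.asarray(res).reshape(len(df1), len(df2)))
print(df)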
Output:
0 1 2
0 587.500555 12.058061 29.557005
1 212.580742 365.487782 405.718803
2 46.333180 537.684789 578.072579
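A runnable sketch of the same itertools.product pattern. To keep it free of third-party geo packages, a small pure-Python function (haversine_km, a hypothetical name) stands in for the haversine library call, and df1 and df2 are made-up frames of (lat, lon) rows with deliberately different lengths.

```python
import itertools
import math

import numpy as np
import pandas as pd

def haversine_km(p1, p2, radius_km=6371.0):
    """Scalar haversine between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p1, *p2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))

# Hypothetical inputs: 2 points in df1, 3 points in df2
df1 = pd.DataFrame({'lat': [0.0, 10.0], 'lon': [0.0, 20.0]})
df2 = pd.DataFrame({'lat': [0.0, 5.0, 10.0], 'lon': [0.0, 5.0, 20.0]})

# product() varies df2 fastest, so the flat list is already in row-major order
res = [haversine_km(a, b) for a, b in itertools.product(df1.values, df2.values)]
matrix = pd.DataFrame(np.asarray(res).reshape(len(df1), len(df2)))
```

Reshaping to (len(df1), len(df2)) rather than to a square keeps the layout correct when the two frames hold different numbers of points.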
Vectorizing Haversine distance calculation in Python
You would provide your function as an argument to np.vectorize(), and could then use the result inside a groupby(...).apply call, as illustrated below:
haver_vec = np.vectorize(haversine, otypes=[np.int16])
distance = df.groupby('id').apply(lambda x: pd.Series(haver_vec(df.coordinates, x.coordinates)))
For instance, with sample data as follows:
length = 500
df = pd.DataFrame({'id':np.arange(length), 'coordinates':tuple(zip(np.random.uniform(-90, 90, length), np.random.uniform(-180, 180, length)))})
Compare for 500 points:
def haver_vect(data):
    distance = data.groupby('id').apply(lambda x: pd.Series(haver_vec(data.coordinates, x.coordinates)))
    return distance
%timeit haver_loop(df): 1 loops, best of 3: 35.5 s per loop
%timeit haver_vect(df): 1 loops, best of 3: 593 ms per loop
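A small sketch of what np.vectorize does with a scalar distance function. The function haversine_scalar, the constant EARTH_RADIUS_KM, and the random sample are assumptions for illustration; otypes is set to float here (rather than np.int16 as above) so that fractional kilometres are kept. Note that the NumPy documentation describes np.vectorize as a convenience wrapper that is essentially a for loop, so it does not give the speed of true array arithmetic.

```python
import math

import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversine_scalar(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in degrees (scalars only)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Wrap the scalar function so it accepts arrays and broadcasts over them
haver_vec = np.vectorize(haversine_scalar, otypes=[float])

rng = np.random.default_rng(0)
lats = rng.uniform(-90, 90, 5)
lons = rng.uniform(-180, 180, 5)

# A column of points against a row of the same points gives the 5x5 pairwise matrix
pairwise = haver_vec(lats[:, None], lons[:, None], lats[None, :], lons[None, :])
```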
How to call data from a dataframe into Haversine function
Try this solution:
def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
Demo:
In [17]: df
Out[17]:
lat lon
0 -6.081689 145.391881
1 -5.207083 145.788700
2 -5.826789 144.295861
3 -6.569828 146.726242
4 -9.443383 147.220050
In [18]: df['dist'] = \
    ...: haversine_np(df.lon.shift(), df.lat.shift(), df.loc[1:, 'lon'], df.loc[1:, 'lat'])
In [19]: df
Out[19]:
lat lon dist
0 -6.081689 145.391881 NaN
1 -5.207083 145.788700 106.638117
2 -5.826789 144.295861 178.907364
3 -6.569828 146.726242 280.904983
4 -9.443383 147.220050 323.913612
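A self-contained version of the demo, as a sketch that should run on current pandas. It passes whole columns instead of slices; index alignment leaves the first row as NaN either way, since it has no previous point. The data are the five rows shown above.

```python
import numpy as np
import pandas as pd

def haversine_np(lon1, lat1, lon2, lat2):
    """Great-circle distance in km; all inputs in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return 6367 * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    'lat': [-6.081689, -5.207083, -5.826789, -6.569828, -9.443383],
    'lon': [145.391881, 145.788700, 144.295861, 146.726242, 147.220050],
})

# shift() moves each value down one row, so row i is paired with row i - 1
df['dist'] = haversine_np(df['lon'].shift(), df['lat'].shift(),
                          df['lon'], df['lat'])
```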
Pandas: calculate haversine distance within each group of rows
Try this approach:
import pandas as pd
import numpy as np
# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))
Now we can calculate distances between coordinates belonging to the same id (group):
df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift())).values)
Result:
In [105]: df
Out[105]:
id lat lon dist
0 1 19.111841 72.910729 NaN
1 1 19.111342 72.908387 0.252243
2 2 19.111542 72.907387 NaN
3 2 19.137815 72.914085 3.004976
4 2 19.119677 72.905081 2.227658
5 2 19.129677 72.905081 1.111949
6 3 19.319677 72.905081 NaN
7 3 19.120217 72.907121 22.179974
8 4 19.420217 72.807121 NaN
9 4 19.520217 73.307121 53.584504
10 5 19.319677 72.905081 NaN
11 5 19.419677 72.805081 15.286775
12 5 19.629677 72.705081 25.594890
13 5 19.111860 72.911347 61.509917
14 5 19.111860 72.931346 2.101215
15 5 19.219677 72.605081 36.304756
16 6 19.319677 72.805082 NaN
17 6 19.419677 72.905086 15.287063
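An alternative sketch of the same per-group calculation that shifts inside each group first and then calls the distance function once on whole columns. This is a variation on the answer above, not its exact method; the five-row frame is the start of the result table shown above.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Great-circle distance in km; inputs in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

df = pd.DataFrame({
    'id':  [1, 1, 2, 2, 2],
    'lat': [19.111841, 19.111342, 19.111542, 19.137815, 19.119677],
    'lon': [72.910729, 72.908387, 72.907387, 72.914085, 72.905081],
})

# Previous point within the same id; the first row of each group becomes NaN
prev = df.groupby('id')[['lat', 'lon']].shift()
df['dist'] = haversine(df['lat'], df['lon'], prev['lat'], prev['lon'])
```

Shifting per group keeps the original row index, so the result can be assigned directly without concatenating per-group arrays.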
Haversine Function using Pandas Data Frame
The code you are using to calculate the haversine distance expects a single float for each argument, so you do need to pass floats. In this case iskeleler['lon'] and iskeleler['lat'] are Series.
This should work to calculate the distance between coordinates in the same row:
# Assign a new column ('density'); .loc['density'] would address a row label instead
iskeleler['density'] = iskeleler.apply(lambda row: haversine(
    row['lon'], row['lat'],
    row['LONGITUDE'], row['LATITUDE']
), axis=1)
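But you are looking for pair-wise distances, which would require a for loop, and that is not efficient. Try sklearn.metrics.pairwise.haversine_distances. It expects [latitude, longitude] pairs in radians and returns distances on the unit sphere, so convert the inputs and scale the result by the Earth's radius:
import numpy as np
from sklearn.metrics.pairwise import haversine_distances
# 6371 is the approximate Earth radius in km
distance_matrix = 6371 * haversine_distances(
    np.radians(iskeleler[['lat', 'lon']]),
    np.radians(iskeleler[['LATITUDE', 'LONGITUDE']])
)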
If you prefer the table structure, then:
distance_table = pd.DataFrame(
    distance_matrix,
    index=pd.MultiIndex.from_frames(iskeleler[['lat', 'lon']]),
    columns=pd.MultiIndex.from_frames(iskeleler[['LATITUDE', 'LONGITUDE']]),
).stack([0, 1]).reset_index(name='distance')
This is an example, there are many ways to create the dataframe from the matrix.
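For readers without scikit-learn installed, a pairwise matrix of the same shape can be sketched with plain NumPy broadcasting. The function name pairwise_haversine_km and the coordinates are hypothetical; inputs are [lat, lon] in degrees and the result is in km, assuming a 6371 km Earth radius.

```python
import numpy as np

def pairwise_haversine_km(points_a, points_b, radius_km=6371.0):
    """Distance matrix in km; points_* are (n, 2) array-likes of [lat, lon] in degrees."""
    a_rad = np.radians(np.asarray(points_a, dtype=float))
    b_rad = np.radians(np.asarray(points_b, dtype=float))
    lat1, lon1 = a_rad[:, 0][:, None], a_rad[:, 1][:, None]  # column vectors
    lat2, lon2 = b_rad[:, 0][None, :], b_rad[:, 1][None, :]  # row vectors
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(h))

# Hypothetical coordinates: 2 points against 3 points
piers = [[41.0, 29.0], [41.1, 29.1]]
stops = [[41.0, 29.0], [41.2, 28.9], [40.9, 29.3]]
distance_matrix = pairwise_haversine_km(piers, stops)  # shape (2, 3)
```

Entry [i, j] holds the distance from the i-th point of the first set to the j-th point of the second, which is the same layout the DataFrame construction above expects.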
Python: simplifying code by writing it in a more Pandas specific way
The main idea is to use shift to compare consecutive rows. I'm also writing a get_dist function that just wraps your existing distance function, to make things more readable when I use apply to compute distances.
def get_dist(row):
    lat1 = row['EQP_GPS_SPEC_LAT_CORD']
    long1 = row['EQP_GPS_SPEC_LONG_CORD']
    lat2 = row['EQP_GPS_SPEC_LAT_CORD_2']
    long2 = row['EQP_GPS_SPEC_LONG_CORD_2']
    return haversine(lat1, long1, lat2, long2)
# Find consecutive rows with matching ser_no, and get coordinates.
coord_cols = ['EQP_GPS_SPEC_LAT_CORD', 'EQP_GPS_SPEC_LONG_CORD']
matching_ser = mttt_pings['ser_no'] == mttt_pings['ser_no'].shift(1)
shift_coords = mttt_pings.shift(1).loc[matching_ser, coord_cols]
# Join shifted coordinates and compute distances.
mttt_pings_shift = mttt_pings.join(shift_coords, how='inner', rsuffix='_2')
mttt_pings['hdist'] = mttt_pings_shift.apply(get_dist, axis=1)
In the above code, I've added the distances to your dataframe. If you want to get the result as a numpy array, you can do:
hdist = mttt_pings['hdist'].values
As a side note, you may want to consider using geopy.distance.vincenty to compute distances between lat/long coordinates. In general, vincenty is more accurate than haversine, although it may take longer to compute. Note that vincenty was removed in geopy 2.0; geopy.distance.geodesic is its replacement and is used below. Very minor modifications to the get_dist function are required:
from geopy.distance import geodesic
def get_dist(row):
    lat1 = row['EQP_GPS_SPEC_LAT_CORD']
    long1 = row['EQP_GPS_SPEC_LONG_CORD']
    lat2 = row['EQP_GPS_SPEC_LAT_CORD_2']
    long2 = row['EQP_GPS_SPEC_LONG_CORD_2']
    # On geopy < 2.0, vincenty((lat1, long1), (lat2, long2)).km works the same way
    return geodesic((lat1, long1), (lat2, long2)).km