Haversine Distance between consecutive rows for each Customer
I'll reuse the vectorized haversine_np function from derricw's answer:
import numpy as np

def haversine_np(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    All args must be of equal length.
    """
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
    c = 2 * np.arcsin(np.sqrt(a))
    km = 6367 * c
    return km
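def distance(x):
    # previous row within the same customer group; the first row has no predecessor
    y = x.shift()
    # haversine_np expects (lon1, lat1, lon2, lat2), so longitude goes first
    return haversine_np(x['Lon'], x['Lat'], y['Lon'], y['Lat']).fillna(0)

df['Distance'] = df.groupby('Customer').apply(distance).reset_index(level=0, drop=True)
Result:
  Customer  Lat  Lon    Distance
0        A    1    2    0.000000
1        A    1    2    0.000000
2        B    3    2    0.000000
3        B    4    2  111.125113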
Pandas: calculate haversine distance within each group of rows
Try this approach:
import pandas as pd
import numpy as np

# parse CSV to DataFrame. You may want to specify the separator (`sep='...'`)
df = pd.read_csv('/path/to/file.csv')

# vectorized haversine function
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    """
    Slightly modified version of http://stackoverflow.com/a/29546836/2901002

    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees or in radians)
    All (lat, lon) coordinates must have numeric dtypes and be of equal length.
    """
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])

    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2

    return earth_radius * 2 * np.arcsin(np.sqrt(a))
Now we can calculate distances between coordinates belonging to the same id (group):
df['dist'] = \
    np.concatenate(df.groupby('id')
                     .apply(lambda x: haversine(x['lat'], x['lon'],
                                                x['lat'].shift(), x['lon'].shift())).values)
Result:
In [105]: df
Out[105]:
id lat lon dist
0 1 19.111841 72.910729 NaN
1 1 19.111342 72.908387 0.252243
2 2 19.111542 72.907387 NaN
3 2 19.137815 72.914085 3.004976
4 2 19.119677 72.905081 2.227658
5 2 19.129677 72.905081 1.111949
6 3 19.319677 72.905081 NaN
7 3 19.120217 72.907121 22.179974
8 4 19.420217 72.807121 NaN
9 4 19.520217 73.307121 53.584504
10 5 19.319677 72.905081 NaN
11 5 19.419677 72.805081 15.286775
12 5 19.629677 72.705081 25.594890
13 5 19.111860 72.911347 61.509917
14 5 19.111860 72.931346 2.101215
15 5 19.219677 72.605081 36.304756
16 6 19.319677 72.805082 NaN
17 6 19.419677 72.905086 15.287063
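The same per-group result can probably be obtained without `apply` by shifting the coordinate columns inside each group and calling the vectorized function once. The following is a minimal sketch, not the answer's own method: the sample frame is a hypothetical copy of the first five rows above, and the haversine function is restated (without the `to_radians` switch) so the block runs on its own.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Great-circle distance in km; all inputs are decimal degrees."""
    lat1, lon1, lat2, lon2 = (np.radians(np.asarray(v, dtype=float))
                              for v in (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

# Hypothetical sample: the first five rows of the table above.
df = pd.DataFrame({
    'id':  [1, 1, 2, 2, 2],
    'lat': [19.111841, 19.111342, 19.111542, 19.137815, 19.119677],
    'lon': [72.910729, 72.908387, 72.907387, 72.914085, 72.905081],
})

# Shift within each group: the first row of every id gets NaN coordinates,
# so its distance is NaN as well.
prev = df.groupby('id')[['lat', 'lon']].shift()
df['dist'] = haversine(df['lat'], df['lon'], prev['lat'], prev['lon'])
```

Because `groupby(...).shift()` keeps the original index, this variant does not depend on the rows being sorted by id, which the `np.concatenate` approach implicitly assumes.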
Faster way to calculate distance from coordinates on DataFrame
Using distance() from GeoPy:
import pandas as pd
from geopy import distance

# cross join every city (df2) with every building (df1), then compute the distance row by row
(pd.merge(df2.rename(columns={'Latitude': 'C-Lat', 'Longitude': 'C-Lon'}), df1, how='cross')
   .assign(Distance=lambda r: r.apply(
       lambda x: distance.distance((x['C-Lat'], x['C-Lon']),
                                   (x['Latitude'], x['Longitude'])).miles, axis=1))
   .drop(columns=['C-Lat', 'C-Lon']))
City Building Latitude Longitude Distance
0 Santa Barbra, CA One World Trade Center 40.713005 -74.013190 2464.602573
1 Santa Barbra, CA Central Park Tower 40.765957 -73.980844 2466.054087
2 Santa Barbra, CA Willis Tower 41.878872 -87.635908 1759.257288
3 Santa Barbra, CA 111 West 57th Street 40.764760 -73.977581 2466.230348
4 Santa Barbra, CA One Vanderbilt 40.752971 -73.978541 2466.233832
5 Washington, D.C. One World Trade Center 40.713005 -74.013190 203.461336
6 Washington, D.C. Central Park Tower 40.765957 -73.980844 207.017141
7 Washington, D.C. Willis Tower 41.878872 -87.635908 595.065660
8 Washington, D.C. 111 West 57th Street 40.764760 -73.977581 207.103384
9 Washington, D.C. One Vanderbilt 40.752971 -73.978541 206.571970
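Since the row-wise `apply` calls GeoPy once per pair, a vectorized alternative may be worth trying when the cross join gets large. The sketch below swaps in a different technique: a haversine (spherical) distance computed with NumPy broadcasting, whereas `distance.distance` in GeoPy is a geodesic on an ellipsoid, so the numbers can differ by a fraction of a percent. The function name `haversine_matrix`, the radius constant, and the approximate Washington, D.C. coordinate are assumptions made for illustration; the building coordinates are taken from the table above.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch

def haversine_matrix(lat_a, lon_a, lat_b, lon_b, radius=EARTH_RADIUS_MILES):
    """Pairwise great-circle distances with shape (len(a), len(b))."""
    lat_a, lon_a, lat_b, lon_b = (np.radians(np.asarray(v, dtype=float))
                                  for v in (lat_a, lon_a, lat_b, lon_b))
    # A column vector against a row vector broadcasts to every (a, b) pair.
    dlat = lat_b[None, :] - lat_a[:, None]
    dlon = lon_b[None, :] - lon_a[:, None]
    a = (np.sin(dlat / 2.0) ** 2
         + np.cos(lat_a)[:, None] * np.cos(lat_b)[None, :] * np.sin(dlon / 2.0) ** 2)
    return radius * 2 * np.arcsin(np.sqrt(a))

# Approximate coordinate for Washington, D.C. (assumed, not from the source).
city_lat, city_lon = [38.9072], [-77.0369]
# One World Trade Center and Willis Tower, from the table above.
bldg_lat = [40.713005, 41.878872]
bldg_lon = [-74.013190, -87.635908]

miles = haversine_matrix(city_lat, city_lon, bldg_lat, bldg_lon)
```

The result is a cities-by-buildings matrix; `miles.ravel()` would give the values in the same row order as the cross join if a flat `Distance` column is needed.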
What is the fastest way to generate a matrix for distances between location with lat and lon?
First of all, do you need the haversine metric for the distance calculation, and which implementation are you using? A Euclidean metric would be faster to compute, but I assume you have good reasons for choosing haversine.
In that case it may be better to use a more efficient implementation of haversine (I do not know which one you use). Check e.g. this SO question.
I guess you are using pdist and squareform from scipy.spatial.distance. When you look at the implementation behind them (here), you will find that they use a for loop. In that case you could use a vectorized implementation instead (e.g. this one from the question linked above).
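import numpy as np
import pandas as pd
import itertools
from scipy.spatial.distance import pdist, squareform
from haversine import haversine # pip install haversine
# haversine_np is the vectorized function from the linked question (takes lon1, lat1, lon2, lat2)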
# original approach
place_coordinates = [(x, y) for x in range(10) for y in range(10)]
d = pdist(place_coordinates, metric=haversine)
# approach using combinations
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
d2 = [haversine(x, y) for (x, y) in place_coordinates_comb]
# just ensure that using combinations gives you the same results as using pdist
np.testing.assert_array_equal(d, d2)
# vectorized version (taken from the link above)
# 1) create the combinations and flatten each pair into one row: (lat1, lon1, lat2, lon2)
place_coordinates_comb = itertools.combinations(place_coordinates, 2)
place_coordinates_comb_flatten = [(*x, *y) for (x, y) in place_coordinates_comb]
# 2) split into columns; the haversine package treats each point as (lat, lon)
lat1, lon1, lat2, lon2 = np.array(place_coordinates_comb_flatten).T
# 3) vectorized computation; note that haversine_np takes (lon1, lat1, lon2, lat2)
d_vect = haversine_np(lon1, lat1, lon2, lat2)
# the result differs slightly from the haversine package (haversine_np uses an Earth radius of 6367 km),
# so compare with a tolerance; the vectorized implementation could of course be adjusted to match exactly
np.testing.assert_allclose(d, d_vect, rtol=1e-3)
Comparing the timings (absolute numbers will vary with the machine used):
%timeit pdist(place_coordinates, metric=haversine)
# 15.7 ms ± 364 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit haversine_np(lon1, lat1, lon2, lat2)
# 241 µs ± 7.07 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's a substantial difference (roughly 65x faster). When you have a really long array (how many coordinates are you using?), this can help you a lot.
Finally, you can combine it with your code:
place_correlation = pd.DataFrame(squareform(d_vect), index=place_coordinates, columns=place_coordinates)
An additional improvement could be to use another, cheaper metric (e.g. Euclidean) to quickly identify which distances are outside 10 km, and then calculate haversine only for the rest.
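One way to sketch such a two-stage check is shown below. It does not use a Euclidean metric; it swaps in a latitude-only lower bound, relying on the fact that two points on a sphere are never closer than their latitude gap times the radius, so pairs failing this cheap test can be discarded without being misclassified. The names `haversine_km` and `within_threshold` and the sample coordinates are hypothetical, and the radius is assumed to be the 6367 km used by haversine_np.

```python
import numpy as np

EARTH_RADIUS_KM = 6367  # assumed to match haversine_np

def haversine_km(lon1, lat1, lon2, lat2):
    """Vectorized great-circle distance in km; inputs in decimal degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
    a = (np.sin((lat2 - lat1) / 2.0) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2)
    return EARTH_RADIUS_KM * 2 * np.arcsin(np.sqrt(a))

def within_threshold(lon1, lat1, lon2, lat2, threshold_km=10.0):
    """Boolean mask of pairs whose haversine distance is <= threshold_km."""
    lon1, lat1, lon2, lat2 = (np.asarray(v, dtype=float)
                              for v in (lon1, lat1, lon2, lat2))
    # Cheap lower bound: the distance is at least the latitude gap times R.
    lower_bound_km = EARTH_RADIUS_KM * np.radians(np.abs(lat2 - lat1))
    candidates = lower_bound_km <= threshold_km
    result = np.zeros(lat1.shape, dtype=bool)
    # Full haversine only for pairs the cheap test could not rule out.
    result[candidates] = haversine_km(lon1[candidates], lat1[candidates],
                                      lon2[candidates], lat2[candidates]) <= threshold_km
    return result

# Three hypothetical pairs: a nearby one, one far apart in latitude,
# and one far apart in longitude only.
mask = within_threshold(lon1=[0.0, 0.0, 0.0], lat1=[0.0, 0.0, 0.0],
                        lon2=[0.05, 0.0, 1.0], lat2=[0.05, 1.0, 0.0])
```

How much this saves depends on the data: the prefilter only removes pairs that are far apart in latitude, so a set of points spread mostly east to west would still go through the full calculation.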