How Uber finds similar trips in same direction
Summary of talk given at Spark Summit 2016
https://www.youtube.com/watch?v=Ha7_Vf2eZvQ
Problem definition
How find similar trips in same direction
GPS data frequency varies
Input is latitude, longitude, epoch
google s2 cell : have to vary level of granularity based on trip length
Steps in solution :
-
densify all points; add points so you have all pts atmost X distance apart
-
discretize : map points to equal-sized S2 cell
this convert reals -> longs -
remove contiguous duplicates
-
directionality : use shingling to combine cell with previous cell
(i.e) convert {1, 2, 3} to {102, 203, 304}
https://en.wikipedia.org/wiki/W-shingling -
solve set overlap problem - find common shingles
using Locality sensitive hash
minhash = min { h(x) for all x in set S}
where h(x) = ax + b Mod m
where m = number of hash bins
How to increase and control probability
use multiple minhash func with different parameters -
LSH on Spark
-
Approach 1
Construct RDD[Trip]
hash values are also shuffle keys
then do groupByKey
compute Jaccard distance to verify -
Approach 2
same pair of trips are matched on h1 and h2 buckets
Use one more shuffle to dedup
Network vs Distance Computation -
Approach 3
Dont send actual trip vector in LSH and Dedup shuffles
Send only trip ID
After dedup, join back trip objects with one more shuffle
Then compute jaccard distance of each pair of matched trips
Jaccard(A,B) = | A intersect B | / | A union B |For trips with thousands of shingles, this saves network bandwidth
They use secondary hash after minhash for banding
Some distance functions have good companions in hash functions
- for jacard distance, it is minhash func
- for hamming, it is i-th value of vector x
- for cosine, it is cosine of dot product of x and random vector
How to generate thousands of hash functions
- Cache friendly approach
generate only 2 hash func
h1(x) = (a1.x + b1) mod m1
h2(x) = (a2.x + b2) mod m2
and then generate all others as
h_k(x) = h1(x) + i.h2(x) for i = 1 to num of hash