# Saving money and time using Polars, Polars plugins, and open source data

Geocoding is the practice of taking in an address and assigning a latitude-longitude coordinate
to it. Doing so for millions of rows can be an expensive and slow process, as it
typically relies on paid API services. Learn how we saved a client time and
money by leveraging open source tools and datasets for their geocoding needs.

Our solution took their geocoding process from taking hours to taking minutes,
and from costing tens of thousands of dollars per year to just dozens.
## What are geocoding and reverse-geocoding?

Geocoding answers the question:

> Given the address "152 West Broncho Avenue, Texas, 15203", what's its latitude-longitude
> coordinate?

Reverse-geocoding answers the reverse:

> Given the coordinate (-30.543534, 129.14236), what address does it correspond to?

Both are useful in several applications:

- tracking deliveries;
- location tagging;
- point-of-interest recommendations.
Our client needed to geocode and reverse-geocode millions
of rows at a time. It was costing them a lot of money and time:

- Geocoding ~7,000,000 addresses: ~2-3 hours, $32,100 yearly subscription
- Reverse geocoding ~7,000,000 coordinates: 35 hours, $35,000 (this was so slow
  and expensive that they would not typically do it)
We devised a solution which could do about 70-80% of the work, and would take:

- Geocoding: ~2-3 minutes, cost <insert cost here>
- Reverse geocoding: ~7-8 minutes, cost <insert cost here>

We're here to share our findings, and to give an overview of how we did it.
## Open-source geocoding: single-node solution

There is a better way! Suppose we're starting with a batch of addresses
and need to geocode them. The gist of the solution we delivered (sketched in code
after the list) is:
- collect a lot of data from open source datasets (such as OpenAddresses); this
  forms what we'll refer to as our _lookup dataset_;
- join input addresses with our lookup dataset, based on:
  - address number
  - road
  - zip code (if available, else city)
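As a rough sketch, the core join might look like this in Polars (the file paths and column names here are illustrative, not the client's actual schema):

```python
import polars as pl

# Illustrative paths and column names - the real schemas differ.
inputs = pl.scan_parquet("input_addresses.parquet")       # address_number, road, zip_code, ...
lookup = pl.scan_parquet("lookup_openaddresses.parquet")  # address_number, road, zip_code, latitude, longitude

geocoded = (
    inputs
    .join(lookup, on=["address_number", "road", "zip_code"], how="left")
    .collect()
)
```

Rows which come back with null coordinates are the ones that still need one of the fallback strategies described below.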
This is conceptually simple, but we encountered several hurdles when implementing it.

### First hurdle: road names

Road names vary between providers. For example, "west broncho avenue" might also appear
as:

- w. broncho ave
- west broncho
- w. broncho avenue
- w. broncho
We use [libpostal](https://github.com/openvenues/libpostal)'s `expand_address` function,
as well as some hand-crafted logic, to generate multiple variants of each address (in both the input
and the lookup dataset), thus increasing the chances of finding matches.
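For instance, using libpostal's Python bindings (the `postal` package from pypostal), `expand_address` produces normalised variants of a string; the exact output depends on the libpostal version and data files installed:

```python
# Requires the libpostal C library plus its Python bindings (pypostal).
from postal.expand import expand_address

# Generate normalised variants of an address string, to use on both sides of the join.
for variant in expand_address("152 W. Broncho Ave"):
    print(variant)
# e.g. "152 west broncho avenue", "152 west broncho ave", ...
```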
### Second hurdle: some addresses in the lookup don't have a zip code, and possibly no city either
Some of the OpenAddresses data contained all the information we needed, except zip code.
In some cases, by leveraging other freely available data on zip code boundaries, as well as
GeoPandas' spatial joins, we could assign a zip code to that data. However, that was not always
sufficient - some rows remained zip-code-less.
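A sketch of that spatial join with GeoPandas might look like this (file and column names are illustrative):

```python
import geopandas as gpd

# Illustrative inputs: address points missing a zip code, plus freely available
# zip code boundary polygons (with a "zip_code" column).
addresses = gpd.read_file("addresses_without_zip.geojson")
zip_boundaries = gpd.read_file("zip_code_boundaries.geojson")

# Attach the zip code of whichever boundary polygon each address point falls within.
addresses_with_zip = gpd.sjoin(
    addresses,
    zip_boundaries[["zip_code", "geometry"]],
    how="left",
    predicate="within",
)
```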
For zip-code-less rows, we would do the following:

- if the lookup address has a city, then join with the input addresses based on
  <address number, road, city>;
- else, use the [polars-reverse-geocode](https://github.com/MarcoGorelli/polars-reverse-geocode)
  Polars plugin (which we developed specially for the client, who kindly allowed us to open-source it)
  to find the closest city to the coordinates in the lookup file, and then join with the input
  addresses based on that.
The second option above doesn't necessarily provide an exact match, but was deemed good enough
because it's only used as a second fallback option for addresses which weren't matched in the first
two rounds.
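A rough sketch of that fallback, assuming the plugin exposes a `reverse_geocode(lat, lon)` expression as in its README (column names are again illustrative):

```python
import polars as pl
from polars_reverse_geocode import reverse_geocode  # assumed API: expression returning the closest city

lookup = pl.scan_parquet("lookup_without_zip.parquet")
inputs = pl.scan_parquet("unmatched_input_addresses.parquet")

# Where a lookup row has no city, fill one in from the closest city to its coordinates,
# then join with the remaining unmatched inputs on <address number, road, city>.
lookup_with_city = lookup.with_columns(
    city=pl.coalesce(pl.col("city"), reverse_geocode("latitude", "longitude"))
)
fallback_matches = (
    inputs
    .join(lookup_with_city, on=["address_number", "road", "city"], how="inner")
    .collect()
)
```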
### Third hurdle: going out-of-memory

The amount of data we collected was several gigabytes in size - much more than what our single-node
16GB RAM machine could handle, which is why our client was previously using a cluster to process
it. However, we found the cluster to be unnecessary, because Polars' lazy execution made it easy
to avoid loading all the data at once. All we needed to do was the following (sketched in code below):
1. express our business logic;
2. use `.collect` when we want to materialise our results;
3. let Polars figure out which rows and columns it needs to read from the input, and only read in those.
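In code, that looks roughly like this (paths and column names are illustrative). Nothing is read from disk until `.collect` is called, and Polars' query optimiser pushes the column selection down into the Parquet scans:

```python
import polars as pl

# Lazily scan the multi-gigabyte lookup dataset - nothing is loaded yet.
lookup = pl.scan_parquet("lookup/*.parquet")
inputs = pl.scan_parquet("input_addresses.parquet")

result = (
    inputs
    # 1. express our business logic
    .join(lookup, on=["address_number", "road", "zip_code"], how="left")
    .select("address_number", "road", "zip_code", "latitude", "longitude")
    # 2. materialise the result; 3. Polars only reads the rows and columns it actually needs
    .collect()
)
```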
The overall impact was enormous: the geocoding process went from taking hours to just 2-3 minutes.
We weren't typically able to geocode _all_ the input data using our open source solution, but we could
get far enough that it represented a significant cost saving, and the client could then complete the job
with paid API services.
## Open-source reverse-geocoding: AWS Lambda is all you need?

Thus far, we've talked about geocoding. What about the reverse process, reverse-geocoding?
This is where the success story becomes even bigger: not only did our solution run on a single
node, it could run on AWS Lambda, where memory, time, and package size are all constrained.
In order to describe our solution, we need to introduce the concept of geohashing. Geohashing
involves taking a coordinate and assigning an alphanumeric string to it. A geohash identifies
a region in space - the more characters you consider in the geohash, the smaller the area. For example,
the geohash 3fs covers a region over a hundred kilometers across, whereas 3fs94kfsj narrows that
down to just a few meters. Given a latitude and longitude coordinate, the geohash
is very cheap to compute, and so it gives us an easy way to filter which data we need to read.
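To make the idea concrete, here is a minimal, self-contained geohash encoder in Python - purely illustrative, since in production we computed geohashes with a Polars plugin (see below):

```python
# Minimal geohash encoder, for illustration only.
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat: float, lon: float, precision: int = 9) -> str:
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, current, use_lon = [], 0, 0, True
    while len(chars) < precision:
        if use_lon:  # alternate between longitude and latitude bisections
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                current = (current << 1) | 1
                lon_lo = mid
            else:
                current <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                current = (current << 1) | 1
                lat_lo = mid
            else:
                current <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits becomes one base-32 character
            chars.append(_BASE32[current])
            bits, current = 0, 0
    return "".join(chars)

print(geohash_encode(40.7128, -74.0060, precision=5))  # "dr5re" - around New York City
```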
Here's a simplified sketch of the solution we delivered:
1. Start an AWS Lambda function `spawn-reverse-geocoder`.
   Read in the given coordinates, and compute the unique geohashes present in the dataset.
   Split the unique geohashes into batches of 10 geohashes each.
2. For each batch of 10 geohashes, start another AWS Lambda function (`execute-reverse-geocoder`)
   which takes all the data from our lookup dataset whose geohash matches any of the given geohashes,
   and does a cross join. For each unique input coordinate, we only keep the row with the smallest
   haversine distance between the input coordinate and the lookup address. Write the result
   to a temporary Parquet file (see the sketch after this list).
3. Once all the `execute-reverse-geocoder` jobs have finished, concatenate all the temporary Parquet
   files which they wrote into a single output file.
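Here's a rough sketch of the per-batch step 2 in Polars. In the delivered solution the geohash and distance computations came from the plugins described below; here we hand-roll the haversine formula from plain Polars expressions just to keep the sketch self-contained (column names are illustrative):

```python
import polars as pl

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: pl.Expr, lon1: pl.Expr, lat2: pl.Expr, lon2: pl.Expr) -> pl.Expr:
    # Great-circle distance between two coordinates, expressed as a Polars expression.
    dlat = (lat2 - lat1).radians()
    dlon = (lon2 - lon1).radians()
    a = (dlat / 2).sin() ** 2 + lat1.radians().cos() * lat2.radians().cos() * (dlon / 2).sin() ** 2
    return 2 * EARTH_RADIUS_KM * a.sqrt().arcsin()

def reverse_geocode_batch(geohashes: list[str]) -> None:
    coords = pl.scan_parquet("input_coordinates.parquet")  # lat, lon, geohash
    lookup = pl.scan_parquet("lookup/*.parquet")           # lookup_lat, lookup_lon, geohash, address

    (
        coords
        .filter(pl.col("geohash").is_in(geohashes))
        # Cross join against the slice of the lookup dataset sharing those geohashes ...
        .join(lookup.filter(pl.col("geohash").is_in(geohashes)), how="cross")
        .with_columns(
            dist=haversine_km(pl.col("lat"), pl.col("lon"), pl.col("lookup_lat"), pl.col("lookup_lon"))
        )
        # ... and keep, per input coordinate, the closest lookup address.
        .filter(pl.col("dist") == pl.col("dist").min().over("lat", "lon"))
        .collect()
        .write_parquet(f"tmp/batch-{geohashes[0]}.parquet")
    )
```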
This solution is easy to describe - the only issue is that no common dataframe library has built-in
functionality for computing geohashes, nor for computing distances between pairs of coordinates.
This is where one of Polars' killer features (extensibility) came into play: if Polars doesn't implement
a function you need, you can always make a plugin that does it for you. In this case, we used several
plugins:

- `polars-hash`, for computing geohashes;
- `polars-distance`, for computing the distance between pairs of coordinates;
- `polars-reverse-geocode`, for finding the closest state to a given coordinate.
All in all, our environment needed to contain:

- Polars;
- 3 Polars plugins;
- s3fs, boto3, and fsspec for reading and writing cloud data.
Not only did it all fit comfortably within AWS Lambda's 250MB package size limit, but execution was
also fast enough that we could reverse-geocode millions of coordinates from across the United States in
less than 10 minutes, staying within the 10GB memory limit.
That's the power of lazy execution and Rust. If you too would like custom Rust and/or Python
solutions for your use case, which can be easily and cheaply deployed, please contact
Quansight Consulting.
## What we did for Datum, and what we can do for you

Would you like customised solutions to your business needs, based on open source tools,
delivered by open source experts? We helped Datum save time and money, and we could do the
same for you! Please contact Quansight today.