Data Transformation and Analysis: Geospatial

Geocoding Addresses Using the OpenStreetMap API

Use the OpenStreetMap API to batch process large lists of addresses into point coordinates.

Ideally, the locations of people’s homes, community centers, national society branches, hospitals, or any other specific point would have coordinates that facilitate mapping, but the reality is that many datasets store locations as plain text strings. If you plug those addresses into a search engine or mapping platform, you can find the coordinates, but what if you need to do this at scale?

Scenario

The Emergency Social Safety Net (ESSN) program provides monthly cash assistance via debit cards to the most vulnerable refugees in Türkiye. The beneficiaries of this program need to be geolocated as part of the data management process, and the program needed a way to geocode them quickly, at scale, using low-cost methods.

This guide walks through the process shared with the SIMS Network by Pinar Öner, Data Scientist with the IFRC Türkiye delegation, covering how she solved the problem of first geocoding to locate addresses, and then reverse geocoding to validate and visualize the data.

Pre-Processing and Geocoding with OpenStreetMap

Install Dependencies

The Python scripts in this walkthrough require several libraries to be installed locally, which can be done with a simple pip install from the command line. As mentioned in the Google Maps version of this walkthrough, you should use a virtual environment when installing libraries, but covering that process is outside the scope here. There are a number of useful guides available with a quick search.

To install what we need, run: pip install pandas numpy python-dateutil openpyxl geopy IPython. You may have one or more of these installed already, in which case pip will simply skip them.
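
If you want to confirm that everything installed correctly before moving on, a quick sanity check like the one below (a minimal sketch, assuming you run it in the same environment you installed into) will print the version of each library:

# Quick sanity check: confirm the libraries import and print their versions
import pandas, numpy, dateutil, openpyxl, geopy, IPython

for lib in (pandas, numpy, dateutil, openpyxl, geopy, IPython):
    print(f"{lib.__name__}: {lib.__version__}")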

Build the Geocoding Script

Start by opening your code editor and creating a new Python file called geocoder.py. Save this file in a dedicated folder on your computer. Again, this is where you would also create a virtual environment with venv.

Import Dependencies

At the top of geocoder.py, let’s import our libraries:

import os, math, time
import pandas as pd
import numpy as np
from dateutil.relativedelta import relativedelta
from datetime import datetime
import openpyxl as pxl
from random import sample

from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
from geopy.exc import GeocoderTimedOut
from IPython.display import Image, display

import warnings
warnings.filterwarnings("ignore")

Load the Data

The dataset we’re using for this walkthrough can be downloaded here—put it inside the same folder where your geocoder.py script is saved. The rest of the walkthrough will reference the table in that Excel file, which has these columns:

  • IL
  • ILCEADI
  • NEIGHBORHOOD
  • IL_EN
  • ILCEADI_EN
  • NEIGHBORHOOD_EN
  • ADDRESS_TR
  • ADDRESS_EN

To load the file and make it accessible to the rest of the script, we’re going to import it and simultaneously convert it to a pandas DataFrame. DataFrames are special objects that can access methods from the pandas library, which makes it easier to work with data like this.

# Load the sample address data into a pandas DataFrame
folder = os.getcwd()
df_subset = pd.read_excel(folder + '/' + 'Sample_Address_Information_v2.xlsx')
len(df_subset)

Load and Merge the Common Operational Dataset

Common Operational Datasets (CODs) are authoritative reference datasets needed to support operations and decision-making for all actors in a humanitarian response. CODs are ‘best available’ datasets that ensure consistency and simplify the discovery and exchange of key data. For this walkthrough, we’re going to use the Türkiye COD—download this file and put it into the same folder with your Python script.

# Load common operational dataset
# Common operational dataset is downloaded from here: https://data.humdata.org/dataset/cod-ab-tur?
# This dataset includes administrative boundaries of Türkiye 
df_common = pd.read_excel(folder + '/common-operational-dataset.xlsx', sheet_name = 'ADM3')

# Merge address information and common operational dataset
df_merged = df_subset.merge(df_common, left_on = ['IL_EN', 'ILCEADI_EN', 'NEIGHBORHOOD_EN'], 
                    right_on = ['adm1_en', 'adm2_en', 'adm3_en'], how = 'left')

# Take the necessary columns
df_final = df_merged[['ADDRESS_TR', 'ADDRESS_EN', 'IL', 'IL_EN', 'ILCEADI', 'ILCEADI_EN', 
                    'NEIGHBORHOOD', 'NEIGHBORHOOD_EN', 'latitude', 'longitude', 'pcode']]

# Let's see the first 5 rows
df_final.head()
  • df_common: We use the read_excel method to load the file, specifying the sheet name we want to reference.
  • df_merged: The merge method does a left join, matching the three admin-name columns in our address table against the three admin levels from the COD file. Left joins keep every row from the first (left) table and pull in matching values from the right table wherever a match exists (see the short sketch after this list).
  • df_final: From the df_merged DataFrame created above, we now specify which columns we want. Since we just merged these two DataFrames (what would be a “JOIN” in SQL), we now have access to the additional columns you see there from the COD file.
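
To make the left-join behavior concrete, here is a minimal sketch with made-up neighborhood names (not from the actual datasets): rows from the left table are always kept, and rows with no match in the right table simply end up with NaN in the joined columns.

# Minimal left-join illustration with made-up values (not from the real datasets)
left = pd.DataFrame({'NEIGHBORHOOD_EN': ['A', 'B', 'C']})
right = pd.DataFrame({'adm3_en': ['A', 'C'], 'latitude': [39.9, 41.0]})

demo = left.merge(right, left_on='NEIGHBORHOOD_EN', right_on='adm3_en', how='left')
print(demo)
# 'B' has no match in the right table, so its latitude is NaN but the row is kept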

Check the Merge

It’s always a good idea to check your results before proceeding, so we can ensure that the merge worked as expected.

print('Row level check:')
print('----------------')

no_rows = df_final['latitude'].notna().sum()
all_rows = len(df_final)
na_rows = df_final['latitude'].isna().sum()
print(f"Common operational dataset could find {no_rows:,} (Percentage: {no_rows/all_rows:.3f}) rows out of {all_rows:,} rows in the dataset")
print(f"Number of missing coordinates (rows): {na_rows} (Percentage: {na_rows/all_rows:.3f})")

These print statements will return some data to show us how many matches were made between the file with our addresses and the COD file.

Row level check:
----------------
Common operational dataset could find 376 (Percentage: 0.752) rows out of 500 rows in the dataset
Number of missing coordinates (rows): 124 (Percentage: 0.248)

In this case, it found about 75 percent of them. If we want to see what was missing from the results, we can use pandas’ helpful isnull() method:

df_final.isnull().sum()
ADDRESS_TR           0
ADDRESS_EN           0
IL                   0
IL_EN                0
ILCEADI              0
ILCEADI_EN           0
NEIGHBORHOOD         0
NEIGHBORHOOD_EN      0
latitude           124
longitude          124
pcode              124
dtype: int64

It looks like the missing values are the lat, long, and p-code from the COD.
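
Before handing the unmatched rows to a geocoder, it can be worth glancing at which province/district/neighborhood combinations failed to match the COD, since typos or alternate spellings are often the culprit. One quick way to do that with the columns we already have:

# Inspect which admin-name combinations did not match the COD
unmatched = df_final[df_final['latitude'].isna()]
print(unmatched[['IL_EN', 'ILCEADI_EN', 'NEIGHBORHOOD_EN']].drop_duplicates().head(10))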

Geocode the Addresses with OSM

Let’s make a copy of the DataFrame we created above, so that we preserve the original and can work on the copy:

df_prep = df_final.copy()

Next, we’ll geocode in several steps:

# %%time  (Jupyter cell magic for timing; leave it commented out or remove it when running geocoder.py as a plain script)

# 0 - Filter to the rows with missing coordinates (.copy() avoids modifying a view of df_prep)
df_to_geocode_na = df_prep[df_prep['latitude'].isna()].copy()

locator = Nominatim(user_agent= 'myGeoLocator') 

# 1 - convenient function to delay between geocoding calls
geocode = RateLimiter(locator.geocode, min_delay_seconds=2)

# 2 - create location column
df_to_geocode_na['location'] = df_to_geocode_na['ADDRESS_TR'].apply(geocode)

# 3 - create longitude and latitude from location column (returns tuple)
df_to_geocode_na['point'] = df_to_geocode_na['location'].apply(lambda loc: tuple(loc.point) if loc else None)

# 4 - split point column into separate latitude and longitude columns
# (these fill in the latitude/longitude values for the rows the COD could not match)
df_to_geocode_na['latitude'] = [value[0] if value is not None else None for value in df_to_geocode_na['point']]
df_to_geocode_na['longitude'] = [value[1] if value is not None else None for value in df_to_geocode_na['point']]

# 5 - Concat with geocoded subset 
df_geocoded = pd.concat([df_prep, df_to_geocode_na], ignore_index = True)
 
# 6 - Exclude NA values in order to avoid duplications 
df_geocoded = df_geocoded[df_geocoded['latitude'].notna()] 

# Check NA values
df_geocoded.isnull().sum()
  • df_to_geocode_na: First, we filter the DataFrame down to just the rows whose coordinates are still missing, since those are the addresses we need to geocode.
  • locator: We take advantage of the geopy library by creating an instance of the Nominatim service. The user_agent parameter is used to identify your application when making requests to the Nominatim service.
  • geocode: locator.geocode is the geocoding function of the instance we created in the bullet above. By wrapping it in RateLimiter, we can control how quickly requests are sent to the server, which helps avoid overwhelming a service with too many requests at once.
  • df_to_geocode_na['location']: Using the .apply() method allows us to essentially loop over the rows in df_to_geocode_na, which you’ll remember is the DataFrame of rows with missing coordinates. Here, we’re specifying the ADDRESS_TR column. The ['location'] tacked onto df_to_geocode_na at the start of the line creates a new column, and for each row we pass the address text to our geocoder and store the result.
  • df_to_geocode_na['point']: Next, we create another column, this time called point. This follows a similar pattern to the line above, running down the values in the location column we just created and applying a lambda function to each one. A lambda function lets us define a simple “anonymous” function inline. Here, loc is the value each call receives: if the geocoder returned a result for that row, we convert its location to a tuple of coordinates, otherwise we store None.
  • df_to_geocode_na['latitude'] and df_to_geocode_na['longitude']: We now have a column called point, where each row contains a tuple whose first two values are the latitude and longitude of the address (geopy’s Point also carries altitude as a third value). We need to split these into separate columns. These two lines use Python list comprehensions to assign the first ([0]) and second ([1]) values from the point column to two new columns called latitude and longitude.
  • df_geocoded: We concatenate df_prep (the copy of the original DataFrame we made in the previous section) with the newly geocoded rows from this section, stacking them into one DataFrame.
  • df_geocoded: The second assignment to this variable then drops any rows where location data is still missing, using the latitude column as a proxy. This also removes the original, un-geocoded copies of the rows we just processed, so no addresses end up duplicated.
  • df_geocoded.isnull().sum(): Finally, we count the missing values remaining in each column, which shows how many addresses OSM was unable to geocode.
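
Once the geocoding finishes, you will likely want to save the results so they can be mapped or passed to the reverse-geocoding validation step. Here is a minimal sketch, assuming you simply want the output written next to your script (the filenames are placeholders, not part of the original workflow):

# Save the geocoded results next to the script (filenames are placeholders)
df_geocoded.to_excel(folder + '/geocoded_addresses.xlsx', index=False)
df_geocoded.to_csv(folder + '/geocoded_addresses.csv', index=False)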
