Address Standardization Within a Database

Address standardization within a database

This is a tricky business ... equal parts Black Magic and Science. You will be amazed at the variations of Boulevard alone.

This is why I use the Google API. It can be time consuming, for the initial data-set, but only new adds would need to be resolved.

For example

https://maps.googleapis.com/maps/api/geocode/json?address=500 S Main St,Providence RI 02903

returns, in part

"formatted_address" : "500 S Main St, Providence, RI 02903, USA"

and the GOOD News is

https://maps.googleapis.com/maps/api/geocode/json?address=500 South Main Steet,Providence RI 02903

returns the same formatted address as the previous query

"formatted_address" : "500 S Main St, Providence, RI 02903, USA"

VBA Example:

Upon executing the following code ...

' VBA project Reference required:
' Microsoft XML, v3.0

Dim httpReq As New MSXML2.ServerXMLHTTP
httpReq.Open "GET", "https://maps.googleapis.com/maps/api/geocode/json?address=500 South Main Steet,Providence RI 02903", False
httpReq.send
Dim response As String
response = httpReq.responseText

... the string variable response contains the following JSON data:

{
"results" : [
{
"address_components" : [
{
"long_name" : "500",
"short_name" : "500",
"types" : [ "street_number" ]
},
{
"long_name" : "South Main Street",
"short_name" : "S Main St",
"types" : [ "route" ]
},
{
"long_name" : "Fox Point",
"short_name" : "Fox Point",
"types" : [ "neighborhood", "political" ]
},
{
"long_name" : "Providence",
"short_name" : "Providence",
"types" : [ "locality", "political" ]
},
{
"long_name" : "Providence County",
"short_name" : "Providence County",
"types" : [ "administrative_area_level_2", "political" ]
},
{
"long_name" : "Rhode Island",
"short_name" : "RI",
"types" : [ "administrative_area_level_1", "political" ]
},
{
"long_name" : "United States",
"short_name" : "US",
"types" : [ "country", "political" ]
},
{
"long_name" : "02903",
"short_name" : "02903",
"types" : [ "postal_code" ]
},
{
"long_name" : "2915",
"short_name" : "2915",
"types" : [ "postal_code_suffix" ]
}
],
"formatted_address" : "500 S Main St, Providence, RI 02903, USA",
"geometry" : {
"bounds" : {
"northeast" : {
"lat" : 41.82055829999999,
"lng" : -71.4028137
},
"southwest" : {
"lat" : 41.8204014,
"lng" : -71.40319219999999
}
},
"location" : {
"lat" : 41.8204799,
"lng" : -71.40300289999999
},
"location_type" : "ROOFTOP",
"viewport" : {
"northeast" : {
"lat" : 41.8218288302915,
"lng" : -71.40165396970851
},
"southwest" : {
"lat" : 41.8191308697085,
"lng" : -71.40435193029151
}
}
},
"partial_match" : true,
"place_id" : "ChIJicPQAT9F5IkRfq2njkYqZtE",
"types" : [ "premise" ]
}
],
"status" : "OK"
}

Is there common street addresses database design for all addresses of the world?

It is possible to represent addresses from lots of different countries in a standard set of fields. The basic idea of a named access route (thoroughfare) which the named or numbered buildings are located on is fairly standard, except in China sometimes. Other near universal concepts include: naming the settlement (city/town/village), which can be generically referred to as a locality; naming the region and assigning an alphanumeric postcode. Note that postcodes, also known as zip codes, are purely numeric only in some countries. You will need lots of fields if you really want to be generic.

The Universal Postal Union (UPU) provides address data for lots of countries in a standard format. Note that the UPU format holds all addresses (down to the available field precision) for a whole country, it is therefore relational. If storing customer addresses, where only a small fraction of all possible addresses will be stored, its better to use a single table (or flat format) containing all fields and one address per row.

A reasonable format for storing addresses would be as follows:

  • Address Lines 1-4
  • Locality
  • Region
  • Postcode (or zipcode)
  • Country

Address lines 1-4 can hold components such as:

  • Building
  • Sub-Building
  • Premise number (house number)
  • Premise Range
  • Thoroughfare
  • Sub-Thoroughfare
  • Double-Dependent Locality
  • Sub-Locality

Frequently only 3 address lines are used, but this is often insufficient. It is of course possible to require more lines to represent all addresses in the official format, but commas can always be used as line separators, meaning the information can still be captured.

Usually analysis of the data would be performed by locality, region, postcode and country and these elements are fairly easy for users to understand when entering data. This is why these elements should be stored as separate fields. However, don't force users to supply postcode or region, they may not be used locally.

Locality can be unclear, particularly the distinction between map locality and postal-locality. The postal locality is the one deemed by a postal authority which may sometimes be a nearby large town. However, the postcode will usually resolve any problems or discrepancies there, to allow correct delivery even if the official post-locality is not used.

SQL Database Design Best Practice (Addresses)

You're on the right track by breaking address out into its own table. I'd add a couple of additional suggestions.

  1. Consider taking the Address FK columns out of the Customers/Orders tables and creating junction tables instead. In other words, treat Customers/Addresses and Orders/Addresses as many-to-many relationships in your design now so you can easily support multiple addresses in the future. Yes, this means introducing more tables and joins, but the flexibility you gain is well worth the effort.

  2. Consider creating lookup tables for city, state and country entities. The city/state/country columns of the address table then consist of FKs pointing to these lookup tables. This allows you to guarantee consistent spellings across all addresses and gives you a place to store additional metadata (e.g., city population) if needed in the future.

Best Practice / Standard for storing an Address in a SQL Database

For international addresses, refer to the Universal Postal Union's Postal Addressing Systems database.

For U.S. addresses, refer to USPS Publication 28 "Postal Addressing Standards".

The USPS wants the following unpunctuated address components concatenated on a single line:

  • house number
  • predirectional (N, SE, etc.)
  • street
  • suffix (AVE, BLVD, etc.)
  • postdirectional (SW, E, etc.)
  • unit (APT, STE, etc.)
  • apartment/suite number

E.g. 102 N MAIN ST SE APT B

If you keep the entire address line as a single field in your database, input and editing is easy, but searches can be more difficult (eg, in the case SOUTH EAST LANE is the street EAST as in S EAST LN or is it LANE as in SE LANE ST?).

If you keep the address parsed into separate fields, searches for components like street name or apartments become easier, but you have to append everything together for output, you need CASS software to parse correctly, and PO boxes, rural route addresses, and APO/FPO addresses have special parsings.

A physical location with multiple addresses at that location is either a multiunit building, in which case letters/numbers after units like APT and STE designate the address, or it's a Commercial Mail Receiving Agency (eg, UPS store) and a maildrop/private mailbox number is appended (like 100 MAIN ST STE B PMB 102), or it's a business with one USPS delivery point and mail is routed after USPS delivery (which usually requires a separate mailstop field which the company might need but the USPS won't want on the address line).

A contact with more than one physical address is usually a business or person with a street address and a PO box. Note that it's common for each address to have a different ZIP code.

It's quite typical that one business transaction might have a shipping address and a billing address (again, with different ZIP codes). The information I keep for EACH address is:

  • name prefix (DR, MS, etc)
  • first name and initial
  • last name
  • name suffix (III, PHD, etc)
  • mail stop
  • company name
  • address (one line only per Pub 28 for USA)
  • city
  • state/province
  • ZIP/postal code
  • country

I typically print mail stops somewhere between the person's name and company because the country contains the state/ZIP which contains the city which contains the address which contains the company which contains the mail stop which contains the person. I use CASS software to validate and standardize addresses when entered or edited.

SQL address data is messy, how to clean it up in a query?

The best solution is to use a CASS certified address standardization program or service that will format and validate the address. Beyond the USPS which has tools for this, there are many third-party programs or services which provide this functionality. Address parsing is far more complicated than you might imagine and thus trying whip up a few queries to do it will be fraught with peril.

Google's Geocoding is another place to look.. Apparently Google requires you display the results to use their Geocoding service. That leaves using dedicated address parsers like the USPS or a third-party program.

Data Model for addresses with varying number of address lines

Typically, for an address, the data is not normalized over the lines. So, an address table would just have fields like AddressLine1 and AddressLine2.

The bigger geography information (example: city, state, country, postal code) would be stored in separate fields in the address record.

The reason for this is quite practical. Addresses are typically printed, and there is a limited amount of printing space available. If there are four lines, for instance, you have the name, address line 1, address line 2, and city/state/country/postal code.

If you really needed to store an unlimited number of lines, you would do it with an AddressLines table. The AddressLines table would have fields, such as:

  • AddressId -- the address record it belongs to
  • LineNumber
  • LineContents

However, this seems like overkill.

Your bigger problem is standardizing addresses. Have you given that any thought? (You know: "101 6th Avenue", "101 Sixth Ave.", and "101 Avenue of the Americas" are all the same address in New York City.)

How do you perform address validation?

Here's a free and sort of "outside the box" way to do it. Not 100% perfect, but it should reject blatantly non-existent addresses.

Submit the entire address to Google's geocoding web service. This service attempts to return the exact coordinates of the location you feed it, i.e. latitude and longitude.

In my experience if the address is invalid you will get a result of 602 from the service. There's definitely a possibility of false positives or false negatives, but used in conjunction with other consistency checks it could be useful.

(Yahoo's geocoding web service, on the other hand, will return the coordinates of the center of the town if the town exists but the rest of the address is bogus. Potentially useful as long as you pay close attention to the "precision" field in the result).

Open Source Address Scrubber?

Since I work in the mailing business ...

A mailable address is not geo-coding. One allows the USPS to deliver mail to and the other tells you where on earth that point is. The USPS does not geo-code their mailable addresses. It's useful for marking areas/regions of people for targeting.

You're not buying a license to the software, you're buying the data. The post office has lots of rules especially if you're doing this commercially and trying to get a better rate than first class. See USPS Domestic Mail Manual for the complete list of rules. The USPS moves zips and households between zips all the time. The company (I work for) pays the USPS for its updated mailing list so we can keep our DBs updated. Weekly.

Back to your question. Do you want to change the data into a common format (street -> st) or are you looking for duplicates and want to only store real mailable addresses ?

for common format; you can break the address into pieces, clean up the white space and apply a dictionary of terms/translations. Then apply some sql to find the duplicates. Keep in mind households (1 main st) are different from persons (john doe, 1 main st).

for the mailable addresses, well some of you (the readers) won't like this answer, but you want information and that isn't free. Someone spends time or money to acquire and maintain these lists. So, find a business model to acquire funds for the list or go to someone who will do it for you. Data and mail management

Realistically, Semaphore is pretty cheap, just keep in mind that the address db will have to be updated quarterly and $19/quarter is pretty cheap.

Another Address Scrubbing product. SAP PostalSoft. I don't know what the data will cost though.



Related Topics



Leave a reply



Submit