Is There Common Street Addresses Database Design for All Addresses of the World

Is there common street addresses database design for all addresses of the world?

It is possible to represent addresses from lots of different countries in a standard set of fields. The basic idea of a named access route (thoroughfare) which the named or numbered buildings are located on is fairly standard, except in China sometimes. Other near universal concepts include: naming the settlement (city/town/village), which can be generically referred to as a locality; naming the region and assigning an alphanumeric postcode. Note that postcodes, also known as zip codes, are purely numeric only in some countries. You will need lots of fields if you really want to be generic.

The Universal Postal Union (UPU) provides address data for lots of countries in a standard format. Note that the UPU format holds all addresses (down to the available field precision) for a whole country, it is therefore relational. If storing customer addresses, where only a small fraction of all possible addresses will be stored, its better to use a single table (or flat format) containing all fields and one address per row.

A reasonable format for storing addresses would be as follows:

  • Address Lines 1-4
  • Locality
  • Region
  • Postcode (or zipcode)
  • Country

Address lines 1-4 can hold components such as:

  • Building
  • Sub-Building
  • Premise number (house number)
  • Premise Range
  • Thoroughfare
  • Sub-Thoroughfare
  • Double-Dependent Locality
  • Sub-Locality

Frequently only 3 address lines are used, but this is often insufficient. It is of course possible to require more lines to represent all addresses in the official format, but commas can always be used as line separators, meaning the information can still be captured.

Usually analysis of the data would be performed by locality, region, postcode and country and these elements are fairly easy for users to understand when entering data. This is why these elements should be stored as separate fields. However, don't force users to supply postcode or region, they may not be used locally.

Locality can be unclear, particularly the distinction between map locality and postal-locality. The postal locality is the one deemed by a postal authority which may sometimes be a nearby large town. However, the postcode will usually resolve any problems or discrepancies there, to allow correct delivery even if the official post-locality is not used.

How should international geographical addresses be stored in a relational database?

I will summarize my thoughts from my blog post - A lesson in address storage (on archive.org).

On my current project [I work for a logistics company] we're storing international addresses. I've done research on addresses all over the world in the design of this portion of the database. There's a lot of different formats. In the Western world we tend to use a fairly uniform format - a few differences but they're mostly:

  • Street Number - Numeric
  • House or Building Name - [VarChar - in the UK some houses/buildings are identified by name, not by number]
  • Street Number Suffix [VarChar, although in most cases, Char(1) would suffice]
    • A, B etc
  • Street Name [VarChar]
  • Street Type [VarChar or Int if you have a StreetTypes table]
    • So far, I've found 262 unique types in the English speaking world, there are likely more, and don't forget other languages i.e. Strasse, Rue etc.
  • Street Direction [VarChar(2)]
    • N, E, S, W, NE, SE, NW, SW
  • Address Type [VarChar or Int if you have an AddressTypes table]
    • PO Box
    • Apartment
    • Building
    • Floor
    • Office
    • Suite
    • etc...
  • Address Type Identifier [VarChar]
    • i.e. Box Number, Apartment Number, Floor Number remember apartment numbers and offices sometimes have alphanumeric info - like 1A
  • Local Municipality [VarChar or Int if you have a Municipalities table]
    • For instance, if your hamlet/village appears in the address before the town.
  • City/Town [VarChar or Int if you have a Cities table]
  • Governing District [VarChar or Int if you have a Districts table]
    • State (U.S.)
    • Province (Canada)
    • Federal District (Mexico)
    • County (U.K.)
    • etc...
  • Postal Area [VarChar]
    • Zip (U.S.)
    • Postal Code (Canada, Mexico)
    • Postcode (U.K.)
  • Country [VarChar or Int if you have a Countries table]

This appears to cover most countries but the ordering of the fields may be displayed differently. You can find a list of display formats at http://www.bitboost.com/ref/international-address-formats.html#Formats

For instance, in many countries, the postal code falls before the city name and the street number falls after the street name. In Canada, U.S. and the U.K. the street number precedes the street name and the postal code (or ZIP) comes after the city name.

In answer to your question about separation of the addresses into different countries, I wouldn't suggest it, it will just make life harder in other areas - for instance reporting. The format I've provided covers all the addresses in our logistics database which covers USA, Canada, Mexico and the UK without any problems. It also covers all of our European, Chinese, Japanese and Malaysian addresses. I can't speak for other countries but I haven't yet had to store an address from a country that these fields won't support.

I don't suggest going with the Address1, Address2, Address3 format suggested by others and seen in many databases because parsing address information out of an alphanumeric string isn't as simple as it might first seem - especially if data isn't entered correctly, due to misinformation, typo, misspelling etc. If you separate your fields you can use distance algorithms to check for likely meaning, use probability to check street name against postal code and street number or to check province and city against street name etc. Try doing any of that when you've got a string denoting your whole street address. It's not a trivial matter by any stretch of the imagination.

QA on an address database is a headache, period. The easiest way to simplify your life in this area is to make sure all the fields hold only a single piece of information that can be automatically verified as correct at entry time. Probability, distance algorithms and regular expressions can check for validity of entry and provide feedback to the user as to what their mistake was and suggest suitable corrections.

One caveat to be aware of is roads with names that are also street types - if you're covering Canada you need to be aware of "Avenue Road" in Toronto which will trip you up big time if you're using the Address1, 2, 3 format. This likely occurs in other places too, although I'm not aware of them - this single instance was enough for me to scream WTF?!

SQL Database Design Best Practice (Addresses)

You're on the right track by breaking address out into its own table. I'd add a couple of additional suggestions.

  1. Consider taking the Address FK columns out of the Customers/Orders tables and creating junction tables instead. In other words, treat Customers/Addresses and Orders/Addresses as many-to-many relationships in your design now so you can easily support multiple addresses in the future. Yes, this means introducing more tables and joins, but the flexibility you gain is well worth the effort.

  2. Consider creating lookup tables for city, state and country entities. The city/state/country columns of the address table then consist of FKs pointing to these lookup tables. This allows you to guarantee consistent spellings across all addresses and gives you a place to store additional metadata (e.g., city population) if needed in the future.

How to best represent addresses in a database

A DBA where I used to work told me this gem, and it's worked great for us (the first two steps are the same as in your solution):

  1. Remove AddressLine1, AddressLine2, City, StateOrProvince, PostalCode from Customers, Contacts, Facilities and Clients.
  2. Create AddressTypes table with fields AddressTypeID, AddressTypeName, AddressTypeDescription, AddressTypeActive, LastUpdateUser, LastUpdateTime
  3. Create Addresses table with fields AddressID(PK), AddressTypeID(FK), AddressLine1, AddressLine2, City, StateOrProvince, PostalCode , LastUpdateUser, LastUpdateTime, CustomerID(FK), ClientID(FK), ContactID(FK), FacilityID(FK)
  4. On the addresses table, set up a constraint so that only one of the CustomerID, ClientID, ContactID, or FacilityID foreign keys may be non-NULL at a time.

This way you've got all your addresses in one table, they can reference any record you need, your referential integrity is intact, and you don't have in intermediate table that you have to traverse.

The downside is that if you want to add addresses to a new class of object (e.g. an Employee table), you have to add a EmployeeID column to the Addresses table, but that's pretty easy.

ISO Standard Street Addresses?

No; each country defines its own standard.

There have been a number of questions about this in times past, including:

  • Best practices for storing postal addresses in an RDBMS
  • Is there a common street address database design for all addresses of the world

The second of those itself has references to a number of other SO questions.

You might want to check out the grandiosely named Universal Postal Union and its standards.

Country, city, state, Post code table design

The following tables could be used for addresses in North America. Counties (which would also contain Parishes and other similar geographical subdivisions) is US-centric.

In the addresses table I allow the city_id and county_id to be NULL because, based on HLGEM's comment, you can be in a city OR a county but not both in the state of Virginia. The application code would need to enforce the rule that at least one of these fields is not NULL.

addresses
id unsigned int(P)
street varchar(50)
extended varchar(50) // Default NULL
city_id unsigned int(F cities.id) Default NULL
county_id unsigned int(F counties.id) Default NULL
zip varchar(6) // Will handle all north American zips
zip4 char(4) // Default NULL
lat decimal(10,8) // Allows for ~1mm accuracy at equator. Default NULL
lon decimal(11,8) // Allows for ~1mm accuracy at equator. Default NULL

cities
id unsigned int(P)
state_id unsigned int(F states.id)
name varchar(45)
fips unsigned int // Default NULL
census_code unsigned int // Default NULL
census_class_code char(2) // Default NULL
gsa_code unsigned int // Default NULL
opm_code unsigned int // Default NULL

city_id and county_id form the Primary key and are foreign keys to their respective tables.

cities_counties
city_id unsigned int(F cities.id) ---\_(P)
county_id unsigned int(F counties.id)--/

counties
id unsigned int(P)
state_id unsigned int(F states.id)
name varchar(50)
fips unsigned int // Default NULL

See http://en.wikipedia.org/wiki/ISO_3166-1

countries
id char(2)(P)
iso3 char(3)(U)
iso_num char(3)(U)
name varchar(44)(U)

See FIPS state codes, FIPS region codes and ISO 3166-2.

states
id unsigned int(P)
country_id char(2)(F countries.id)
code varchar(3)(I)
name varchar(45)
fips unsigned int // Default NULL

Is there a benefit to storing street address data distinctly instead of just as a string?

I wrote a whole blog post about this a while back. There are very good reasons to store each piece of data in a separate field. Not least for validation of address data.

Of course, it depends what industry you're in and what the information is being used for. If invalid address data isn't costing your company anything, then by all means store invalid data. Be aware though that down the road you may wish to use this data for mailings, demographic reports etc. If the data is invalid, it's not trivial to fix it after the fact.

Here's my blog post:

http://www.endswithsaurus.com/2009/07/lesson-in-address-storage.html

Also, in reference to searching "Where StreetAddress Like '%whatever%'". This is all well and good if you're doing a quick search for your own benefit, but when you come to attempt to automate parts of your system that rely on address data or even attempt to drop duplicates, provide users with auto-suggest etc etc, performance is degraded to a point that it will become unusable the larger the address table.

If invalid addresses aren't a worry that is going to cost the company real cash, then it's not an issue - but then, if you're not using the addresses for anything that is beneficial financially (or likely to be in the future), then why are you storing that information in the first place?

@Snorfus Ah, you must be in the Prairies. I had overlooked including posting about land descriptions in my blog post but it's something I'm considering for a later post.

Legal subdivisions (LSDs) are used primarly in Oil & Gas and other primary resource industries in Alberta, Saskatchewan and Manitoba (although they are found in parts of B.C. too, they're not in such prevalent use). They all take the same format: Section, Township, Range, Meridian. For example:

SE 28-12-17-W5

This is the South Eastern corner of Section 28, Township 12, Range 17, West of the 5th Meridian.

You could simply use a single field and parse it with regular expressions or break it out into separate fields containing the breakdown of the LSD. Running regexes in SQL Server can be a pain when it comes to performance. My take on it is the same as that of address data in general, that because each piece of data is a separate unique piece of data that they should be stored in separate fields. However, given that the large majority of this type of address data is not used by the general public in lieu of a street address, I might recommend designing something that would allow this information to be separated from (but linked to) your main address data. Given however that the land description/LSD is also part of every Canadian address, I might be tempted to store it in my main address table depending on the target audience of the database.

Here's a post about the breakdown of the Alberta Land Resource System:

http://www1.agric.gov.ab.ca/%24department/deptdocs.nsf/all/agdex10302

One thing you will often find in Oil & Gas at least (which is where the bulk of my experience comes from) is that workers will often refer to only the first two parts of the LSD - i.e. 28 of 12, or 43 of 16. The remainder of the LSD is implied by the locality of the address - i.e. Grand Prairie, Fox Creek, Wolf Lake etc.



Related Topics



Leave a reply



Submit