How Important Are Lookup Tables

How important are lookup tables?

The answer depends a little if you are limited to freeware such as PostGreSQL (not fully SQL compliant), or if you are thinking about SQL (ie. SQL compliant) and large databases.

In SQL compliant, Open Architecture databases, where there are many apps using one database, and many users using different report tools (not just the apps) to access the data, standards, normalisation, and open architecture requirements are important.

Despite the people who attempt to change the definition of "normalisation", etc. to suit their ever-changing purpose, Normalisation (the science) has not changed.

  • if you have data values such as {Open; Closed; etc} repeated in data tables, that is data duplication, a simple Normalisation error: if you those values change, you may have to update millions of rows, which is very limited design.

    • Such values should be Normalised into a Reference or Lookup table, with a short CHAR(2) PK:

      O  Open
      C Closed
      U [NotKnown]
    • The data values {Open;Closed;etc} are no longer duplicated in the millions of rows. It also saves space.

    • the second point is ease of change, if Closed were changed to Expired, again, one row needs to be changed, and that is reflected in the entire database; whereas in the un-normalised files, millions of rows need to be changed.

    • Adding new data values, eg. (H,HalfOpen) is then simply a matter of inserting one row.

  • in Open Architecture terms, the Lookup table is an ordinary table. It exists in the [SQL compliant] catalogue; as long as the FOREIGN KEY relation has been defined, the report tool can find that as well.

  • ENUM is a Non-SQL, do not use it. In SQL the "enum" is a Lookup table.

  • The next point relates to the meaningfulness of the key.

    • If the Key is meaningless to the user, fine, use an {INT;BIGINT;GUID;etc} or whatever is suitable; do not number them incrementally; allow "gaps".
    • But if the Key is meaningful to the user, do not use a meaningless number, use a meaningful Relational Key.
  • Now some people will get in to tangents regarding the permanence of PKs. That is a separate point. Yes, of course, always use a stable value for a PK (not "immutable", because no such thing exists, and a system-generated key does not provide row uniqueness).

    • {M,F} are unlikely to change

    • if you have used {0,1,2,4,6}, well don't change it, why would you want to. Those values were supposed to be meaningless, remember, only a meaningful Key need to be changed.

    • if you do use meaningful keys, use short alphabetic codes, that developers can readily understand (and infer the long description from). You will appreciate this only when you code SELECT and realise you do not have to JOIN every Lookup table. Power users too, appreciate it.

  • Since PKs are stable, particularly in Lookup tables, you can safely code:

    WHERE status_code = 'O' -- Open

    You do not have to JOIN the Lookup table and obtain the data value Open, as a developer, you are supposed to know what the Lookup PKs mean.

Last, if the database were large, and supported BI or DSS or OLAP functions in addition to OLTP (as properly Normalised databases can), then the Lookup table is actually a Dimension or Vector, in Dimension-Fact analyses. If it was not there, then it would have to be added in, to satisfy the requirements of that software, before such analyses can be mounted.

  • If you do that to your database from the outset, you will not have to upgrade it (and the code) later.

Your Example

SQL is a low-level language, thus it is cumbersome, especially when it comes to JOINs. That is what we have, so we need to just accept the encumbrance and deal with it. Your example code is fine. But simpler forms can do the same thing.

A report tool would generate:

SELECT p.*,
s.name
FROM posts p,
status s
WHERE p.status_id = s.status_id
AND p.status_id = 'O'

Another Exaple

For banking systems, where we use short codes which are meaningful (since they are meaningful, we do not change them with the seasons, we just add to them), given a Lookup table such as (carefully chosen, similar to ISO Country Codes):


Eq Equity
EqCS Equity/Common Share
OTC OverTheCounter
OF OTC/Future

Code such as this is common:

WHERE InstrumentTypeCode LIKE "Eq%"

And the users of the GUI would choose the value from a drop-down that displays

{Equity/Common Share;Over The Counter},

not {Eq;OTC;OF}, not {M;F;U}.

Without a lookup table, you can't do that, either in the apps, or in the report tool.

Why is using a common-lookup table to restrict the status of entity wrong?

What Anith Sen is advising against is having a single lookup table for all lookup codes. That is the significance of the category column in his example. Having a separate table for each category is definitely the way to go.

This is because:

  1. we can use the look-up tables to restrict values through enforce foreign keys
  2. it makes it easier for the database to optimize queries which join data tables with look-up tables
  3. it scales better: one large lookup category can really skew performance

In your examples JobStatus and OrderStatus are separate categories. applicable to separate entities. That is why they need different look-up tables. There is not even a problem with sharing the same code table across several different data tables. Where it becomes problematic is when we have separate data tables (entities) for which some statuses are not appropriate: that is the time to split the codes out into separate look-up tables.

edit

I see you have edited your post to cite all of Anith's points. I think the most important point one is the first one, regarding constraints. If you want to restrict the ORDERS.STATUS column to have values from the OrderStatus category then you have to have a separate table to enforce a foreign key. Your alternatives are:

  • include a CodeCategory column on the ORDERS table and enforce a compound foreign key against the common CODES table, which now needs a unique key of (Category,Code).
  • duplicate the OrderStatus values in a check constraint
  • don't enforce the values in the database, and rely on the application's drop-down list to restrict the values.

All of those options suck, from the database perspective.

What is a lookup table?

What you have there is called a junction table. It is also known as:

  • cross-reference table
  • bridge table
  • join table
  • map table
  • intersection table
  • linking table
  • link table

But I've never seen the term "lookup table" used for this purpose.

Should I use a code in my lookup table

Summarizing all of the answers, I think there are four alternatives for the lookup table:

Alternative 1:
• Description (primary key, a longer varchar2 column)

Alternative 2:
• Code (primary key, a short varchar2 column)

• Description (not null, a longer varchar2 column)

Alternative 3:
• Id (a meaningless primary key, an integer value derived from a sequence)

• Description (not null, a longer varchar2 column)

Alternative 4:
• Id (a meaningless primary key, an integer value derived from a sequence)

• Code (unique key, a short varchar2 column)

• Description (not null, a longer varchar2 column)

The primary key column will be in the main table with a foreign key constraint on top.

Some characteristics per alternative:

Alternative 1:
• No join is required when querying the main table

• Clear meaning when doing ad-hoc queries on main table

• Requires more storage for main table

• Index on main table will be much bigger than in other alternatives

• Updating a Description value means maintenance trouble and possibly application downtime.

Alternative 2:
• Join is required when you want to retrieve the description value

• Join in not required if you want to filter on certain lookup values: you can use the code value for that.

• Pretty clear meaning when doing ad-hoc queries on main table

• Minimal additional storage requirements for main table

• Index on main table will be small.

• Updating a Description value is easy, however code is usually an abbreviation from the description. When updating a Description value, the code can become confusing.

Alternative 3:
• Join is required when you want to retrieve the description value

• When filtering on certain lookup values, you'd have to use the Description values in your queries as the Id's are meaningless.

• Meaning is not clear when doing ad-hoc queries on main table

• Minimal additional storage requirements for main table

• Index on main table will be small.

• Updating a Description value is easy and doesn't cause confusion as with Code values

Alternative 4:
• Join is required when you want to retrieve the description value

• Join is required when filtering on certain lookup values, you would use the Code value in the lookup table.

• Meaning is not clear when doing ad-hoc queries on main table

• Minimal additional storage requirements for main table

• Index on main table will be small

• Updating a Description value is easy and you can also very easily update the Code value as well to make it resemble the Description value. You may have to revisit some of your code when doing this, though.

Personal opinion:

I would look at how I plan to use the main table and the lookup table. Which queries will be important and have to run efficiently? Will the values ever change?

My personal choice would be alternative 2 or 4. I'd use alternative 2 if I was absolutely sure that the code value can never change. And this is rare. Country codes change, social security numbers change. Currency codes change, etcetera. So, most of the time, I'd choose alternative 4. I'd not be so concerned for an extra join, especially because the lookup table is a small table.

But: choose an alternative that suits your requirements.

Please feel free to edit the text when you know some more characteristics of an alternative.

Regards,

Rob.

Disadvantage of combined lookup table in mySQL vs individual lookup tables

Your 2nd lookup schema is absolutely unuseful.

You refer to a table by its name/index. But you cannot use this relation directly (tablename cannot be parametrized), you need to build conditional joining expression or use dynamic SQL. This is slower.

Your lookup table is reversable, i.e. the same reference may be written by 2 ways. Of course, you may add CHECK constraint like CHECK table_A < table_B (additionally it avoids self-references), but this again degrades the performance.

Your lookup does not prevent non-existent relations (for example, class and language are not related but nothing prevents to create a row for such relation). Again, additional constraint and decreased performance.

There are more disadvantages... but I'm too lazy to list them all.

Another very important point: Foreign key constraints assuring referential integrity cannot be used in the "combined lookup" approach. They needed to be simulated by complex and error prone triggers. Overall the "combined lookup" approach is just a horrible idea. – sticky bit


There is a rule - non-relational relations must be separated.


In the 1st scheme - does a student may study in more than one class at the same time? If not then you do not need in student_class lookup table, and class_id is an attribute in student table.

Lookup tables implementation - one table or separate tables

I suggest that :

A. Follow the organization standard if this is an enterprise system (some may laugh loud on this, I know). If such a thing exists, it would certainly promote individual tables.

B. Use Enums or 1 aggregated lookup table for programming level lookups only (such as error messages, etc,) if you must only. Any lookup data for business related data should be (in my opinion) be in a separate table for the following reasons at least:

  1. When you have separate tables, you need to use the correct table name when you join and not use a code column of the reference table. This makes writing queries less error prone. Writing "Select ... Where (TableID=12 and State="NY") AND (TableId=133 and Country="USA")"...style of coding is quite error prone during development. This is the major issue for me from coding perspective.

  2. RI errors on inserts and updates may be ambiguous when there is more 1 than reference to the lookup in the row being inserted or updated.

  3. In some cases, the a lookup table may have self references (relationships). For example, a Geographical location can be described as a hierarchy which would add more confusion to the model.

  4. The relationships (references) could loose meaning in your database. You will find that almost every table in your system is linked to this one table. It some how will not make sense.

  5. If you ever decided to allow the user to perform ad-hoc reporting, it would be difficult for them to use codes for lookup tables instead of names.

  6. I feel that the 1 table approach breaks Normalization concepts - Can't prove it now though.

An disadvantage, is that you may need to build an indexes on PKs and FKs for some (or all) of the separate tables. However, in the world of powerful database available today, this may not be a big deal.

There are plenty of discussion in the net that I should have read before answering your question, however I present some of the links if you care to take a look at some yourself:

Link 1, Link 2, Link 3, Link 4, Link 5...

Database Design - Multiple Lookup/Enum Tables or One Large Table?

Without knowing more about the application or requirements I would recommend having one table for each code type. IMO the database design would be more clear and self documenting to have foreign keys for each type of code you have.



Related Topics



Leave a reply



Submit