Maintaining Subclass Integrity in a Relational Database

Here are a couple of possibilities. One is a CHECK in each table that the student_id does not appear in any of the other sister subtype tables. This is probably expensive, and every time you need a new subtype you have to modify the constraint in all the existing tables.

CREATE TABLE athletes (
  student_id INT NOT NULL PRIMARY KEY,
  FOREIGN KEY (student_id) REFERENCES students(student_id),
  CHECK (student_id NOT IN (SELECT student_id FROM musicians 
                      UNION SELECT student_id FROM slackers 
                      UNION ...)) 
);

edit: @JackPDouglas correctly points out that the above form of CHECK constraint is not supported by Microsoft SQL Server. Nor, in fact, is it valid per the SQL-99 standard to reference another table (see http://kb.askmonty.org/v/constraint_type-check-constraint).

SQL-99 defines a metadata object for multi-table constraints, called an ASSERTION; however, I don't know of any RDBMS that implements assertions.

Probably a better way is to add a second column to the students table that denotes the subtype, and to make (student_id, student_type) a compound key there. Keep student_id unique on its own as well, otherwise the same student could be registered under two types. Then restrict that column in each child table to a single value corresponding to the subtype represented by the table. edit: no need to make the PK a compound key in child tables.

CREATE TABLE athletes (
  student_id INT NOT NULL PRIMARY KEY,
  student_type CHAR(4) NOT NULL CHECK (student_type = 'ATHL'),
  FOREIGN KEY (student_id, student_type) REFERENCES students(student_id, student_type)
);
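
The composite foreign key above needs a key on (student_id, student_type) in students to point at. Here's a rough sketch of what the parent table could look like, assuming student_id stays the primary key and the pair is declared as an extra UNIQUE constraint purely as a foreign key target. The type codes other than 'ATHL' are made up for illustration, and students has to be created before athletes.

```sql
CREATE TABLE students (
  student_id   INT     NOT NULL PRIMARY KEY,
  student_type CHAR(4) NOT NULL
    CHECK (student_type IN ('ATHL', 'MUSC', 'SLAK')),  -- hypothetical codes
  -- Redundant for uniqueness, but gives the child tables a
  -- two-column key that their foreign keys can reference.
  UNIQUE (student_id, student_type)
);

-- Once athletes exists as defined above:
INSERT INTO students (student_id, student_type) VALUES (1, 'ATHL');
INSERT INTO athletes (student_id, student_type) VALUES (1, 'ATHL');  -- accepted

-- A musicians table built the same way, with CHECK (student_type = 'MUSC'),
-- would reject student 1: its foreign key looks for (1, 'MUSC') in students
-- and only (1, 'ATHL') is there.
```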

Of course student_type could just as easily be an integer; I'm only showing it as a char for illustration purposes.

If you don't have support for CHECK constraints (e.g. MySQL, which before version 8.0.16 parses CHECK clauses but does not enforce them), then you can do something similar in a trigger.
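
As a sketch of the trigger route, assuming MySQL 5.5 or later (for SIGNAL) and the athletes table above: the trigger names and the message text are my own, and DELIMITER is a mysql command-line client directive rather than SQL.

```sql
DELIMITER //

-- Reject any new athletes row whose type is not 'ATHL'
CREATE TRIGGER athletes_type_bi BEFORE INSERT ON athletes
FOR EACH ROW
BEGIN
  IF NEW.student_type <> 'ATHL' THEN
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'athletes.student_type must be ATHL';
  END IF;
END//

-- Same rule for updates, so the value cannot be changed afterwards
CREATE TRIGGER athletes_type_bu BEFORE UPDATE ON athletes
FOR EACH ROW
BEGIN
  IF NEW.student_type <> 'ATHL' THEN
    SIGNAL SQLSTATE '45000'
      SET MESSAGE_TEXT = 'athletes.student_type must be ATHL';
  END IF;
END//

DELIMITER ;
```

You need both triggers; an INSERT-only trigger would leave the column open to a later UPDATE.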

I read your followup about making sure a row exists in some subclass table for every row in the superclass table. I don't think there's a practical way to do this with SQL metadata and constraints. The only option I can suggest to meet this requirement is to use Single-Table Inheritance. Otherwise you need to rely on application code to enforce it.

edit: JackPDouglas also suggests using a design based on Class Table Inheritance. See his example or my examples of the similar technique here or here or here.

Maintaining Integrity

The answer is that you should use an input to the stored procedure that is unambiguous, like the employee id, not the employee name. In a UI, you'd allow the user to choose a user by name, including enough information -- like email, office number, etc. -- to allow them to choose the correct one. Your program, however, would use the id of the selected employee when calling the stored procedure.

How to Implement Referential Integrity in Subtypes

None of that is necessary, especially not the doubling up of the tables.

Introduction

Since the Standard for Modelling Relational Databases (IDEF1X) has been in common use for over 25 years (at least in the high quality, high performance end of the market), I use that terminology. Date & Darwen, ~~despite~~¹ consistent with the great work they have done to ~~progress~~ suppress the Relational Model, were unaware of IDEF1X until I brought it to their attention in 2009, and thus have a new terminology² for the Standard terminology that we have been using for decades. Further, the new terminology does not deal with all the cases, as IDEF1X does. Therefore I use the established Standard terminology, and avoid new terminology.

Even the concept of a "distributed key" fails to recognise the underlying ordinary PK::FK Relations, their implementation in SQL, and their power.
The Relational, and therefore IDEF1X, concept is Identifiers and Migration thereof.
Sure, the vendors are not exactly on the ball, and they have weird things such as "partial Indices" etc, which are completely unnecessary when the basics are understood. But famous “academics” and “theoreticians” coming up with incomplete new concepts when the concept was standardised and given full treatment 25 years ago ... that is unexpected and unacceptable.

Caveat

IEC/ISO/ANSI SQL barely handles Codd’s 3NF (Date & Darwen’s “5NF”) adequately, and it does not support Basetype-Subtype structures at all; there are no Declarative Constraints for this (and there should be).

Therefore, in order to enforce the full set of Rules expressed in the Data Model, both Basetype::Subtype and Subtype::Basetype, we have to fiddle a little with CHECK CONSTRAINTs, etc (I avoid using Triggers for a number of reasons).

Relief

However, I take all that into account. In order for me to effectively provide a Data Modelling service on Stack Overflow, without having to preface that with a full discourse, I purposely provide models that can be implemented by capable people, using existing SQL and existing Constraints, to whatever extent they require. It is already simplified, and contains the common level of enforcement.

We can use both the example graphic in the linked document and your fully IDEF1X-compliant Sensor Data Model

Readers who are not familiar with the Relational Modelling Standard may find IDEF1X Notation useful. Readers who think a database can be mapped to objects, classes, and subclasses are advised that reading further may cause injury. This is further than Fowler and Ambler have read.

Implementation of Referential Integrity for Basetype-Subtype

There are two types of Basetype-Subtype structures.

Exclusive Subtype

Exclusive means there must be one and only one Subtype row for each Basetype row. In IDEF1X terms, there should be a Discriminator column in the Basetype, which identifies the Subtype row that exists for it.

For more than two Subtypes, this is demanded, and I implement a Discriminator column.
For two Subtypes, since this is easily derived from existing data (eg. Sensor.IsSwitch is the Discriminator for Reading), I do not model an additional explicit Discriminator column for Reading. However, you are free to follow the Standard to the letter and implement a Discriminator.

I will take each aspect in detail.

The Discriminator column needs a CHECK CONSTRAINT to ensure it is within the range of values, eg: IN ("B", "C", "D"). IsSwitch is a BIT, which is 0 or 1, so that is already constrained.
Since the PK of the Basetype defines its uniqueness, only one Basetype row will be allowed; no second Basetype row (and thus no second Subtype row) can be inserted.

Therefore it is overkill, completely redundant, and an additional unnecessary Index, to implement an Index such as (PK, Discriminator) in the Basetype, as your link advises. (The uniqueness is in the PK, and therefore the PK plus anything will be unique.)
IDEF1X does not require the Discriminator in the Subtype tables. In the Subtype, which is again constrained by the uniqueness of its PK, as per the model, if the Discriminator was implemented as a column in that table, every row in it would have the same value for the Discriminator (every Book will be "B"; every ReadingSwitch will be an IsSwitch). Therefore it is absurd to implement the Discriminator as a column in the Subtype. And again, it is completely redundant, and an additional unnecessary Index, to implement an Index such as (PK, Discriminator) in the Subtype. (The uniqueness is in the PK, and therefore the PK plus anything will be unique.)
The method identified in the link is a ham-fisted and bloated (massive data duplication for no purpose) way of implementing Referential Integrity. There is probably a good reason the author has not seen that construct anywhere else. It is a basic failure to understand SQL and to use it effectively as it is. These "solutions" are typical of people who follow a dogma "SQL can't do ..." and thus are blind to what SQL can do. The horrors that result from Fowler and Ambler's blind "methods" are even worse.

The Subtype PK is also the FK to the Basetype, that is all that is required, to ensure that the Subtype does not exist without a parent Basetype.

Therefore for any given PK, whichever Basetype-Subtype is inserted first will succeed; and whichever Basetype-Subtype is attempted after that, will fail. Therefore there is nothing to worry about in the Subtype table (a second Basetype row or a second Subtype row for the same PK is prevented).

The SQL CHECK CONSTRAINT is limited to checking the inserted row. We need to check the inserted row against other rows, either in the same table, or in another table. Therefore a 'User Defined' Function is required.

Write a simple UDF that will check for existence of the PK and the Discriminator in the Basetype, and return 1 if it EXISTS or 0 if it does NOT EXIST. You will need one UDF per Basetype (not per Subtype).
In the Subtype, implement a CHECK CONSTRAINT that calls the UDF, using the PK (which is both the Basetype and the Subtype) and the Discriminator value.
I have implemented this in scores of large, real world databases, on different SQL platforms. Here is the 'User Defined' Function Code, and the DDL Code for the objects it is based on.
This particular syntax and code is tested on Sybase ASE 15.0.2 (they are very conservative about SQL Standards compliance).
I am aware that the limitations on 'User Defined' Functions are different for every SQL platform. However, this is the simplest of the simple, and AFAIK every platform allows this construct. (No idea what the Non-SQLs do.)
Yes, of course this clever little technique can be used to implement any non-trivial data rule that you can draw in a Data Model, and in particular to overcome the limitations of SQL. Note my caution to avoid two-way Constraints (circular references).

Therefore the CHECK CONSTRAINT in the Subtype, ensures that the PK plus the correct Discriminator exists in Basetype. Which means that only that Subtype exists for the Basetype (the PK).

Any subsequent attempt to insert another Subtype (ie. break the Exclusive Rule) will fail because the PK+Discriminator does not exist in the Basetype.
Any subsequent attempt to insert another row of the same Subtype is prevented by the uniqueness of its PK Constraint.
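
The linked code is not reproduced here, so the following is only a sketch of the shape of the technique, written in SQL Server T-SQL rather than the Sybase ASE syntax mentioned above. All object names (Product, ProductBook, Product_HasType, the Name and Isbn columns) are hypothetical, and GO is a client batch separator, not SQL.

```sql
-- Basetype: the Discriminator is range-checked against the known Subtype codes
CREATE TABLE dbo.Product (
    ProductId   INT         NOT NULL,
    ProductType CHAR(1)     NOT NULL,
    Name        VARCHAR(60) NOT NULL,
    CONSTRAINT Product_PK      PRIMARY KEY (ProductId),
    CONSTRAINT Product_Type_ck CHECK (ProductType IN ('B', 'C', 'D'))
);
GO

-- One UDF per Basetype: returns 1 if the PK exists with that Discriminator, else 0
CREATE FUNCTION dbo.Product_HasType (@ProductId INT, @ProductType CHAR(1))
RETURNS INT
AS
BEGIN
    RETURN CASE WHEN EXISTS (
               SELECT 1
                 FROM dbo.Product
                WHERE ProductId   = @ProductId
                  AND ProductType = @ProductType
           ) THEN 1 ELSE 0 END;
END;
GO

-- Subtype: the PK is also the FK to the Basetype; the CHECK calls the UDF
-- with this Subtype's own Discriminator value
CREATE TABLE dbo.ProductBook (
    ProductId INT      NOT NULL,
    Isbn      CHAR(13) NOT NULL,
    CONSTRAINT ProductBook_PK         PRIMARY KEY (ProductId),
    CONSTRAINT ProductBook_Product_FK FOREIGN KEY (ProductId)
        REFERENCES dbo.Product (ProductId),
    CONSTRAINT ProductBook_Excl_ck    CHECK (dbo.Product_HasType(ProductId, 'B') = 1)
);
```

With this in place, inserting a ProductBook row for a Product whose ProductType is 'C' fails the CHECK, and a second ProductBook row for the same ProductId fails the PRIMARY KEY.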

The only bit that is missing (not mentioned in the link) is that the Rule "every Basetype must have at least one Subtype" is not enforced. This is easily covered in Transactional code (I do not advise Constraints going in two directions, or Triggers); use the right tool for the job.
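
A sketch of what such Transactional code could look like in SQL Server T-SQL, assuming hypothetical tables Product (ProductId, ProductType, Name) as the Basetype and ProductBook (ProductId, Isbn) as one of its Subtypes. The values are placeholders.

```sql
SET XACT_ABORT ON;   -- any run-time error rolls back the whole transaction

BEGIN TRANSACTION;

    -- Basetype row first, so the Subtype's FOREIGN KEY can be satisfied
    INSERT INTO Product (ProductId, ProductType, Name)
    VALUES (1001, 'B', 'Some title');

    -- Matching Subtype row in the same Logical Unit of Work
    INSERT INTO ProductBook (ProductId, Isbn)
    VALUES (1001, '9780000000002');

COMMIT TRANSACTION;
```

Either both rows are committed or neither is, so no Basetype row is ever left without its Subtype row as long as this is the only code path that inserts into those tables.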

Non-exclusive Subtype

The Basetype (parent) can host more than one Subtype (child)

There is no single Subtype to be identified.

The Discriminator does not apply to Non-exclusive Subtypes.
The existence of a Subtype is identified by performing an existence check on the Subtype table, using the Basetype PK.

Simply exclude the CHECK CONSTRAINT that calls the UDF above.

The PRIMARY KEY, FOREIGN KEY, and the usual Range CHECK CONSTRAINTs, adequately support all requirements for Non-exclusive Subtypes.
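
For illustration only, here is a small Non-exclusive structure in fairly portable SQL. The Person, PersonEmployee and PersonCustomer tables and their columns are hypothetical. Note there is no Discriminator and no UDF-based CHECK.

```sql
CREATE TABLE Person (
    PersonId INT         NOT NULL,
    Name     VARCHAR(60) NOT NULL,
    CONSTRAINT Person_PK PRIMARY KEY (PersonId)
);

-- A Person may be an Employee ...
CREATE TABLE PersonEmployee (
    PersonId INT  NOT NULL,
    HireDate DATE NOT NULL,
    CONSTRAINT PersonEmployee_PK        PRIMARY KEY (PersonId),
    CONSTRAINT PersonEmployee_Person_FK FOREIGN KEY (PersonId)
        REFERENCES Person (PersonId)
);

-- ... and, independently, a Customer
CREATE TABLE PersonCustomer (
    PersonId    INT           NOT NULL,
    CreditLimit DECIMAL(10,2) NOT NULL,
    CONSTRAINT PersonCustomer_PK        PRIMARY KEY (PersonId),
    CONSTRAINT PersonCustomer_Person_FK FOREIGN KEY (PersonId)
        REFERENCES Person (PersonId),
    CONSTRAINT PersonCustomer_Limit_ck  CHECK (CreditLimit >= 0)
);

-- Which Subtypes exist for each Person: an existence check per Subtype table
SELECT p.PersonId,
       p.Name,
       CASE WHEN EXISTS (SELECT 1 FROM PersonEmployee e
                          WHERE e.PersonId = p.PersonId)
            THEN 1 ELSE 0 END AS IsEmployee,
       CASE WHEN EXISTS (SELECT 1 FROM PersonCustomer c
                          WHERE c.PersonId = p.PersonId)
            THEN 1 ELSE 0 END AS IsCustomer
  FROM Person p;
```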

Reference

For further detail; a diagrammatic overview including details; and the distinction between Subtypes and Optional Column tables, refer to this Subtype document.

Note

I, too, was taken in by C J Date's and Hugh Darwen's constant references to "furthering" the Relational Model. After many years of interaction, based on the mountain of consistent evidence, I have concluded that their work is in fact, a debasement of it. They have done nothing to further Dr E F Codd's seminal work, the Relational Model, and everything to damage and suppress it.
They have private definitions for Relational terms, which of course severely hinders any communication. They have new terminology for terms we have had since 1970, in order to appear that they have "invented" it.

Response to Comment

(This section can be skipped by all readers who did not comment.)

Unfortunately, some people are so schooled in doing things the wrong way, at massive additional cost, that even when directed clearly in the right way, they cannot understand it. Perhaps that is why proper education cannot be replaced by a Question-and-Answer format.

Sam:
I’ve noticed that this approach doesn't prevent someone from using UPDATE to change a Basetype's discriminator value. How could that be prevented? The FOREIGN KEY + duplicate Discriminator column in subtypes approach seems to overcome this.

Yes. This Method doesn't prevent someone using UPDATE to change a Key, or a column in some unrelated table, or headaches, either. It answers a specific question, and nothing else. If you wish to prevent certain DML commands or whatever, use the SQL facility that is designed for that purpose. All that is way beyond the scope of this question. Otherwise every answer has to address every unrelated issue.

Answer. Since we should be using Open Architecture Standards, available since 1993, all changes to the db are via ACID Transactions, only. That means direct INSERT/UPDATE/DELETE, to all tables are prohibited; the data retains Integrity and Consistency (ACID terminology). Otherwise, sure, you have a mess, such as your eg. and the consequences. The proponents of this method do not understand Transactions, they understand only single file INSERT/UPDATE/DELETE.

Further, the FK+Duplicate D+Duplicate Index (and the massive cost therein !) does nothing of the sort, I don't know where you got "seems" from.

dtheodor:
This question is about referential integrity. Referential integrity doesn't mean "check that the reference is valid on insert and then forget about it". It means "maintain the validity of the reference forever". The duplicate discriminator + FK method guarantees this integrity, your UDF approach does not. It's without question that UPDATEs should not break the reference.

The problem here is two-fold. First, you need basic education in other areas regarding Relational Databases and Open Architecture Standards. Again, it is best to open a new question here, so a complete answer to that other area of Relational Databases can be provided.

OK, short answer, which really belongs in another question: How is the Discriminator in Exclusive Subtypes Protected from an Invalid UPDATE?

Clarity. Yes, Referential Integrity doesn't mean "check that the reference is valid on insert and then forget about it". I didn't say that it meant that either.

Referential Integrity means the References in the database FOREIGN KEY has Integrity with the PRIMARY KEY that it references.
Declarative Referential Integrity means the declared References in the database ...
CONSTRAINT FOREIGN KEY ... REFERENCES ...

CONSTRAINT CHECK ...

are maintained by the RDBMS platform and not by the application code.
It does not mean "maintain the validity of the reference forever" either.

The original question regards RI for Subtypes, and I have answered it, providing DRI.

The point that massively inefficient structures and duplicated tables, are not required, must be emphasised.

Your question does not regard RI or DRI.
Your question, although asked incorrectly (you are expecting the Method to provide what it does not provide, and you do not understand that your requirement is fulfilled by other means), is: How is the Discriminator in Exclusive Subtypes Protected from an Invalid UPDATE?
The answer is, use the Open Architecture Standards that we should have been using since 1993. That prevents all invalid UPDATEs. Do please read the linked documents and understand them; your concern is a non-issue, it does not exist. That is the short answer.
But you did not understand the short answer, so I will explain it here.

No one is allowed to walk up to the database and change a column here or a value there, using either SQL directly or an app that uses SQL directly. If that were allowed, you would not have a secured database.
All updates (lower case) to the database (including multi-row INSERT/UPDATE/DELETE) are implemented as ACID SQL Transactions. And nothing but Transactions. The set of Transactions constitute the Database API, that is exposed to any application that uses the database.
- SQL has ACID Transactions. Non-SQL databases do not have Transactions. Proponents of these database systems know absolutely nothing about Transactions, let alone Open Architecture. Their Non-architecture is a monolithic stack. And a “database” that gets refactored every month.
Since the only Transactions that you write will insert the basetype+subtype in a single Transaction, as a single Logical Unit of Work, the Integrity (data Integrity, not Referential Integrity) of the basetype::subtype relation is maintained, and maintained within the database. Therefore all updates to the database will be Valid, there will not be any Invalid updates.
Since you are not so stupid as to write code that UPDATEs the Discriminator column in a single row without the attendant DELETE Previous_Subtype, place it in a Transaction, and GRANT EXEC permission for it to user ROLES, there will not be an Invalid Discriminator anywhere in the database.
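
As a rough sketch of such a Transaction in SQL Server T-SQL (THROW needs SQL Server 2012 or later): the tables Product, ProductBook and ProductCD, the Discriminator codes 'B' and 'C', the procedure name and the ProductMaintainer role are all hypothetical, and ProductCD is assumed to need only its key column.

```sql
CREATE PROCEDURE dbo.Product_ChangeBookToCD
    @ProductId INT
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;   -- any error rolls back the whole transaction

    BEGIN TRANSACTION;

    -- Change the Discriminator only if the row is currently a Book
    UPDATE dbo.Product
       SET ProductType = 'C'
     WHERE ProductId   = @ProductId
       AND ProductType = 'B';

    IF @@ROWCOUNT = 0
        THROW 50001, 'Product is not a Book; nothing changed.', 1;

    -- The attendant DELETE of the previous Subtype, and INSERT of the new one
    DELETE FROM dbo.ProductBook WHERE ProductId = @ProductId;
    INSERT INTO dbo.ProductCD (ProductId) VALUES (@ProductId);

    COMMIT TRANSACTION;
END;
GO

-- Users get the Transaction, not the tables
GRANT EXECUTE ON dbo.Product_ChangeBookToCD TO ProductMaintainer;
```

The point of the design is that the role holds no direct INSERT/UPDATE/DELETE permission on the tables, so the Discriminator can only change through code that also moves the Subtype row.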

Discovering the subclass of a row in the superclass table

"but that seems like it would be an
unnecessary amount of joins to find
out something so simple."

You got it.

I have been in your shoes, and luckily I still have this site bookmarked in case I need to reference it again. It even includes information on how to set up the constraints you are looking for.

Assuming you're using SQL Server, here you go:
Implementing Table Inheritance in SQL Server

In a StackOverflow clone, what relationship should a Comments table have to Questions and Answers?

I'd go with the Posts approach. This is the best way to ensure referential integrity.

If you need additional columns for Answers and Questions respectively, put them in additional tables with a one-to-one relationship with Posts.

For example, in MySQL syntax:

CREATE TABLE Posts (
  post_id     SERIAL PRIMARY KEY,
  post_type   CHAR(1),              -- must be 'Q' or 'A'
  -- other columns common to both types of Post
  UNIQUE KEY (post_id, post_type) -- to support foreign keys
) ENGINE=InnoDB;

CREATE TABLE Comments (
  comment_id  SERIAL PRIMARY KEY, 
  post_id     BIGINT UNSIGNED NOT NULL,
  -- other columns for comments (e.g. date, who, text)
  FOREIGN KEY (post_id) REFERENCES Posts(post_id)
) ENGINE=InnoDB; 

CREATE TABLE Questions (
  post_id     BIGINT UNSIGNED PRIMARY KEY,
  post_type   CHAR(1),              -- must be 'Q'
  -- other columns specific to Questions
  FOREIGN KEY (post_id, post_type) REFERENCES Posts(post_id, post_type)
) ENGINE=InnoDB;

CREATE TABLE Answers (
  post_id     BIGINT UNSIGNED PRIMARY KEY,
  post_type   CHAR(1),              -- must be 'A'
  question_id BIGINT UNSIGNED NOT NULL,
  -- other columns specific to Answers
  FOREIGN KEY (post_id, post_type) REFERENCES Posts(post_id, post_type),
  FOREIGN KEY (question_id) REFERENCES Questions(post_id)
) ENGINE=InnoDB;

This is called Class Table Inheritance. There's a nice overview of modeling inheritance with SQL in this article: "Inheritance in relational databases."

It can be helpful to use post_type so a given Post can be only one answer or one question. You don't want both an Answer and a Question to reference one given Post. So this is the purpose of the post_type column above. You can use CHECK constraints to enforce the values in post_type, or else use a trigger if your database doesn't support CHECK constraints.
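
For example, the "must be" comments in the Class Table Inheritance DDL above could be turned into real constraints along these lines. This is a sketch assuming MySQL 8.0.16 or later (earlier versions parse CHECK but ignore it); the constraint names are my own.

```sql
-- A CHECK passes when its expression is NULL, and post_type is declared
-- nullable above, so NULL is excluded explicitly here.
ALTER TABLE Posts
  ADD CONSTRAINT posts_type_ck
  CHECK (post_type IS NOT NULL AND post_type IN ('Q', 'A'));

ALTER TABLE Questions
  ADD CONSTRAINT questions_type_ck
  CHECK (post_type IS NOT NULL AND post_type = 'Q');

ALTER TABLE Answers
  ADD CONSTRAINT answers_type_ck
  CHECK (post_type IS NOT NULL AND post_type = 'A');
```

Declaring post_type as NOT NULL in the CREATE TABLE statements would be the tidier way to rule out NULL; the explicit test just keeps this sketch independent of the column definitions.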

I also did a presentation that may help you. The slides are up at http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back. You should read the sections on Polymorphic Associations and Entity-Attribute-Value.

If you use Single Table Inheritance (you said you're using Ruby on Rails), then the SQL DDL would look like this:

CREATE TABLE Posts (
  post_id     SERIAL PRIMARY KEY,
  post_type   CHAR(1),              -- must be 'Q' or 'A'
  -- other columns for both types of Post
  -- Question-specific columns are NULL for Answers, and vice versa.
) ENGINE=InnoDB;

CREATE TABLE Comments (
  comment_id  SERIAL PRIMARY KEY, 
  post_id     BIGINT UNSIGNED NOT NULL,
  -- other columns for comments (e.g. date, who, text)
  FOREIGN KEY (post_id) REFERENCES Posts(post_id)
) ENGINE=InnoDB;

You can use a foreign key constraint in this example, and I recommend that you do! :-)

Rails philosophy tends to favor putting enforcement of the data model into the application layer. But without constraints enforcing integrity in the database, you run the risk that bugs in your application, or ad hoc queries from a query tool, can harm data integrity.
