Designing SQL Database to Represent Oo Class Hierarchy

Designing SQL database to represent OO class hierarchy

In general I prefer obtion "B" (i.e. one table for base class and one table for each "concrete" subclass).

Of course this has a couple of drawbacks: first of all you have to join at least 2 tables whenever you have to read a full instance of a subclass. Also, the "base" table will be constantly accessed by anyone who has to operate on any kind of note.

But this is usually acceptable unless you have extreme cases (billions of rows, very quick response times required and so on).

There is a third possible option: map each subclass to a distinct table. This helps partitioning your objects but costs more in development effort, in general.

See this for a complete discussion.

(Regarding your "C" solution, using VARIANT: I can't comment on the merits/demerits, because it looks like a proprietary solution - what is it ? Transact-SQL? and I am not familiar with it).

How do you effectively model inheritance in a database?

There are several ways to model inheritance in a database. Which you choose depends on your needs. Here are a few options:

Table-Per-Type (TPT)

Each class has its own table. The base class has all the base class elements in it, and each class which derives from it has its own table, with a primary key which is also a foreign key to the base class table; the derived table's class contains only the different elements.

So for example:

class Person {
    public int ID;
    public string FirstName;
    public string LastName;
}

class Employee : Person {
    public DateTime StartDate;
}

Would result in tables like:

table Person
------------
int id (PK)
string firstname
string lastname

table Employee
--------------
int id (PK, FK)
datetime startdate

Table-Per-Hierarchy (TPH)

There is a single table which represents all the inheritance hierarchy, which means several of the columns will probably be sparse. A discriminator column is added which tells the system what type of row this is.

Given the classes above, you end up with this table:

table Person
------------
int id (PK)
int rowtype (0 = "Person", 1 = "Employee")
string firstname
string lastname
datetime startdate

For any rows which are rowtype 0 (Person), the startdate will always be null.

Table-Per-Concrete (TPC)

Each class has its own fully formed table with no references off to any other tables.

Given the classes above, you end up with these tables:

table Person
------------
int id (PK)
string firstname
string lastname

table Employee
--------------
int id (PK)
string firstname
string lastname
datetime startdate

How to model an object oriented design in a database?

Clearly, your option 1 is the worst choice. The main problem here is that you need to duplicate all rows of B and C in A, thus creating a severe maintenance problem.

Your option 2 is called the Single Table Inheritance pattern, which is recommended in cases where subtables do not have (m)any additional columns.

Your option 3 is called the Joined Tables Inheritance pattern where subtables (representing subclasses) are joined to their supertable via their primary key being also a foreign key referencing the supertable.

So, for the case of your abstract example, it seems that option 2 is the recommended approach, since your tables B and C have only one additional column.

Notice that according to the Joined Tables Inheritance approach, there is no need to add a primary key attribute (as you did with id in B and C). You just use the same primnary key as the supertable and make it also a foreign key referencing the supertable. So, in this approach, the schema would be

A { id PK, 
    a, b, c, d 
}
B { id PK REFERENCES A, 
    e 
}
C { id PK REFERENCES A, 
    f 
}

You can read more about this in the section Subtyping and Inheritance with Database Tables of my tutorial about developing front-end web applications with class hierarchies.

How to model a hierarchy of (sub)categories in a Class Diagram and in an SQL Table Model?

A hierarchy (or tree) of categories is modeled with a recursive one-to-many association, which associates the Category class with itself, as shown in the following diagram:

Sample Image

Such an Information Design Model can be transformed into an OOP Class Model and into an SQL Table Model, like so:

Sample Image

Notice how the superCategory column is designated as a foreign key referencing the categories table in the table model: by a UML dependency arrow stereotyped as «fk».

I hope you can figure out yourself how to code the SQL Table Model in an SQL Create Table statement.

p.s.: For more about how to model databases with UML CLass Diagrams seee https://stackoverflow.com/a/21394401/2795909

Does Stack Exchange's database schema follow good practice?

This is a classic case of so-called "Object-relational impedance mismatch". Specifically, you are taking about mapping OO's inheritance into a relational database structure. There are several common ways of doing that -

A table per subclass,
A table per leaf subclass, and
A table per class hierarchy (with a discriminator)

Each of these strategies is perfectly valid. Moreover, the structures could be mixed as needed.

It looks like Stack Exchange used a table per class hierarchy approach, with PostTypeId serving as a discriminator. This approach is as valid as any other approach that they could have taken. It is also one of the simplest ones to take from the maintenance standpoint, because it lets you construct manual queries with less work.

There is another thing in the structure of the table that you did not mention: it is not normalized. Specifically, there are AnswerCount and CommentCount fields that store information that could be obtained by aggregating the table (i.e. running a SELECT COUNT(*) FROM ... WHERE ... AND other.ParentId = p.Id ...) This is a common tradeoff between normalization and speed of execution: most likely, the profiling has indicated that the aggregation takes significant amount of time, so the counts have been moved into the "parent" record.

How can you represent inheritance in a database?

@Bill Karwin describes three inheritance models in his SQL Antipatterns book, when proposing solutions to the SQL Entity-Attribute-Value antipattern. This is a brief overview:

Single Table Inheritance (aka Table Per Hierarchy Inheritance):

Using a single table as in your first option is probably the simplest design. As you mentioned, many attributes that are subtype-specific will have to be given a NULL value on rows where these attributes do not apply. With this model, you would have one policies table, which would look something like this:

+------+---------------------+----------+----------------+------------------+
| id   | date_issued         | type     | vehicle_reg_no | property_address |
+------+---------------------+----------+----------------+------------------+
|    1 | 2010-08-20 12:00:00 | MOTOR    | 01-A-04004     | NULL             |
|    2 | 2010-08-20 13:00:00 | MOTOR    | 02-B-01010     | NULL             |
|    3 | 2010-08-20 14:00:00 | PROPERTY | NULL           | Oxford Street    |
|    4 | 2010-08-20 15:00:00 | MOTOR    | 03-C-02020     | NULL             |
+------+---------------------+----------+----------------+------------------+

\------ COMMON FIELDS -------/          \----- SUBTYPE SPECIFIC FIELDS -----/

Keeping the design simple is a plus, but the main problems with this approach are the following:

When it comes to adding new subtypes, you would have to alter the table to accommodate the attributes that describe these new objects. This can quickly become problematic when you have many subtypes, or if you plan to add subtypes on a regular basis.
The database will not be able to enforce which attributes apply and which don't, since there is no metadata to define which attributes belong to which subtypes.
You also cannot enforce NOT NULL on attributes of a subtype that should be mandatory. You would have to handle this in your application, which in general is not ideal.

Concrete Table Inheritance:

Another approach to tackle inheritance is to create a new table for each subtype, repeating all the common attributes in each table. For example:

--// Table: policies_motor
+------+---------------------+----------------+
| id   | date_issued         | vehicle_reg_no |
+------+---------------------+----------------+
|    1 | 2010-08-20 12:00:00 | 01-A-04004     |
|    2 | 2010-08-20 13:00:00 | 02-B-01010     |
|    3 | 2010-08-20 15:00:00 | 03-C-02020     |
+------+---------------------+----------------+
                          
--// Table: policies_property    
+------+---------------------+------------------+
| id   | date_issued         | property_address |
+------+---------------------+------------------+
|    1 | 2010-08-20 14:00:00 | Oxford Street    |   
+------+---------------------+------------------+

This design will basically solve the problems identified for the single table method:

Mandatory attributes can now be enforced with NOT NULL.
Adding a new subtype requires adding a new table instead of adding columns to an existing one.
There is also no risk that an inappropriate attribute is set for a particular subtype, such as the vehicle_reg_no field for a property policy.
There is no need for the type attribute as in the single table method. The type is now defined by the metadata: the table name.

However this model also comes with a few disadvantages:

The common attributes are mixed with the subtype specific attributes, and there is no easy way to identify them. The database will not know either.
When defining the tables, you would have to repeat the common attributes for each subtype table. That's definitely not DRY.
Searching for all the policies regardless of the subtype becomes difficult, and would require a bunch of UNIONs.

This is how you would have to query all the policies regardless of the type:

SELECT     date_issued, other_common_fields, 'MOTOR' AS type
FROM       policies_motor
UNION ALL
SELECT     date_issued, other_common_fields, 'PROPERTY' AS type
FROM       policies_property;

Note how adding new subtypes would require the above query to be modified with an additional UNION ALL for each subtype. This can easily lead to bugs in your application if this operation is forgotten.

Class Table Inheritance (aka Table Per Type Inheritance):

This is the solution that @David mentions in the other answer. You create a single table for your base class, which includes all the common attributes. Then you would create specific tables for each subtype, whose primary key also serves as a foreign key to the base table. Example:

CREATE TABLE policies (
   policy_id          int,
   date_issued        datetime,

   -- // other common attributes ...
);

CREATE TABLE policy_motor (
    policy_id         int,
    vehicle_reg_no    varchar(20),

   -- // other attributes specific to motor insurance ...

   FOREIGN KEY (policy_id) REFERENCES policies (policy_id)
);

CREATE TABLE policy_property (
    policy_id         int,
    property_address  varchar(20),

   -- // other attributes specific to property insurance ...

   FOREIGN KEY (policy_id) REFERENCES policies (policy_id)
);

This solution solves the problems identified in the other two designs:

Mandatory attributes can be enforced with NOT NULL.
Adding a new subtype requires adding a new table instead of adding columns to an existing one.
No risk that an inappropriate attribute is set for a particular subtype.
No need for the type attribute.
Now the common attributes are not mixed with the subtype specific attributes anymore.
We can stay DRY, finally. There is no need to repeat the common attributes for each subtype table when creating the tables.
Managing an auto incrementing id for the policies becomes easier, because this can be handled by the base table, instead of each subtype table generating them independently.
Searching for all the policies regardless of the subtype now becomes very easy: No UNIONs needed - just a SELECT * FROM policies.

I consider the class table approach as the most suitable in most situations.

The names of these three models come from Martin Fowler's book Patterns of Enterprise Application Architecture.

Database Guy Asks: Object-Oriented Design Theory?

Be careful some of the design patterns literature.

There are are several broad species of class definitions. Classes for persistent objects (which are like rows in relational tables) and collections (which are like the tables themselves) are one thing.

Some of the "Gang of Four" design patterns are more applicable to active, application objects, and less applicable to persistent objects. While you wrestle through something like Abstract Factory, you'll be missing some key points of OO design as it applies to persistent objects.

The Object Mentor What is Object-Oriented Design? page has mich of you really need to know to transition from relational design to OO design.

Normalization, BTW, isn't a blanket design principle that always applies to relational databases. Normalization applies when you have update transactions, to prevent update anomalies. It's a hack because relational databases are passive things; you either have to add processing (like methods in a class) or you have to pass a bunch of rules (normalization). In the data warehouse world, where updates are rare (or non-existent) that standard normalization rules aren't as relevant.

Consequently, there's no "normalize like this" for object data models.

In OO Design, perhaps the most important rule for designing persistent objects is the Single Responsibility Principle.

If you design your classes to have good fidelity to real-world objects, and you allocate responsibilities to those classes in a very focused way, you'll be happy with your object model. You'll be able to map it to a relational database with relatively few complexities.

Turns out, that when you look at things from a responsibility point of view, you find that 2NF and 3NF rules fit with sound responsibility assignment. Unique keys still matter. And derived data becomes the responsibility of a method function, not a persistent attribute.

How can an object-oriented programmer get his/her head around database-driven programming?

Linq to SQL using a table per class solution:

http://blogs.microsoft.co.il/blogs/bursteg/archive/2007/10/01/linq-to-sql-inheritance.aspx

Other solutions (such as my favorite, LLBLGen) allow other models. Personally, I like the single table solution with a discriminator column, but that is probably because we often query across the inheritance hierarchy and thus see it as the normal query, whereas querying a specific type only requires a "where" change.

All said and done, I personally feel that mapping OO into tables is putting the cart before the horse. There have been continual claims that the impedance mismatch between OO and relations has been solved... and there have been plenty of OO specific databases. None of them have unseated the powerful simplicity of the relation.

Instead, I tend to design the database with the application in mind, map those tables to entities and build from there. Some find this as a loss of OO in the design process, but in my mind the data layer shouldn't be talking high enough into your application to be affecting the design of the higher order systems, just because you used a relational model for storage.

Designing SQL Database to Represent Oo Class Hierarchy