Single Fixed Table with Multiple Columns VS Flexible Abstract Tables

Certain issues need to be clarified and resolved before we can enter into a reasonable discussion.

Pre-requisite Resolution

  1. Labels

    In a profession that demands precision, it is important that we use precise labels, to avoid confusion, and so that we can communicate without having to use long-winded descriptions and qualifiers.

    What you have posted as FixedTables is Unnormalised. Fair enough, it may be an attempt at Third Normal Form, but in fact it is a flat file: Unnormalised (not "denormalised"). What you have posted as AbstractTables is, to be precise, Entity-Attribute-Value, which is almost, but not quite, Sixth Normal Form, and is therefore more Normalised than 3NF. Assuming it is done correctly, of course.

    • The Unnormalised flat file is not "denormalised". It is chock full of duplication (nothing has been done to remove repeating groups and duplicate columns, or to resolve dependencies) and Nulls; it is a performance hog in many ways; and it prevents concurrency.

    • In order to be Denormalised, it has to first be Normalised, and then the Normalisation backed off a little for some good reason. Since it is not Normalised in the first place, it cannot be Denormalised. It is simply Unnormalised.

    • It cannot be said to be denormalised "for performance", because, being a performance hog, it is the very antithesis of performance. Well, they need a justification for the lack of formalised design, and "for performance" is it. Even the smallest formal scrutiny exposes the misrepresentation (but very few people can provide such scrutiny, so it remains hidden, until they get an outsider to address, you guessed it, the massive performance problem).

    • Normalised structures perform far better than Unnormalised structures. More normalised structures (EAV/6NF) perform better than less normalised structures (3NF/5NF).

    • I am agreeing with the thrust of OMG Ponies' answer, but not with their labels and definitions.

    • rather than saying 'don't "denormalise" unless you have to', I am saying, 'Normalise faithfully, period' and 'if there is a performance problem, you have not Normalised correctly'.

  2. Wikipedia

    The entries for Normal Forms and Normalisation offer definitions that are incorrect; they confuse the Normal Forms; they are lacking regarding the process of Normalisation; and they give equal weight to absurd or questionable NFs which have been debunked long ago. The result is, Wikipedia adds to an already confused and rarely understood subject. So don't waste your time.

    However, in order to progress, without that reference posing a hindrance, let me say this.

    • The definition of 3NF is stable, and has not changed.
    • There is a lot of confusion about the NFs between 3NF and 5NF. The truth is, this is an area that has progressed over the last 15 years, and many organisations, academics, and vendors (with their limited products) jumped to create a new "Normal Form" to validate their offerings: all serving commercial interests, and academically unsound. 3NF in its original, untampered state intended and guaranteed certain attributes.
    • The sum total is, 5NF is today what 3NF was intended to be 15 years ago, and you can skip the commercial banter and the twelve or so "special" (commercial and pseudo-academic) NFs in between, some of which are identified in Wikipedia, and even that in confusing terms.
  3. Fifth Normal Form

    Since you have been able to understand and implement the EAV in your post, you will have no problem understanding the following. Of course, a true Relational Model is a pre-requisite: strong keys, etc. Fifth Normal Form is (since we are skipping the Fourth):

    • Third Normal Form
      • which in simple definitive terms is, every non-key column in every table has a 1::1 relationship to the Primary Key of the table,
      • and to no other non-key columns
    • Zero data duplication (the result, if Normalisation is progressed diligently; not achieved by intelligence or experience alone, or by working toward it as a goal without the formal process)
    • no Update Anomalies (when you update a column somewhere, you do not have to update the same column located somewhere else; the column exists in one and only one place).
    • If you understand the above, 4NF, BCNF, and all the silly "NFs" can be dismissed, they are required for physicalised Record Filing Systems, as promoted by academics, quite foreign to the Relational Model (Codd).
  4. Sixth Normal Form

    • The purpose is elimination of missing data (attribute columns), aka elimination of Nulls
    • This is the one true solution to the Null Problem (also called Handling Missing Values), and the result is a database without Nulls. (It can be done at 5NF with standards and Null substitutes but that is not optimal.) How you interpret and display the missing values is another story.
    • Technically, it is not a true Normal Form, because it does not have 5NF as a pre-requisite, but it has value
  5. EAV vs Sixth Normal Form

    All the databases I have written, except one, are pure 5NF. I have worked with (administered, fixed up, enhanced) a couple of EAV databases, and I have implemented many true 6NF databases. EAV is a loose implementation of 6NF, often done by people who do not have a good grasp on Normalisation and the NFs, but who can see the value in, and need the flexibility of, EAV. You are a perfect example.

    The difference is this: because it is loose, and because implementers do not have a reference (6NF) to be faithful to, they only implement what they need, and they write it all in code; that ends up being an inconsistent model.

    Whereas, a pure 6NF implementation does have a pure academic reference point, and thus it is usually tighter, and consistent. Typically this shows up in two visible elements:

    • 6NF has a catalogue to contain metadata, and everything is defined in metadata, not code. EAV does not have one, everything is in code (implementers keep track of the objects and attributes). Obviously a catalogue eases the addition of columns, navigation, and allows utilities to be formed.
    • 6NF when understood, provides the true solution to The Null Problem. EAV implementers, since they are absent the 6NF context, handle missing data in code, inconsistently, or worse, allow Nulls in the database. 6NF implementers disallow Nulls, and handle missing Data consistently and elegantly, without requiring code constructs (for Null handling; you still have to code for missing data of course).

Eg. For 6NF databases with a catalogue, I have a set of procs that will [re]generate the SQL required to perform all SELECTs, and I provide Views in 5NF for all users, so they do not need to know or understand the underlying 6NF structure. They are driven off the catalogue. Thus changes are easy and automated. EAV types do that manually, due to the absence of the catalogue.
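
To make that concrete, here is a minimal 6NF sketch (all table and column names here are mine, purely illustrative, not from the post): each attribute gets its own table keyed on the entity's Primary Key, so a missing value is an absent row rather than a Null, and a 5NF View, of the kind generated from the catalogue, re-assembles the row for users.

    -- Minimal 6NF sketch (illustrative names only; a real catalogue would generate this).
    -- Each attribute lives in its own table, keyed on the entity PK;
    -- a missing value is an absent row, never a Null.
    CREATE TABLE Person (
        PersonId    INT          NOT NULL PRIMARY KEY,
        Name        VARCHAR(60)  NOT NULL
    );

    CREATE TABLE PersonBirthDate (
        PersonId    INT   NOT NULL PRIMARY KEY REFERENCES Person (PersonId),
        BirthDate   DATE  NOT NULL
    );

    CREATE TABLE PersonSpeciality (
        PersonId    INT          NOT NULL PRIMARY KEY REFERENCES Person (PersonId),
        Speciality  VARCHAR(30)  NOT NULL
    );

    -- The 5NF View presented to users, so they never see the 6NF structure.
    -- (Missing values surface only in the View's projection, not in the base tables.)
    CREATE VIEW Person_5NF AS
    SELECT  p.PersonId,
            p.Name,
            bd.BirthDate,
            sp.Speciality
    FROM    Person p
            LEFT JOIN PersonBirthDate  bd ON bd.PersonId = p.PersonId
            LEFT JOIN PersonSpeciality sp ON sp.PersonId = p.PersonId;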

Discussion

Now, we can start the discussion.

"Of course it can be more abstract if
value's are predefined (Example:
specialities could have their own
list)"

Sure. But do not get too "abstract". Maintain consistency and implement such lists in the same EAV (or 6NF) manner as other lists.

"If I take the abstract approach it
can be very flexible, but queries will
be more complex with a lot of joins.
But I don't know if this affects the
performance, executing these 'more
complex' queries."

  1. Joins are pedestrian in relational databases. The problem is not the database, the problem is that SQL is cumbersome when handling joins, especially compound keys.

  2. EAV and 6NF databases have more Joins, which are just as pedestrian, no more, no less. If you have to code each SELECT manually, sure, the cumbersome gets really cumbersome.

  3. The entire problem can be eliminated by (a) going with 6NF over EAV and (b) implementing a catalogue, from which you can (c) generate all the basic SQL. Eliminates an entire class of errors as well.

  4. It is a common myth that Joins somehow have a cost. Totally false.

    • The join is implemented at compile time, there is nothing of substance to 'cost' CPU cycles.
    • The issue is the size of tables being joined, not the cost of the Join between those same tables.
    • Joining two tables with millions of rows each, on a correct PK⇢FK relation, each of which has the appropriate indices (Unique on the parent [PK] side; Unique on the Child side [PK = parent FK + something]), is instantaneous (see the sketch after this list).
    • Where the Child index is not unique, but at least the leading columns are valid, it is slower; where there is no useful index, of course it is very slow.
    • None of it has to do with Join cost.
    • Where many rows are returned, the bottleneck will be the network and the disk layout; not the join processing.
  5. Therefore you can get as "complex" as you like, there is no cost, SQL can handle it.
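
As a rough illustration of point 4 above, here is a minimal sketch (the table names, columns, and types are my own assumptions, not from the post): the parent carries a Unique index on its PK, the Child's PK leads with the parent's FK, and the Join is resolved entirely through those indices.

    -- Sketch of the indexing described in point 4 (illustrative names and types).
    CREATE TABLE Customer (
        CustomerId  INT          NOT NULL,
        Name        VARCHAR(60)  NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY (CustomerId)                  -- Unique on the parent [PK] side
    );

    CREATE TABLE CustomerOrder (
        CustomerId  INT   NOT NULL,
        OrderNo     INT   NOT NULL,
        OrderDate   DATE  NOT NULL,
        CONSTRAINT PK_CustomerOrder PRIMARY KEY (CustomerId, OrderNo),   -- Child side: parent FK + something
        CONSTRAINT FK_CustomerOrder_Customer
            FOREIGN KEY (CustomerId) REFERENCES Customer (CustomerId)
    );

    -- With both indices in place, the Join over millions of rows is driven by the
    -- indices; the Join itself adds nothing measurable.
    SELECT  c.Name, o.OrderNo, o.OrderDate
    FROM    Customer c
            JOIN CustomerOrder o ON o.CustomerId = c.CustomerId
    WHERE   c.CustomerId = 12345;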

I would be interested to know what are
the up and downsides of both methods.
I can just imagine for myself, but I
don't have the experience to confirm
this.

  1. 5NF (or 3NF for those who have not made the progression) is the easiest and best, in terms of implementation; ease of use (developers as well as users); and maintenance.

    • The drawback is, every time you add a column, you have to change the database structure (table DDL). That is fine in some cases, but not in most, due to the change control in place; it can be quite onerous.
    • Second, you have to change existing code (code handling the new column does not count, because that is an imperative): where good standards are implemented, that is minimised; where they are absent, the scope is unpredictable.
  2. EAV (which is what you have posted) allows columns to be added without DDL changes. That is the single reason people choose it. (Code handling the new column does not count, because that is an imperative.) If implemented well, it will not affect existing code; if not, it will.

  3. But you need EAV-capable developers.

    • When EAV is implemented badly, it is abominable, a worse mess than 5NF done badly, but not any worse than Unnormalised which is what most databases out there are (misrepresented as "denormalised for performance").
    • Of course, it is even more important (than in 5NF/3NF) to hold a strong Transaction context, because the columns are far more distributed.
    • Likewise, it is essential to retain Declarative Referential Integrity: the messes I have seen were due in large part to the developers removing DRI because it became "too hard to maintain", the result was, as you can imagine, one mother of a data heap with duplicate 3NF/5NF rows and columns all over the place. And inconsistent Null handling.
  4. There is no difference in performance, assuming that the server has been reasonably configured for the intended purpose. (Ok, there are specific optimisations that are possible only in 6NF, which are not possible in other NFs, but I think that is outside the scope of this thread.) And again, EAV done badly can cause unnecessary bottlenecks, no more so than Unnormalised.

  5. Of course, if you go with EAV, I am recommending more formality; buy the full quid; go with 6NF; implement a catalogue; utilities to produce SQL; Views; handle Missing Data consistently; eliminate Nulls altogether. This reduces your vulnerability to the quality of your developers; they can forget about the EAV/6NF esoteric issues, use Views, and concentrate on the app logic.

Database: Storing multiple Types in single table or multiple intermediate tables for Delta Tables

Good database design has separate tables for different concepts. Using the same database column to hold different types of data will lead to code which is harder to understand, prone to data corruption and less performant.

You may think it's only two tables and a few tens of thousands of rows, so does it matter? But that is only your current requirement. What you choose now will set the template for what happens when (say) you need to add telephone numbers to the process.


Now in future if we get 5 more entities to update

Do you mean "entities", like say Customers rather than Employees? Or do you really mean "attributes" as in my example of Employee Telephone Number?

Generally speaking we have a separate table for distinct entities, and all the attributes of that entity are grouped at the same cardinality. To take your example, I would expect an Employee to have one UserID and one Email Address so I would design the table like this:

Employee_audit
EmpId|UserID|EmailID|Updatedate

That is, I have one record which stores the complete state of the Employee record at the Updatedate.

If we add a new entity, Customers, then we have a new table. Simple. But a new attribute like Employee Phone Number offers a choice, because an employee can have more than one: work landline, mobile, fax, home, etc. So we could represent this in three ways: a child table with a type column, multiple child tables for each type, or as distinct columns on the Employee record.

For the main Employee table I would choose the separate table (or tables, depending on whether I'm shooting for 6NF). But for an audit table I would choose one record per Employee and pivot the phone numbers like this:

Employee_audit
EmpId|UserID|EmailID|Landline|Mobile|Fax|Home|Updatedate

The one thing I would never do is have a single table with type and value columns. It seems attractive because it means we could track additional entities without any further DDL. But in fact it becomes harder to re-assemble the complete state of an Employee at any given time with each attribute we add. Also it means the auditing process itself is more complicated (because it needs to determine which attributes have changed and whether it needs to audit the change) and more expensive (because changing three attributes on the same record entails inserting three audit records).
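
A possible DDL sketch of that audit table (the data types and the composite key are my assumptions):

    -- Sketch of the one-row-per-change audit table above (types assumed).
    CREATE TABLE Employee_audit (
        EmpId       INT           NOT NULL,
        UserID      VARCHAR(30)   NOT NULL,
        EmailID     VARCHAR(254)  NOT NULL,
        Landline    VARCHAR(20)   NULL,
        Mobile      VARCHAR(20)   NULL,
        Fax         VARCHAR(20)   NULL,
        Home        VARCHAR(20)   NULL,
        Updatedate  DATETIME      NOT NULL,
        PRIMARY KEY (EmpId, Updatedate)   -- one complete state per employee per change
    );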

When two tables are very similar, when should they be combined?

Comments, PhotoComments, and EventComments are related in a pattern called "generalization specialization". This pattern is handled by simple inheritance in object oriented languages. It's a little more intricate to set up a schema of tables that will capture the same pattern.

But it's well understood. A quick google search on "generalization specialization relational modeling" will give you several good articles on the subject.
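
For instance, one common relational rendering of the pattern (a sketch only; table and column names are illustrative) is a supertype Comments table with subtype tables that share its key:

    -- Generalization/specialization sketch: Comments is the supertype,
    -- PhotoComments and EventComments are subtypes sharing its primary key.
    CREATE TABLE Comments (
        CommentId  INT            NOT NULL PRIMARY KEY,
        Body       VARCHAR(2000)  NOT NULL,
        CreatedAt  DATETIME       NOT NULL
    );

    CREATE TABLE PhotoComments (
        CommentId  INT NOT NULL PRIMARY KEY REFERENCES Comments (CommentId),
        PhotoId    INT NOT NULL
    );

    CREATE TABLE EventComments (
        CommentId  INT NOT NULL PRIMARY KEY REFERENCES Comments (CommentId),
        EventId    INT NOT NULL
    );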

Data model, saving product properties in a separate table?

If you want to have a normalized schema, you could have a product_type table listing all these properties a product can have.

product_type table:

  • type_id
  • type_name
  • type_property_1
  • type_property_2
  • etc.

and the product table:

  • product_id
  • type_id
  • etc.

The benefit of having a separate type table is that it will probably be small, hence it will be relatively easy to make changes like adding new properties in the future. It will also save disk space.

The benefit of not having a separate type table is that you do not have to join two tables just to list products with their properties; that is, query performance will be better.
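
A sketch of the normalized variant described above (the data types, and any column not listed, are assumptions on my part):

    -- Sketch of the product_type / product split (types assumed).
    CREATE TABLE product_type (
        type_id          INT          NOT NULL PRIMARY KEY,
        type_name        VARCHAR(60)  NOT NULL,
        type_property_1  VARCHAR(60)  NULL,
        type_property_2  VARCHAR(60)  NULL
    );

    CREATE TABLE product (
        product_id  INT NOT NULL PRIMARY KEY,
        type_id     INT NOT NULL REFERENCES product_type (type_id)
    );

    -- Listing products with their properties requires the join mentioned above.
    SELECT p.product_id, t.type_name, t.type_property_1, t.type_property_2
    FROM   product p
           JOIN product_type t ON t.type_id = p.type_id;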

Storing entities with dynamic set of properties in a table (and using fixed columns or key-value tables)

What database are you using?

One solution (if your DB supports it) could be storing the person's information in an XML format.
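
A minimal sketch, assuming SQL Server's XML data type (other engines offer similar XML or JSON column types; the table and column names are mine):

    -- Sketch assuming SQL Server's XML type (illustrative names).
    CREATE TABLE Person (
        PersonId  INT  NOT NULL PRIMARY KEY,
        Info      XML  NULL   -- the dynamic set of properties, stored as one XML document
    );

    INSERT INTO Person (PersonId, Info)
    VALUES (1, '<person><speciality>Cardiology</speciality><phone>555-0100</phone></person>');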

Database design with multiple nested tables

It looks to me like you're trying to implement a sort of EAV (Entity-Attribute-Value) design.

Your table seems ok, but this design inherently requires complicated SQL.

There are different methods of doing this, but based on your tables above I would go with something like this:

   Subject --< Process --< RelationshipProcessProperty >-- Property

Your property would simply look like this:

    "Property"
PK PropertyId
Name

Your RelationshipProcessProperty could look like this:

    "RelationshiipProcessProperty"
PK RelationshipProcessProperty
FK Process
FK Property
Value

Your SQL is where it would get complicated. Doing a 'generic' design like this has its implications, as you're looking for multiple values in the same table.

    -- One CTE per property; each resolves a single Property value for the Process.
    ; WITH Property1 AS (
        SELECT
            proc.ProcessId AS ProcessId,
            prop.Name,
            rpp.Value
        FROM Subject s
        LEFT JOIN Process proc
            ON s.SubjectId = proc.SubjectId
        LEFT JOIN RelationshipProcessProperty rpp
            ON proc.ProcessId = rpp.ProcessId
        LEFT JOIN Property prop
            ON rpp.PropertyId = prop.PropertyId
        WHERE s.Name = 'Subject1'
          AND proc.Name = 'Process1'
          AND prop.Name = 'Property1'
    )
    , Property2 AS (
        SELECT
            proc.ProcessId AS ProcessId,
            prop.Name,
            rpp.Value
        FROM Subject s
        LEFT JOIN Process proc
            ON s.SubjectId = proc.SubjectId
        LEFT JOIN RelationshipProcessProperty rpp
            ON proc.ProcessId = rpp.ProcessId
        LEFT JOIN Property prop
            ON rpp.PropertyId = prop.PropertyId
        WHERE s.Name = 'Subject1'
          AND proc.Name = 'Process1'
          AND prop.Name = 'Property2'
    )
    SELECT
        p1.Name,
        p1.Value,
        p2.Name,
        p2.Value
    FROM Property1 p1
    LEFT JOIN Property2 p2
        ON p1.ProcessId = p2.ProcessId

You can use this method to get multiple properties for the same process.

In order to have specified properties for a specified process, you would need to create Process Type Table:

    "ProcessType"
PK ProcessTypeId
Type

And this does mean that you will need to add a foreign key to your process table to link it to which type it is. You can then link your ProcessType table to your Property table with a relationship table that defines all of the available types.

    "EligibleProcessProperties"
PK EligibleProcessPropertiesId
FK ProcessTypeId
FK PropertyId
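
The foreign key addition to the Process table mentioned above might look roughly like this (column names are assumed):

    -- Sketch: link each Process row to its ProcessType (column names assumed).
    ALTER TABLE Process
        ADD ProcessTypeId INT NULL;   -- nullable while existing rows are back-filled

    ALTER TABLE Process
        ADD CONSTRAINT FK_Process_ProcessType
            FOREIGN KEY (ProcessTypeId) REFERENCES ProcessType (ProcessTypeId);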

Then, to find out all of the available properties for that process type, you would have a relatively simple query:

    SELECT
        p.Name
    FROM ProcessType pt
    LEFT JOIN EligibleProcessProperties epp
        ON pt.ProcessTypeId = epp.ProcessTypeId
    LEFT JOIN Property p
        ON epp.PropertyId = p.PropertyId
    WHERE pt.Type = 'Type1'

I think that this is the sort of thing you're looking for (though I could be completely off). If this is what you're looking for, there's a really good post here that makes some good points.

Also, I'm almost 100% sure there are better ways to do my long ';with' query, but this is all I know. Hopefully someone else can provide a better one. The point is that with this design, you will need sub-queries one way or another.

Multiple columns or single textfield?

The common way to do this is to use foreign_key and model fields in your printouts table (more details throughout this page):

http://book.cakephp.org/2.0/en/models/associations-linking-models-together.html#associations-linking-models-together

'foreign_key' is the id of the "other" model, and 'model' is the name of the other model.

So your Printout belongsTo all 3 other models and in the associations, you specify the condition of 'Printout.model' => 'Room' (or whatever model).
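
In the schema, that usually comes down to something like this (a sketch assuming MySQL, which CakePHP commonly uses; only foreign_key and model come from the answer, the remaining columns are assumed):

    -- Sketch of the polymorphic printouts table (only foreign_key and model
    -- are from the answer; the other columns are assumed).
    CREATE TABLE printouts (
        id           INT          NOT NULL AUTO_INCREMENT PRIMARY KEY,
        model        VARCHAR(50)  NOT NULL,  -- name of the "other" model, e.g. 'Room'
        foreign_key  INT          NOT NULL,  -- id of the row in that model's table
        created      DATETIME     NULL
    );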

MySQL performance: multiple tables vs. index on single table and partitions

Creating 20,000 tables is a bad idea. You'll need 40,000 tables before long, and then more.

I called this syndrome Metadata Tribbles in my book SQL Antipatterns Volume 1. You see this happen every time you plan to create a "table per X" or a "column per X".

This does cause real performance problems when you have tens of thousands of tables. Each table requires MySQL to maintain internal data structures, file descriptors, a data dictionary, etc.

There are also practical operational consequences. Do you really want to create a system that requires you to create a new table every time a new user signs up?

Instead, I'd recommend you use MySQL Partitioning.

Here's an example of partitioning the table:

CREATE TABLE statistics (
    id      INT AUTO_INCREMENT NOT NULL,
    user_id INT NOT NULL,
    PRIMARY KEY (id, user_id)
) PARTITION BY HASH(user_id) PARTITIONS 101;

This gives you the benefit of defining one logical table, while also dividing the table into many physical tables for faster access when you query for a specific value of the partition key.

For example, when you run a query like your example, MySQL accesses only the partition containing the specific user_id:

mysql> EXPLAIN PARTITIONS SELECT * FROM statistics WHERE user_id = 1\G
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: statistics
partitions: p1 <--- this shows it touches only one partition
type: index
possible_keys: NULL
key: PRIMARY
key_len: 8
ref: NULL
rows: 2
Extra: Using where; Using index

The HASH method of partitioning means that the rows are placed in a partition by a modulus of the integer partition key. This does mean that many user_id's map to the same partition, but each partition would have only 1/Nth as many rows on average (where N is the number of partitions). And you define the table with a constant number of partitions, so you don't have to expand it every time you get a new user.

You can choose any number of partitions up to 1024 (or 8192 in MySQL 5.6), but some people have reported performance problems when they go that high.

It is recommended to use a prime number of partitions. If your user_id values follow a pattern (like using only even numbers), a prime number of partitions helps distribute the data more evenly.
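
A quick way to see why the prime matters (purely illustrative arithmetic): HASH placement is just MOD(partition_key, number_of_partitions), so with an even-only user_id and 100 partitions the odd-numbered partitions would never be used, whereas 101 still reaches all of them.

    -- HASH placement is MOD(partition_key, number_of_partitions).
    -- With even-only user_id values, 100 partitions leave every odd partition empty;
    -- 101 (a prime) still reaches all partitions.
    SELECT MOD(2 * 4711, 100) AS partition_with_100,  -- always an even number
           MOD(2 * 4711, 101) AS partition_with_101;  -- any value 0..100 is possible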


Re your questions in comment:

How could I determine a reasonable number of partitions?

For HASH partitioning, if you use 101 partitions like I show in the example above, then any given partition has about 1% of your rows on average. You said your statistics table has 30 million rows, so if you use this partitioning, you would have only 300k rows per partition. That is much easier for MySQL to read through. You can (and should) use indexes as well -- each partition will have its own index, and it will be only 1% as large as the index on the whole unpartitioned table would be.

So the answer to how can you determine a reasonable number of partitions is: how big is your whole table, and how big do you want the partitions to be on average?

Shouldn't the amount of partitions grow over time? If so: How can I automate that?

The number of partitions doesn't necessarily need to grow if you use HASH partitioning. Eventually you may have 30 billion rows total, but I have found that when your data volume grows by orders of magnitude, that demands a new architecture anyway. If your data grow that large, you probably need sharding over multiple servers as well as partitioning into multiple tables.

That said, you can re-partition a table with ALTER TABLE:

ALTER TABLE statistics PARTITION BY HASH(user_id) PARTITIONS 401;

This has to restructure the table (like most ALTER TABLE changes), so expect it to take a while.

You may want to monitor the size of data and indexes in partitions:

SELECT table_schema, table_name, table_rows, data_length, index_length
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE partition_method IS NOT NULL;

Like with any table, you want the total size of active indexes to fit in your buffer pool, because if MySQL has to swap parts of indexes in and out of the buffer pool during SELECT queries, performance suffers.

If you use RANGE or LIST partitioning, then adding, dropping, merging, and splitting partitions is much more common. See http://dev.mysql.com/doc/refman/5.6/en/partitioning-management-range-list.html

I encourage you to read the manual section on partitioning, and also check out this nice presentation: Boost Performance With MySQL 5.1 Partitions.


