Creating a Db Table Null Best Practices

Creating a db table NULL best practices

  1. Purely theoretically, a NULL is supposed to mean "unknown value". So - again, purely theoretically - you should design your tables when normalized so that you don't need to fill out NULL values to mean "not applicable for this row". However, this point has pretty much no relation to any practical consideration (design, performance, or query readability).

  2. Practically, there are some performance considerations. You should normalize away very sparse data in the following cases:

    • There is material benefit from shortening the table (both IO wise and/or space wise). NULLs do take space, and the wider the rows the worse the performance. This is especially true when the table has a LOT of rows and there are many such sparse columns. For smaller table with only 2 such columns the benefits realized might not be worth the trouble of having an extra join.

    • Your queries have the column in question in the WHERE clause. IIRC, querying on a heavily NULL-ed column is rather inefficient.

    • On the other hand, at certain point, having extra joins in the query might hurt the optimizer performance (at least it does so on Sybase once your joins have 10+ tables - from taking up CPU resources when optimizer runs to actually confusing the optimizer to pick a VERY bad plan). The solution is to avoid having too-many tables due to normalization (as in, don't bother splitting your 2 columns into a separate table), or forcing the query plan. The latter is obviously Bad Juju.

What is the best / correct way to deal with null values in a database when you have different types of related data

Consider the following.

  • My Favorite Dog Breed Is
    • Beagle
    • Basset Hound
    • Boxer
    • Other, please specify _____________

You can have an answer to a question option which also has a value.

Now consider that both your answer and your question option have their own references to the question. It is possible to have the answer and the answer's question options refer to different questions! This could be constrained with triggers, but it's better to eliminate the redundancy.

That redundancy is there because, as you've designed option 1, a "closed" answer has to refer to the question and an "open" answer will refer to the question via its options.


What I would do is make a "closed" answer to be a question option with no name. An I'd mark whether or not question options can have a value.

In addition, I would remove the enum from question. It is possible to have a question marked as closed and yet have multiple options. The question options are the single source of truth, look at them. If necessary later for optimization, you can add an update/delete/insert trigger on question_option to update the enum.

-- You didn't specify a database, so I'll use PostgreSQL.
create table users (
id bigserial primary key,
name text not null
);

create table questions (
id bigserial primary key,

-- Which is the text of the question? Consider question_text.
-- It's unclear what a description of a question is. Extra
-- explanatory text?
name text not null,
description text
);

create table question_options (
id bigserial primary key,
question_id bigint not null references questions,

-- name is an odd way to describe the text of
-- the option. Label?
-- I went with not null here because this forces
-- you to choose a label or an empty string.
label text not null,

-- Does the option accept a value?
type enum('with value', 'without value') not null
);

create table answers (
id bigserial primary key,
user_id bigint not null references users,
value text,
question_option_id bigint references question_options,

-- It's useful to know when an answer was given.
answered_at timestamp not null default current_timestamp(),

-- Presumably a user can't choose the same to the same question multiple times.
-- If they can, lift the constraint later.
unique(user_id, question_option_id),

constraint answers_filled_in_check check(
values is not null or question_option_id is not null
)
);

An "open question" has one option with a blank label which accepts a value. A closed question can have any number of options with any settings.

A design consideration I did not address: what if a user can answer a question multiple times and you want to store the history of their answers?

Is it good practice to set all database columns as NOT NULL?

No. It's a good idea to set columns to NULL where appropriate.

Optimal database structure - 'wider' table with empty fields or greater number of tables?

What is the more optimal database structure from the speed point of view?

Well, what is correct, best practice, etc, is called Normalisation. If you do that correctly, there will be no optional columns (not fields), no Nulls. The optional columns will be in a separate table, with fewer rows. Sure, you can arrange the tables so that they are sets of optional columns, rather than (one PK plus) one column each.

Combining the rows from the sub-tables into one 5NF row is easy, do that i a view (but do not update via the view, do that directly to each sub-table, via a transactional stored proc).

More, smaller tables, are the nature of a Normalised Relational database. Get used to it. Fewer, larger tables are slower, due to lack of normalisation, duplicates and Nulls. Joining is cumbersome in SQL< but that is all we have. There is no cost in joins themselves, only it the tables being joined (rows, row width, join columns, datatypes, mismatches, indices [or not] ). Databases are optimised for Normalised tables, not for data heaps. And large numbers of tables.

Which happens to be optimal re performance, no surprise. For two reasons:

  1. The tables are narrower, so there are more rows per page, you get more rows per physical I/O, and more rows in the same cache space.

  2. Since you have No Nulls, those columns are fixed len, no unpacking to extract the contents of the column.

There are no pros for large tables with many optional (null) columns, only cons. There never is a pro for breaching standards.

The answer is unchanged regardless of whether you are contemplating 4 or 400 new tables.

  • One recommendation if you are seriously considering that many tables: you are heading in the direction of Sixth Normal Form, without realising it. So realise it, and do so formally. The 400 tables will be much better controlled. If you get a professional to do it, they will normalise that, and end up back at less than 100.

Why should I avoid NULL values in a SQL database?

The NULL question is not simple... Every professional has a personal opinion about it.

Relational theory Two-Valued Logic (2VL: TRUE and FALSE) rejects NULL, and Chris Date is one of the most enemies of NULLs. But Ted Codd, instead, accepted Three-Valued Logic too (TRUE, FALSE and UNKNOWN).

Just a few things to note for Oracle:

  1. Single column B*Tree Indexes don't contain NULL entries. So the Optimizer can't use an Index if you code "WHERE XXX IS NULL".

  2. Oracle considers a NULL the same as an empty string, so:

    WHERE SOME_FIELD = NULL

    is the same as:

    WHERE SOME_FIELD = ''

Moreover, with NULLs you must pay attention in your queries, because every compare with NULL returns NULL.
And, sometimes, NULLs are insidious. Think for a moment to a WHERE condition like the following:

WHERE SOME_FIELD NOT IN (SELECT C FROM SOME_TABLE)

If the subquery returns one or more NULLs, you get the empty recordset!

These are the very first few cases that I want to talk about. But we can speak about NULLs for a lot of time...



Related Topics



Leave a reply



Submit