How to Avoid Nulls in My Database, While Also Representing Missing Data

How can I avoid NULLs in my database, while also representing missing data?

Good on you, for eliminating Nulls. I have never allowed Nulls in any of my databases.

Of course, if nulls are prohibited, then missing information will have to be handled by some other means. Unfortunately, those other means are much too complex to be discussed in detail here.

Actually it is not so hard at all. There are three alternatives.

  1. Here's a paper, How To Handle Missing Information Without Using NULL by H Darwen, that may help you get your head around the problem.

    1.1. Sixth Normal Form is the answer. But you do not have to normalise your entire database to 6NF. For each column that is optional, you need a child table off the main table, with just the PK, which is also the FK, because it is a 1::0-1 relation. Other than the PK, the only column is the optional column.

    Look at this Data Model; AssetSerial on page 4 is a classic case: not all Assets have SerialNumbers; but when they do, I want to store them; more important, I want to ensure that they are Unique.

    (For the OO people out there, incidentally, that is a three level class diagram in Relational notation, a "Concrete Table Inheritance", no big deal, we've had it for 30 years.)

    1.2. For each such table, use a View to provide the 5NF form of the table. Sure, use Null (or any value that is appropriate for the column) to identify the absence of the column for any row. But do not update via the view.

    1.3. Do not use straight joins to grab the 6NF column. Do not use outer joins, either (and have the server fill in a Null for the missing rows). Use a subquery to populate the column, and specify the value that you want returned for a missing value (except if you have Oracle, because its Subquery processing is even worse than its set processing). E.g., and just an e.g., you can convert a numeric column to string, and use "Missing" for the missing rows.
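    Steps 1.1 to 1.3 can be sketched roughly as follows. This is a minimal illustration using Python's built-in sqlite3 module, not the data model referred to above: the AssetSerial name comes from the text, while Asset, Asset_V, and all column names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE Asset (
        AssetId INTEGER PRIMARY KEY,
        Name    TEXT NOT NULL
    );

    -- 6NF child table: the PK is also the FK (1::0-1),
    -- and the only other column is the optional one.
    CREATE TABLE AssetSerial (
        AssetId      INTEGER PRIMARY KEY REFERENCES Asset (AssetId),
        SerialNumber TEXT NOT NULL UNIQUE
    );

    -- View giving the 5NF shape. A scalar subquery populates the column,
    -- and COALESCE names the value returned when there is no child row.
    CREATE VIEW Asset_V AS
    SELECT a.AssetId,
           a.Name,
           COALESCE(
               (SELECT s.SerialNumber
                FROM AssetSerial s
                WHERE s.AssetId = a.AssetId),
               'Missing'
           ) AS SerialNumber
    FROM Asset a;

    INSERT INTO Asset VALUES (1, 'Laptop'), (2, 'Desk');
    INSERT INTO AssetSerial VALUES (1, 'SN-001');  -- the Desk has no serial
""")

rows = conn.execute(
    "SELECT AssetId, Name, SerialNumber FROM Asset_V ORDER BY AssetId"
).fetchall()
print(rows)
```

    Neither base table has a nullable column, and the UNIQUE constraint on SerialNumber applies only to the Assets that actually have one.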

When you do not want to go that far (6NF), you have two more options.


  2. You can use Null substitutes. I use CHAR(0) for character columns and 0 for numeric. But I do not allow that for FKs. Obviously you need a value that is outside the normal range of data. This does not allow Three Valued Logic.

  3. In addition to (2), for each Nullable column, you need a boolean Indicator. For the example of the Sex column, the Indicator would be something like SexIsMissing or SexLess (sorry). This allows very tight Three Valued Logic. Many people in that 5% like it because the db remains at 5NF (and fewer tables); the columns with missing info are loaded with values that are never used; they are only used if the Indicator is false. If you have an enterprise db, you can wrap that in a Function, and always use the UDF, not the raw column.
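The substitute-plus-Indicator option might look something like the sketch below (Python with the built-in sqlite3 module). SexIsMissing is the name used above; the Person table, the 'F'/'M' domain, and the sex_of helper standing in for the UDF are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Person (
        PersonId     INTEGER PRIMARY KEY,
        Name         TEXT NOT NULL,
        Sex          TEXT NOT NULL DEFAULT '',   -- Null substitute, never NULL
        SexIsMissing INTEGER NOT NULL DEFAULT 1
                     CHECK (SexIsMissing IN (0, 1)),
        -- Keep the substitute value and the Indicator consistent.
        CHECK ((SexIsMissing = 1 AND Sex = '')
            OR (SexIsMissing = 0 AND Sex IN ('F', 'M')))
    );
    INSERT INTO Person (PersonId, Name, Sex, SexIsMissing) VALUES
        (1, 'Ann', 'F', 0),
        (2, 'Bob', 'M', 0),
        (3, 'Cat', '',  1);
""")


def sex_of(person_id):
    """Stand-in for the UDF: check the Indicator before using the value."""
    row = conn.execute(
        "SELECT CASE WHEN SexIsMissing = 1 THEN 'Missing' ELSE Sex END "
        "FROM Person WHERE PersonId = ?",
        (person_id,),
    ).fetchone()
    return row[0]


# "Known not to be F" must test the Indicator as well as the value.
not_female = [
    name
    for (name,) in conn.execute(
        "SELECT Name FROM Person "
        "WHERE SexIsMissing = 0 AND Sex <> 'F' ORDER BY PersonId"
    )
]
print(sex_of(3), not_female)
```

Every query that touches Sex has to consult SexIsMissing first, which is why wrapping the pair in one function is attractive.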

Of course, in all cases, you can never get away from writing the code that is required to handle the missing info, whether it is ISNULL(), or a subquery for the 6NF column, or an Indicator to check before using the value, or a UDF.

If Null has a specific meaning ... then it is not a Null! By definition, Null is the Unknown Value.

Why should I avoid NULL values in a SQL database?

The NULL question is not simple... Every professional has a personal opinion about it.

Relational theory based on Two-Valued Logic (2VL: TRUE and FALSE) rejects NULL, and Chris Date is one of the fiercest opponents of NULLs. Ted Codd, on the other hand, also accepted Three-Valued Logic (TRUE, FALSE and UNKNOWN).

Just a few things to note for Oracle:

  1. Single column B*Tree Indexes don't contain NULL entries. So the Optimizer can't use an Index if you code "WHERE XXX IS NULL".

  2. Oracle considers a NULL the same as an empty string, so:

    WHERE SOME_FIELD = NULL

    is the same as:

    WHERE SOME_FIELD = ''

    Neither condition is ever TRUE, so both return no rows; test with IS NULL instead.

Moreover, with NULLs you must pay attention in your queries, because every comparison with NULL returns NULL (UNKNOWN).
And, sometimes, NULLs are insidious. Consider for a moment a WHERE condition like the following:

WHERE SOME_FIELD NOT IN (SELECT C FROM SOME_TABLE)

If the subquery returns one or more NULLs, you get the empty recordset!
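
The effect is easy to reproduce. The sketch below uses Python's built-in sqlite3 module rather than Oracle; SOME_FIELD, SOME_TABLE and C are the names from the condition above, while MAIN_TABLE and the sample rows are made up for the example. A NOT EXISTS rewrite is shown as one common way around the problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MAIN_TABLE (SOME_FIELD INTEGER);
    CREATE TABLE SOME_TABLE (C INTEGER);
    INSERT INTO MAIN_TABLE VALUES (1), (2), (3);
    INSERT INTO SOME_TABLE VALUES (1), (NULL);
""")

# A comparison with NULL yields NULL (None in Python), not TRUE or FALSE.
null_eq = conn.execute("SELECT NULL = NULL").fetchone()[0]

# One NULL in the subquery result makes NOT IN unknown for every
# non-matching row, so no row passes the WHERE clause.
not_in = conn.execute(
    "SELECT SOME_FIELD FROM MAIN_TABLE "
    "WHERE SOME_FIELD NOT IN (SELECT C FROM SOME_TABLE)"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL is harmless.
not_exists = conn.execute(
    "SELECT SOME_FIELD FROM MAIN_TABLE m "
    "WHERE NOT EXISTS (SELECT 1 FROM SOME_TABLE s WHERE s.C = m.SOME_FIELD) "
    "ORDER BY SOME_FIELD"
).fetchall()

print(null_eq, not_in, not_exists)
```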

These are just the first few cases that I want to talk about. But we could speak about NULLs for a long time...

Standard use of 'Z' instead of NULL to represent missing data?

Sack your contractor.

Okay, seriously, this isn't standard practice. This can be seen simply because all RDBMS that I have ever worked with implement NULL, logic for NULL, take account of NULL in foreign keys, have different behaviour for NULL in COUNT, etc, etc.

I would actually contend that using 'Z' or any other place holder is worse. You still require code to check for 'Z'. But you also need to document that 'Z' doesn't mean 'Z', it means something else. And you have to ensure that such documentation is read. And then what happens if 'Z' ever becomes a valid piece of data? (Such as a field for an initial?)

At a basic level, even without debating the validity of NULL vs 'Z', I would insist that the contractor conforms to the standard practices that exist within your company, not his. Instituting his standard practice in an environment with an alternative standard practice will cause confusion, maintenance overheads, misunderstanding, and in the end increased costs and mistakes.


EDIT

There are cases where using an alternative to NULL is valid in my opinion. But only where doing so reduces code, rather than creating special cases which require accounting for.

I've used that for date bound data, for example. If data is valid between a start-date and an end-date, code can be simplified by not having NULL values. Instead a NULL start-date could be replaced with '01 Jan 1900' and a NULL end-date could be replaced with '31 Dec 2079'.

This still can change behaviour from what may be expected, and so should be used with care:

  • WHERE end-date IS NULL no longer gives data that is still valid
  • You just created your own millennium bug
  • etc.

This is equivalent to reforming abstractions such that all properties can always have valid values. It is markedly different from implicitly encoding specific meaning into arbitrarily chosen values.
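
As a rough sketch of the date-bound case (Python with the built-in sqlite3 module), the table below stores the two boundary dates mentioned above in place of NULL. The price table, its columns, and the price_on helper are assumptions for illustration; dates are held as ISO-8601 text so that plain string comparison orders them correctly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE price (
        item       TEXT NOT NULL,
        amount     INTEGER NOT NULL,
        start_date TEXT NOT NULL DEFAULT '1900-01-01',  -- instead of NULL
        end_date   TEXT NOT NULL DEFAULT '2079-12-31'   -- instead of NULL
    );
    INSERT INTO price (item, amount, start_date, end_date) VALUES
        ('widget', 10, '1900-01-01', '2020-12-31'),
        ('widget', 12, '2021-01-01', '2079-12-31');
""")


def price_on(item, day):
    """Return the amount valid on `day`, or None if no row covers it."""
    row = conn.execute(
        "SELECT amount FROM price "
        "WHERE item = ? AND ? BETWEEN start_date AND end_date",
        (item, day),
    ).fetchone()
    return row[0] if row else None


print(price_on("widget", "2000-01-01"))  # covered by the first row
print(price_on("widget", "2024-06-01"))  # covered by the open-ended row
print(price_on("widget", "2080-01-01"))  # past the sentinel: nothing matches
```

The range test needs no IS NULL branches, but the last call shows the home-made millennium bug: any date past the sentinel silently finds nothing.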

Still, sack the contractor.

What to use instead of Null if no data is present in SQL?

NULL is a perfectly fine value to use. If you're worried about defaults, make sure to use an OUTER JOIN and (for SQL Server, anyway) you can do something like:

SELECT user_table.name,
       COALESCE(preferences.color_preference, 'DEFAULT_VALUE')
FROM user_table
LEFT OUTER JOIN preferences ON user_table.id = preferences.id;

This type of query allows you to set a default while still storing NULL as the preferred color. You get the default if color_preference is NULL or if there is no row in the preferences table.
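
Both situations can be checked with a small runnable sketch (Python's built-in sqlite3 module here rather than SQL Server; COALESCE and LEFT OUTER JOIN behave the same way for this query). The table and column names are the ones used in the query above; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_table (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE preferences (
        id               INTEGER PRIMARY KEY REFERENCES user_table (id),
        color_preference TEXT
    );
    INSERT INTO user_table VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    -- ann has a color, bob has a row with NULL, cat has no row at all.
    INSERT INTO preferences VALUES (1, 'green'), (2, NULL);
""")

rows = conn.execute("""
    SELECT user_table.name,
           COALESCE(preferences.color_preference, 'DEFAULT_VALUE')
    FROM user_table
    LEFT OUTER JOIN preferences ON user_table.id = preferences.id
    ORDER BY user_table.id
""").fetchall()
print(rows)
```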

You say "what would the FC column be filled with if the user chose to not choose a color?". I ask, why do you care? They either have a favorite color specified or do not. Do you care if they had a choice of specifying a favorite color but did not tell you?

What is the best / correct way to deal with null values in a database when you have different types of related data

Consider the following.

  • My Favorite Dog Breed Is
    • Beagle
    • Basset Hound
    • Boxer
    • Other, please specify _____________

You can have an answer to a question option which also has a value.

Now consider that both your answer and your question option have their own references to the question. It is possible to have the answer and the answer's question options refer to different questions! This could be constrained with triggers, but it's better to eliminate the redundancy.

That redundancy is there because, as you've designed option 1, a "closed" answer has to refer to the question and an "open" answer will refer to the question via its options.


What I would do is make a "closed" answer a question option with no name. And I'd mark whether or not question options can have a value.

In addition, I would remove the enum from question. It is possible to have a question marked as closed and yet have multiple options. The question options are the single source of truth, look at them. If necessary later for optimization, you can add an update/delete/insert trigger on question_option to update the enum.

-- You didn't specify a database, so I'll use PostgreSQL.
create table users (
    id bigserial primary key,
    name text not null
);

create table questions (
    id bigserial primary key,

    -- Which is the text of the question? Consider question_text.
    -- It's unclear what a description of a question is. Extra
    -- explanatory text?
    name text not null,
    description text
);

-- PostgreSQL has no inline enum(...) column type; create the type first.
create type question_option_type as enum ('with value', 'without value');

create table question_options (
    id bigserial primary key,
    question_id bigint not null references questions,

    -- name is an odd way to describe the text of
    -- the option. Label?
    -- I went with not null here because this forces
    -- you to choose a label or an empty string.
    label text not null,

    -- Does the option accept a value?
    type question_option_type not null
);

create table answers (
    id bigserial primary key,
    user_id bigint not null references users,
    value text,
    question_option_id bigint references question_options,

    -- It's useful to know when an answer was given.
    answered_at timestamp not null default current_timestamp,

    -- Presumably a user can't choose the same option for the same question multiple times.
    -- If they can, lift the constraint later.
    unique(user_id, question_option_id),

    constraint answers_filled_in_check check(
        value is not null or question_option_id is not null
    )
);

An "open question" has one option with a blank label which accepts a value. A closed question can have any number of options with any settings.

A design consideration I did not address: what if a user can answer a question multiple times and you want to store the history of their answers?

Should I use NULL or an empty string to represent no data in table column?

I strongly disagree with everyone who says to unconditionally use NULL. Allowing a column to be NULL introduces an additional state that you wouldn't have if you set the column up as NOT NULL. Do not do this if you don't need the additional state. That is, if you can't come up with a difference between the meaning of empty string and the meaning of null, then set the column up as NOT NULL and use empty string to represent empty. Representing the same thing in two different ways is a bad idea.

Most of the people who told you to use NULL also gave an example where NULL would mean something different than empty string. And in those examples, they are right.

Most of the time, however, NULL is a needless extra state that just forces programmers to have to handle more cases. As others have mentioned, Oracle does not allow this extra state to exist because it treats NULL and empty string as the same thing (it is impossible to store an empty string in a column that does not allow null in Oracle).
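
To make the "extra state" concrete, here is a small sketch using Python's built-in sqlite3 module (which, unlike Oracle, does distinguish NULL from an empty string). The two tables and the middle_name column are hypothetical, chosen only to contrast a nullable column with a NOT NULL one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Nullable column: "no middle name" can be stored two different ways.
    CREATE TABLE nullable_person (
        name        TEXT NOT NULL,
        middle_name TEXT
    );
    INSERT INTO nullable_person VALUES ('ann', ''), ('bob', NULL);

    -- NOT NULL column: empty string is the only way to say "empty".
    CREATE TABLE strict_person (
        name        TEXT NOT NULL,
        middle_name TEXT NOT NULL DEFAULT ''
    );
    INSERT INTO strict_person (name, middle_name) VALUES ('ann', '');
    INSERT INTO strict_person (name) VALUES ('bob');
""")

# The same predicate misses bob in the nullable table, because
# NULL = '' is unknown rather than true.
nullable_hits = conn.execute(
    "SELECT count(*) FROM nullable_person WHERE middle_name = ''"
).fetchone()[0]
strict_hits = conn.execute(
    "SELECT count(*) FROM strict_person WHERE middle_name = ''"
).fetchone()[0]
print(nullable_hits, strict_hits)
```

With the nullable column, every query has to remember the second case (middle_name = '' OR middle_name IS NULL); with NOT NULL there is only one case to handle.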

use NULL value or represent NULL with valid value

Why would you use valid data to represent NULL if you have the option to actually use NULL itself?

I do not see any benefit.


