Historical/Auditable Database

Revised 01 Jan 11

Ok, so based on your questions and comments, there is a gap between where I sit (delivering fully auditable databases; yours being a particular requirement of that) and where you sit, which we will probably work out in the commentary. Here's a position to start from.

  • To provide this requirement, there is no need at all for: triggers; mass duplication; broken integrity; etc.

  • This is not a classic Temporal requirement, either, so there is no need for the "period" capability, though you can use it if you wish.

  • ValidFrom plus ValidTo is a Normalisation error: ValidTo is data that is easily derived; the ValidTo in any row is duplicated in the ValidFrom of the next row; you have an Update Anomaly (when you update one column in one row, you additionally have to update the other column in the next row); and you have to use a dummy value for "current".

    • All unnecessary: use ValidFrom only, and keep the db clean and pure 5NF (a sketch of the derivation follows this list).

    • The Caveat is, if PostgreSQL can't perform Subqueries without falling in a heap (à la Oracle), then fine, keep ValidTo.
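
As a sketch of that derivation (using the LocationHistory table introduced below, keyed on LocationId plus AuditedDtm; ValidTo here is computed, never stored):

SELECT  LocationId,
        AuditedDtm AS ValidFrom,
        -- Derived: the next row's start for the same key; NULL means "current"
        (SELECT MIN(LH2.AuditedDtm)
            FROM LocationHistory LH2
            WHERE LH2.LocationId = LH.LocationId
            AND LH2.AuditedDtm > LH.AuditedDtm
        ) AS ValidTo
    FROM LocationHistory LH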

All of these things are editable in the system by users, and deletable.

Well, no. It is a database holding important information, with Referential Integrity; not a scratchpad, so the user cannot just walk up to it and "delete" something. That would contradict the same user's requirement for maintaining historical data (in the Reading; Alert; Ack; Action; Download).

  • Cascading deletes are not allowed. Those functions are check boxes for non-databases, MS Access types. For real databases, the RI constraints stop parents with children from being deleted.

  • Primary Keys cannot (should not) be changed. Eg. UserId; LocationId; NetworkSlaveCode never change; remember, they are carefully considered Identifiers. One characteristic of PKs is that they are stable.

  • You can add new Users; you can change a current User's name; but you cannot delete a User who has entries in Download, Acknowledgement, Action.

Basically if it's editable then it has to be historical (so that excludes readings and alerts).

Also excludes: Downloads; Acknowledgements; Actions.

And the Reference tables: SensorType; AlertType; ActionType.

And the new History tables: they are inserted into, but they cannot be updated or deleted.

The problem I find with the IsObsolete flag is... say you change the Location: the Sensor foreign key will now point to an obsolete record, meaning you will have to duplicate every Sensor record. This problem gets exponentially worse as the hierarchy gets bigger.

  • Ok, so now do you understand that the LocationId (FK) in Sensor will not change, and there is no mass duplication, etc.? There is no problem in the first place (though there is in that stupid book!) that gets exponentially worse in the second place.

  • IsObsolete is inadequate for your requirement. (Refer below)

  • The UpdatedDtm in any real row (Reading, etc) identifies the Parent (FK to Sensor) History row (its AuditedDtm) that was in effect at the time (see the sketch after this list).

  • Full Relational capability; Declarative Referential Integrity, etc.

  • Maintain the IDEF1X, Relational concept of strong Identifiers ... There is only one Current parent row (eg. Location)

  • The rows in the History are Images of the current row, before it was changed, at the stated AuditedDtm. The Current row (non-history) shows the one last UpdatedDtm, when the row was changed.

  • The AuditedDtm shows the entire series of UpdatedDtms for any given key; and thus I have used it to "partition" the real key in a temporal sense.
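
A sketch of that identification (names per the Data Model; the @-parameters are illustrative): the Sensor image in effect at a given Reading's time is the History row with the earliest AuditedDtm at or after that time; if no such row exists, the current Sensor row applies.

SELECT  *
    FROM SensorHistory SH
    WHERE SH.LocationId = @LocationId
    AND SH.SensorNo = @SensorNo
    -- Before-images: the first image taken at or after the Reading's time
    AND SH.AuditedDtm = (SELECT MIN(AuditedDtm)
        FROM SensorHistory
        WHERE LocationId = @LocationId
        AND SensorNo = @SensorNo
        AND AuditedDtm >= @ReadingDtm)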

All that is required is a History table for each changeable table. I have provided the History tables for the four Identifying tables: Location; Sensor; NetworkSlave; and User.

Please read this for understanding Auditable in the accounting sense.

Data Model

Link to Sensor Data Model with History (Page 2 contains the History tables and context).

Readers who are not familiar with the Relational Modelling Standard may find IDEF1X Notation useful.

Response to Comments

(1) My first issue is that of referential integrity with the historic data, in that I'm not sure there is any, and if there is I'm not sure how it works. For instance, in SensorHistory it would be possible to add a record that had an UpdatedDtm indicating a date time before the Location itself existed, if you see what I mean. Whether this is actually an issue I'm not sure - enforcing that might be over the top.

(You raised a similar issue in the other question.) It may be that the dbs you have experienced did not actually have the Referential Integrity in place; that the Relation lines were there just for documentation; that the RI was "implemented in app code" (which means there is no RI).

This is an ISO/IEC/ANSI Standard SQL database. That allows Declarative Referential Integrity. Every Relation line is implemented as a PK::FK Reference, an actual Constraint that is Declared. Eg:

CREATE TABLE Location
...
CONSTRAINT UC_PK
    PRIMARY KEY (LocationId)
...
CREATE TABLE Sensor
...
CONSTRAINT UC_PK
    PRIMARY KEY (LocationId, SensorNo),
CONSTRAINT Location_Sensor_fk
    FOREIGN KEY (LocationId)
    REFERENCES Location (LocationId)
...
CREATE TABLE SensorHistory
...
CONSTRAINT UC_PK
    PRIMARY KEY (LocationId, SensorNo, UpdatedDtm),
CONSTRAINT Sensor_SensorHistory_fk
    FOREIGN KEY (LocationId, SensorNo)
    REFERENCES Sensor (LocationId, SensorNo)
...

Those Declared Constraints are enforced by the server; not via triggers; not in app code. That means:

  • A Sensor with a LocationId that does not exist in Location cannot be inserted
  • A LocationId in Location that has rows in Sensor cannot be deleted
  • A SensorHistory with a LocationId+SensorNo that does not exist in Sensor cannot be inserted
  • A LocationId+SensorNo in Sensor that has rows in SensorHistory cannot be deleted.

(1.1) All columns should have RULEs and CHECK Constraints to constrain their range of values. That is in addition to the fact that all INSERT/UPDATE/DELETEs are programmatic, within stored procs, so accidents do not happen, and people do not walk up to the database and run commands against it (except SELECTs).
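
A minimal sketch of such a CHECK Constraint (the column name and the range are assumptions, for illustration only):

ALTER TABLE Reading
    ADD CONSTRAINT Reading_Value_ck
    CHECK (ReadingValue BETWEEN -50 AND 150)  -- hypothetical valid range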

Generally I stay away from triggers. If you are using stored procs, and the normal permissions, then this:

in SensorHistory it would be possible to add a record that had an UpdatedDtm indicating a date time before the Location itself existed, if you see what I mean

is prevented. So is inserting a SensorHistory row with an UpdatedDtm earlier than the Sensor row itself. But procs are not Declarative Rules. However, if you want to be doubly sure (and I mean doubly, because the INSERTs are all via a proc, not direct commands from users), then sure, you have to use a trigger. For me, that is over the top.
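
For completeness, a sketch of the kind of guard the proc performs (T-SQL-style; parameter names are illustrative). It keeps the series of UpdatedDtms monotonic, which is what prevents an image dated before the row existed:

-- Each new UpdatedDtm must post-date the current row's UpdatedDtm
IF @UpdatedDtm <= (SELECT UpdatedDtm
    FROM Sensor
    WHERE LocationId = @LocationId
    AND SensorNo = @SensorNo)
BEGIN
    RAISERROR ('UpdatedDtm precedes the current Sensor row', 16, 1)
    RETURN
END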

(2) how do I indicate deletion? I could just add a flag to the non-historical version of the table I guess.

Not sure yet. Eg. Do you accept that when a Sensor is deleted, it is final ... (yes, history is maintained) ... and then when a new Sensor is added to the Location, it will have a new SensorNo ... there is no Sensor being logically replaced with the new one, with or without a gap in time ?

From an end-user's point of view, via the software, they should be able to add, edit and delete sensors at will, with no limitation. But yes, once deleted it is deleted and cannot be undeleted. There's nothing to stop them re-adding a sensor later with the exact same parameters, though.

And "delete" Locations, NetworkSlaves, and Users as well.

Ok. Then the new Sensor with the same parameters is truly new; it has a new SensorNo, and is independent of any previous logical Sensor. We can add an IsObsolete BOOLEAN to the four identifying tables; it is now identified as adequate. The Delete is now a Soft Delete.

(2.1) For NetworkSensor and LoggerSensor, which are actually dependent on two parents: they are obsolete if either of their parents is obsolete. So there is no point giving them an IsObsolete column; it would have a dual meaning, and it can be derived from the applicable parents (a sketch follows).
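
A sketch of that derivation (NetworkSensor's key columns are assumed from the model; adjust to the actual DDL):

-- NetworkSensor is obsolete if either parent is obsolete: no stored flag needed
SELECT  NS.*,
        CASE WHEN S.IsObsolete = 1 OR NSL.IsObsolete = 1
             THEN 1 ELSE 0 END AS IsObsolete
    FROM NetworkSensor NS
    JOIN Sensor S         ON S.LocationId = NS.LocationId
                         AND S.SensorNo   = NS.SensorNo
    JOIN NetworkSlave NSL ON NSL.NetworkSlaveCode = NS.NetworkSlaveCode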

(2.2) Just to be clear, users cannot delete any rows from any Transaction and History tables, right?

(3) When updating a table, what method would be best to insert the new row in the historical table and update the main table? Just normal SQL statements inside a transaction maybe?

Yes. That is the classic use of a Transaction, as per the ACID Properties: it is Atomic; it either succeeds in toto or fails in toto (to be retried later, when the problem is fixed).
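
A sketch of such a Transaction as a stored proc (T-SQL-style; the non-key column is an assumption for illustration, and error handling is elided):

CREATE PROCEDURE Sensor_Update
    @LocationId INT,
    @SensorNo   INT,
    @SensorName VARCHAR(30)  -- hypothetical changeable column
AS
BEGIN TRANSACTION
    -- Copy the before-image of the current row into the History table;
    -- its AuditedDtm is the old row's UpdatedDtm
    INSERT INTO SensorHistory (LocationId, SensorNo, AuditedDtm, SensorName)
        SELECT LocationId, SensorNo, UpdatedDtm, SensorName
            FROM Sensor
            WHERE LocationId = @LocationId
            AND SensorNo = @SensorNo
    -- Then change the current row, stamping the new UpdatedDtm
    UPDATE Sensor
        SET SensorName = @SensorName,
            UpdatedDtm = CURRENT_TIMESTAMP
        WHERE LocationId = @LocationId
        AND SensorNo = @SensorNo
COMMIT TRANSACTION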

(4) Referenced Book

The definitive and seminal text is Temporal Data and the Relational Model, by C J Date, H Darwen and N A Lorentzos. As in, those of us who embrace the RM are familiar with the extensions, and with what is required in the successor to the RM, rather than some other method.

The referenced book is horrible, and free. The PDF isn't a real PDF (no search; no indexing). Set against my MS and Oracle texts, it is telling: a few good bits couched in lots of fluff, and many misrepresentations. Not worth responding to in detail (if you want a proper review, open a new question).

(4.1) ValidTo in addition to ValidFrom: a serious mistake (as identified at the top of my answer), which the book makes, then laboriously solves. Don't make the mistake in the first place, and you have nothing to solve in the second place. As I understand it, that will also eliminate your triggers.

(4.2) Simple rules, taking both Normalisation and Temporal requirements into account. First and foremost, you need to deeply understand (a) the temporal requirement and (b) the DataTypes, their correct usage and limitations. Always store (a combined sketch follows this list):

  • Instant as DATETIME, eg. UpdatedDtm

  • Interval as INTEGER, clearly identifying the Unit in the column name, eg. IntervalSec

  • Period. Depends on conjunct or disjunct.

    • For conjunct, which this requirement is, (4.1) applies: use one DATETIME; the end of the period can be derived from the beginning of the period of the next row.
    • For disjunct periods, yes, you need 2 x DATETIMEs, eg. RentedFrom and RentedTo, with gaps in-between.
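
Putting those rules into one hypothetical DDL sketch (the table and column names are illustrative):

CREATE TABLE TemporalExample (
    UpdatedDtm  DATETIME NOT NULL,  -- Instant
    IntervalSec INTEGER  NOT NULL,  -- Interval; the Unit is in the column name
    RentedFrom  DATETIME NOT NULL,  -- disjunct Period: both ends stored
    RentedTo    DATETIME NOT NULL
)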

(4.3) They mess with the "Temporal Primary Key", which complicates code (in addition to requiring triggers to control the Update Anomaly). I have already delivered a clean (tried and tested) Temporal Primary Key.

(4.4) They mess with dummy values, non-real values, and Nulls for "Now". I do not allow such things in a database. Since I am not storing the duplicated ValidTo, I do not have the problem, there is nothing to solve.

(4.5) One has to wonder why a 528 page "textbook" is available free on the web, in poor PDF form.

(5) I [a User] could quite happily delete all the LocationHistory rows, for instance (leaving only the current version in the Location table), even though there may exist a SensorHistory row that conceptually "belongs" to a previous version of the Location, if that makes sense.

It does not make sense to me, there is still a gap in the communication we have to close. Please keep interacting until it is closed.

  • In a real (standard ISO/IEC/ANSI SQL) database, we do not GRANT INSERT/UPDATE/DELETE permission to users. We GRANT SELECT and REFERENCES only (to chosen users). All INSERT/UPDATE/DELETEs are coded in Transactions, which means stored procs. Then we GRANT EXEC on each stored proc to selected users (use ROLES to reduce administration). A sketch follows this list.

    • Therefore no one can delete from any table without executing a proc.

    • Do not write a proc to delete from any History table. These rows should not be deleted. In this case, the non-permission and the non-existence of code is the Constraint.

    • Technically, all History rows are valid; there is no Period to concern yourself with. The oldest LocationHistory row contains the before-image of the original Location row before it was changed. The youngest LocationHistory row is the before-image of the current Location row. Every LocationHistory row in-between is thus valid, and applies to the Period in-between.

    • No need to "prune" or look for a few LocationHistory rows that can be deleted on the basis that they apply to a Period that is not used: they are all used. (Definitively, without the need for checking for any mapping of Location children to any LocationHistory row(s), to prove it.)

    • Bottom line: a User cannot delete from any History (or Transaction) table.

    • Or do you mean something different again ?

    • Note I have added (1.1) above.
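
The permission scheme, as a sketch (the role and proc names are illustrative):

-- Users get read access only
GRANT SELECT ON Location TO browser_role
-- All changes go through Transactions (stored procs)
GRANT EXEC ON Sensor_Update TO operator_role
-- No INSERT/UPDATE/DELETE is granted on any table, and no proc exists
-- that deletes from a History table: that is the Constraint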

(6) Corrected one mistake in the DM. An Alert is an expression of a Reading, not a Sensor.

(7) Corrected the Business Rules in the other question/answer to reflect that; and the new rules exposed in this question.

(8) Do you understand/appreciate, that since we have a fully IDEF1X compliant model, re Identifiers:

  • The Identifiers are carried through the entire database, retaining their power. Eg. when listing Acknowledgements, they can be joined directly with Location and Sensor; the tables in-between do not have to be read (and they must be, if Id keys are used). This is why there are in fact fewer joins required in a Relational Database (and more joins required in an unnormalised one). See the sketch after this list.

  • the Subtypes, etc need to be navigated only when that particular context is relevant.
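
A sketch of the direct join (Acknowledgement's carried-down key columns are assumed from the model):

-- The compound key (LocationId, SensorNo) is carried down to Acknowledgement,
-- so it joins straight to Location and Sensor; no intermediate tables are read
SELECT  A.*
    FROM Acknowledgement A
    JOIN Location L ON L.LocationId = A.LocationId
    JOIN Sensor S   ON S.LocationId = A.LocationId
                   AND S.SensorNo   = A.SensorNo
    WHERE L.LocationId = @LocationId  -- filter directly on the Identifier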

Laravel Auditable History Tables - Pro & Cons of SQL History or an auditing package?

I would use the package, mainly because of the ease of use and configuration.

In regards to the cons you mention:

  • single audit table containing all auditable changes of all tables
  • hard to read on DB viewers

The first, I can't really say it's a bad thing, since it makes it easy to relate a user to all the changes done across different models.

Otherwise, if you had an audit table per model (order_audits, customer_audits, ...), you would have to use JOIN statements for simple things like getting the total number of changes a user made on the system, for instance.

As for the second reason you point out, I'm assuming it's because some of the data is stored as JSON. If that's the case, you could always convert the column types that store that data from TEXT to JSON (covered in the documentation).

One of the benefits (on RDBMSs that support it) is that you can use WHERE clauses on JSON-type columns to apply filtering, and given that the JSON type has been around for a while, I bet there are database viewers that can display the data properly, instead of showing a string of JSON.
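
A sketch of both points (MySQL syntax; the audits, old_values and new_values names follow the package's defaults, but verify them against your installation):

-- Store the serialised changes as JSON rather than TEXT ...
ALTER TABLE audits
    MODIFY old_values JSON,
    MODIFY new_values JSON;

-- ... then filter on individual fields inside the JSON
SELECT *
    FROM audits
    WHERE JSON_UNQUOTE(JSON_EXTRACT(new_values, '$.status')) = 'approved';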

Database design for audit logging

One method that is used by a few wiki platforms is to separate the identifying data and the content you're auditing. It adds complexity, but you end up with an audit trail of complete records, not just listings of fields that were edited that you then have to mash up to give the user an idea of what the old record looked like.

So for example, if you had a table called Opportunities to track sales deals, you would actually create two separate tables:

Opportunities
Opportunities_Content (or something like that)

The Opportunities table would have information you'd use to uniquely identify the record and would house the primary key you'd reference for your foreign key relationships. The Opportunities_Content table would hold all the fields your users can change and for which you'd like to keep an audit trail. Each record in the Content table would include its own PK and the modified-by and modified-date data. The Opportunities table would include a reference to the current version as well as information on when the main record was originally created and by whom.

Here's a simple example:

CREATE TABLE dbo.Page(
ID int PRIMARY KEY,
Name nvarchar(200) NOT NULL,
CreatedByName nvarchar(100) NOT NULL,
CurrentRevision int NOT NULL,
CreatedDateTime datetime NOT NULL
)

And the contents:

CREATE TABLE dbo.PageContent(
PageID int NOT NULL,
Revision int NOT NULL,
Title nvarchar(200) NOT NULL,
[User] nvarchar(100) NOT NULL, -- bracketed: USER is a reserved word in T-SQL
LastModified datetime NOT NULL,
Comment nvarchar(300) NULL,
Content nvarchar(max) NOT NULL,
Description nvarchar(200) NULL
)

I would probably make the PK of the contents table a multi-column key from PageID and Revision provided Revision was an identity type. You would use the Revision column as the FK. You then pull the consolidated record by JOINing like this:

SELECT * FROM Page
JOIN PageContent ON CurrentRevision = Revision AND ID = PageID

There might be some errors up there...this is off the top of my head. It should give you an idea of an alternative pattern, though.

Best design for a changelog / auditing database table?

In the project I'm working on, the audit log also started from a very minimalistic design, like the one you described:

event ID
event date/time
event type
user ID
description

The idea was the same: to keep things simple.

However, it quickly became obvious that this minimalistic design was not sufficient. The typical audit boiled down to questions like this:

Who the heck created/updated/deleted a record 
with ID=X in the table Foo and when?

So, in order to be able to answer such questions quickly (using SQL), we ended up with two additional columns in the audit table:

object type (or table name)
object ID

That's when the design of our audit log really stabilized (and it has been stable for a few years now).

Of course, the last "improvement" would work only for tables that had surrogate keys. But guess what? All our tables that are worth auditing do have such a key!
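
Putting it together, a sketch of the stabilized design and the kind of query it enables (all names are illustrative):

CREATE TABLE audit_log (
    event_id    INT PRIMARY KEY,
    event_time  DATETIME    NOT NULL,
    event_type  CHAR(1)     NOT NULL,   -- 'I', 'U' or 'D'
    user_id     INT         NOT NULL,
    description VARCHAR(500),
    object_type VARCHAR(60) NOT NULL,   -- table name
    object_id   INT         NOT NULL    -- surrogate key of the affected row
);

-- "Who created/updated/deleted the record with ID=X in table Foo, and when?"
SELECT user_id, event_type, event_time
    FROM audit_log
    WHERE object_type = 'Foo'
    AND object_id = 42
    ORDER BY event_time;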

How to Store Historical Data

Supporting historical data directly within an operational system will make your application much more complex than it would otherwise be. Generally, I would not recommend doing it unless you have a hard requirement to manipulate historical versions of a record within the system.

If you look closely, most requirements for historical data fall into one of two categories:

  • Audit logging: This is better off done with audit tables. It's fairly easy to write a tool that generates scripts to create audit log tables and triggers by reading metadata from the system data dictionary (a small sketch follows this answer). This type of tool can be used to retrofit audit logging onto most systems. You can also use this subsystem for changed data capture if you want to implement a data warehouse (see below).

  • Historical reporting: Reporting on historical state, 'as-at' positions, or analytical reporting over time. It may be possible to fulfil simple historical reporting requirements by querying audit logging tables of the sort described above. If you have more complex requirements, then it may be more economical to implement a data mart for the reporting than to try and integrate history directly into the operational system.

    Slowly changing dimensions are by far the simplest mechanism for tracking and querying historical state and much of the history tracking can be automated. Generic handlers aren't that hard to write. Generally, historical reporting does not have to use up-to-the-minute data, so a batched refresh mechanism is normally fine. This keeps your core and reporting system architecture relatively simple.

If your requirements fall into one of these two categories, you are probably better off not storing historical data in your operational system. Separating the historical functionality into another subsystem will probably be less effort overall and produce transactional and audit/reporting databases that work much better for their intended purpose.
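
As a sketch of the metadata-driven generation mentioned under audit logging (standard information_schema; the generator itself is left out, and the table name is illustrative):

-- Read a table's columns from the data dictionary; a generator would emit
-- a matching audit table plus triggers from this result set
SELECT column_name, data_type
    FROM information_schema.columns
    WHERE table_name = 'Opportunities'
    ORDER BY ordinal_position;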

How to keep an audit/history of changes to the table

There are two common ways of creating audit trails.

  1. Code your data access layer.
  2. In the database itself using triggers.

There are advantages and disadvantages to both. Some people prefer one over the other. It's often down to the type of app and the type of database use you can expect.

If you do it in your DA layer it's pretty much up to you. You just need to add code to every method that saves to the database to also save a log of the changes. This auditing code could be in your DA layer code, or even in your stored procs in your database if you are using stored procs for everything. Essentially the premise is the same, any time you make a change to the database, log that change.

If you want to go down the triggers route, you can write custom triggers for each table, or fashion a more generic trigger that works the same on lots of tables. Check out this article on audit triggers. This works by firing off triggers whenever a change is made, and the triggers log the changes. Remember that if you want to audit SELECT statements, you can't use triggers; you'll have to do that with in-code/stored proc auditing. It's also worth remembering that, depending on your database, triggers may not fire in all circumstances. For example, most databases don't fire triggers during TRUNCATE statements. Check that your triggers get fired in every case that you need audited.
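
A per-table audit trigger, as a minimal T-SQL-style sketch (the table and audit columns are illustrative):

-- Log every UPDATE to Foo into Foo_Audit, capturing who changed what and when
CREATE TRIGGER trg_Foo_AuditUpdate
ON Foo
AFTER UPDATE
AS
BEGIN
    INSERT INTO Foo_Audit (FooId, change_type, changed_by, changed_at)
    SELECT d.FooId, 'U', SUSER_SNAME(), GETDATE()
        FROM deleted d   -- "deleted" holds the before-image rows
END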

Alternatively, you could take a look at using the Service Broker to do async auditing on a dedicated machine. This is more complex, and takes a bit of configuring to set up.

Whichever way you do it, you need to decide on the format the audit log will take. Normally you would save this log in your database, but you could save it to a log file or whatever suits your requirements. You could use a single audit table that logs all changes, or you could have an audit table per main table being audited. For large-scale implementations, you could even consider putting the audit tables in a totally separate database. If you're logging into a table, it's common to have a "change type" field which indicates whether the audited change was an insert, update or delete style of change, along with the changed data, the user who made the change, and the date/time the change was made. Don't forget to include the old and new data for update-style changes.

How do you create an audit trail for aggregate roots?

In case of rollback, wouldn't you be doing it on a per-table basis?
Assume the only change made to the database since time T-1 was an update of tblOrders. In this case:

  1. tblOrders would be rolled back to time T-1: values from the audit table will be used to bring tblOrders back to how it was at T-1 (a sketch follows this list).

  2. tblOrdersLines would be rolled back to time T-1: there is no entry in tblOrdersLineAudit, and hence nothing will be updated.

At the end, your tables are in the state they were in at T-1.
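
A rough sketch of step 1, assuming a clone-style audit table and T-SQL UPDATE...FROM syntax (all names are hypothetical):

-- Restore each order to its last audited image at or before @T
UPDATE o
    SET o.Status = a.Status,
        o.Quantity = a.Quantity
    FROM tblOrders o
    JOIN tblOrdersAudit a ON a.OrderId = o.OrderId
    WHERE a.AuditDtm = (SELECT MAX(AuditDtm)
        FROM tblOrdersAudit
        WHERE OrderId = o.OrderId
        AND AuditDtm <= @T)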

A few links for more info:

  • How to version control a record in a database
  • Database history for client usage
  • Using Triggers to Track Database Action History

Audit tables: Each field for table or one table

Which is a better design, one table that keep the history of
transactions or one field for each table? (Pro and cons)

Rather than focus on the two choices, here's an answer covering the four approaches I've worked with over the years, each with its pros and cons.

1. Just three fields

Just add three fields (last_action, time_stamp, update_user) to every table and call it a day.

Pros: Super easy. Performs well.

Cons: You can't report on data you don't have, so this structure tells you almost nothing (except for deletes).

2. Clone table

Each table has a copy, plus the three audit fields, and every time a user changes a record, the audit table gets inserted into (a sketch follows this option's cons).

Pros: Performs pretty well. Easy to create a row-by-row history that the user can dig through.

Cons:

  • Every change to the base table needs a corresponding change to the audit table.
  • If the users don't want a row-by-row history to dig through, and instead want a report of what exactly changed, it can get nasty in a hurry. See the answers to How can I write a query to extract individual changes from snapshots of data?
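
For reference, a minimal sketch of the clone-table shape (Foo and its columns are illustrative):

-- Foo_Audit mirrors Foo, plus the three audit fields
CREATE TABLE Foo_Audit (
    FooId       INT         NOT NULL,  -- copies of Foo's columns
    Name        VARCHAR(50) NULL,
    last_action CHAR(1)     NOT NULL,  -- 'I', 'U' or 'D'
    time_stamp  DATETIME    NOT NULL,
    update_user VARCHAR(50) NOT NULL
)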

3. History Table only

There's no base table, only a history table.
This is basically the same as Clone table, except now you always have to get the current record.

Pros: The pros of option 2, but everything's an insert. Less maintenance than option 2.

Cons: You'll end up losing the maintenance gain, because you'll end up maintaining views, or you'll be sprinkling get-the-current-record logic all over the place (a sketch follows).
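
That get-the-current-record logic, as a sketch (names illustrative):

-- The "current" row is simply the newest history row per key
CREATE VIEW Foo_Current AS
SELECT *
    FROM Foo_History fh
    WHERE fh.time_stamp = (SELECT MAX(time_stamp)
        FROM Foo_History
        WHERE FooId = fh.FooId)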

4. Generic audit table

This table has four columns (Table_name, Column_name, old_value, new_value), plus the three audit fields.

Pros: Easy to set up and maintain.

Cons:

  • It's unintuitive, but it takes up a lot of space, because your old_value and new_value fields have to be nvarchar(max) or equivalent, so they can accept anything that's in your base table.

  • Performs poorly on reads and writes.

  • It's a pain to set up a row-by-row history report.

  • If there's any kind of workflow in the records, audit reporting can become non-trivial. For example, you get a requirement that users only want to see changes that occurred after the status on the record became 'approved'. That's hard even in options 2 and 3, but it becomes a disaster in the generic audit approach.

Summary

I prefer #2, the Clone table approach, as it seems to work best for me. I've had issues with #1 being insufficient, and #4 can be a serious performance nightmare that requires a lot of work to undo.


