Bulletin Board - Database Optimisation

Bulletin board - Database optimisation

Part I

Revised 09 Dec 10 01:00 EST

Looked at your DDL. Ok. We need to take a step back and organise your database first. That will solve half your problems (your SQL will be straight-forward; and fast; less indices; no temp tables required). For a while I thought, aha, you have your columns, it must be stable, but there is no chance. Top down from scratch, ok. Have a look at this Entity Relation Diagram (no use working on the Data Model, which is Entities, Relations and Attributes, until we get the ERs right), and check that it is correct.

The way to do that is, answer the following questions (short answers are fine). These questions are clarifying the Entities and Business Rules. How you understand databases in general, and your data in particular is crucial. You have come a long way, on your own, so we can take it from there.
I think ▶this post◀ might be helpful to you, in order to understand the formal stages that should be followed; which we are short-circuiting here.
Most important, totally, and completely, forget about the function and any coding requirements. Data has to be modelled independent of the application, simply as Data. Function Modelling is a different science. First get one right; then get the other right; and the two together play beautiful tunes. Try jamming them together; doing both tasks at the same time, and they won't even make a suburban garage band.

For brevity, and the sake of anyone reading this, I with use a Closed and Open Section; when an Open item (discussion) is closed, I will make it concise, and move it to the Closed section. Maintain the numbering, because things sometimes come back to haunt us. You may wish to do the same, or even delete the discussion on your side.

The links for the pretty pictures are at the end.

Apologies: the editing does not work; sub-numbering is inconsistent

Closed Issues

users.bb_locations_csv is a many-to-many relation between users and locations:
- Each of those elements should be an entry in a discrete column, in a discrete row
- One users can have many locations and 1 location can have many users is many-to-many
- Read ▶this post◀ for a discussion of how that is treated and what stage it is dealt with
- At this Logical Stage, that is just a n::n relation, as I have drawn, you can forget about it for now, it will be supplied, simply, when we get to the physical Stage.
- Trust me, I will provide code that in no more complex than ...WHERE IN () for your declared purpose.
- On second thought, if I break your fingers, you will type even slower, so I better not
- Ok, your app is browser based, and the page is dynamic (my advice was for static pages that need to be touched up); go ahead with check boxes.
  
  .
users.bb_categories_csv is many-to-many relation between users and categories
- Ditto.
  
  .
Confirmed: a bulletin (bbs) does not exist without an user; an user issues a bulletin, and that starts the whole cycle; then invites replies and ratings.
3.1 Confirmed: There is really only one bulletin board and it does not exist as a Thing in the database.
3.2 Confirmed: that the org will never have more than one bulletin board, and the classifications and categorisations are all adequately handled by the Category table/function
Deleted.
Confirmed: The difference between bulletins and replies is that replies are dependent on a bulletin to exist, they do not have a title and they are not categorised by location or category because they are dependent on the bulletin itself to exist.
Deleted.
Comments noted. Resolved.

7.1. For each single bulletin submitted by another user, each user can post more than one reply.

7.2. For each single bulletin submitted by an user, that user can post one, or more than one reply.

7.3. Deleted.

7.4. Deleted.

The Data Model now allows more than one reply per user per bulletin; including the User who submitted the bulletin.

.

8. Confirmed: each user can post at most one rating to a bulletin (which can be revoked/changed)

.

9. Confirmed: each user can post at most one rating to a reply (ditto)

10.1. Given: username comes from the organisation and is the unique name that identifies employees. For example emails are username@organisation.org - authentication is done with ldap and this is required in order to connect an retrieve other information about the employees

Confirmed: UserName is an excellent Identifier

10.2. Confirmed: FirstName, LastName ... BirthPlace, etc remain as (the traditional) columns for ensuring People are not duplicated.

.

11. Given: At the moment we can Identify our offices by casual names which are generally know within the organisation, since we only have about 3 main offices and many field offices. So examples would be Washington DC or virginia field office. In total I think we will try and keep the total below 20. I want to record the exact address of each location as well because that could be used to uniquely identify offices to users.

Provided: StateCode+Town as PK; IsMainOffice as boolean.

.

12. Confirmed: Description and Name for Category are required.

.

13. Given:Users will not be able to post to some categories. Only users with sufficiently high rights will have the right to post to certain categories.

Provided: Permission in User, Location, Category is a method of evaluating such rights.

.

14. Confirmed: Location.Administrator is UserId of admin for the Location.

.

15. Given: There will only ever be a need for a like or a dislike. I don't think there needs to be a neutral position because this is the same as just not voting? Liking seems more relevant to bulletin replies that posts to be honest. Ie 'i see your response and instead of writing my own I will just agree with you - the existing bulletin board is somewhat of a social aspect in the orgainsation and I think liking and disliking/agreeing and dissagreeing creates a level of controversy that encourages participation. However liking or disliking a bulletin may not always be entirely appropriate.

15.1 Provided: Like as boolean in BulletinRating and ResponseRating. This will require interpretation on every access.

15.2. When it is no longer a boolean, it can be changed to a RatingCode, and implemented as a Lookup table. The names are then determined by Joins, and interpretation is eliminated. I drew this in the First Data Model, so that you could see what I meant
15.3. Removed in the Second Data Model.

.

16. Confirmed: each user has a home Location (other than the list of Locations that they are interested in).

.

17. Confirmed: Permission as per (13).

.

18. Confirmed: Further Permissions may be be required, as per Data Model.

18.1. If you do this now, you won't have to worry about when organisation decides to prevent a certain Person from posting Responses or Bulletins, or Rating them; and wants that feature implemented yesterday.

18.2. Even if you do not implement it, leave gaps between the values you do implement.

.

19 Confirmed: a Bulletin is about a Location.

19.1. Confirmed: There are no Bulletins without a Location

19.2. Confirmed: There are no Bulletins without a Location.

19.3 Confirmed: There are no Bulletins without a User (declarative). But so far we have no way of constraining that User; therefore any User can inset a Bulletin for any Location ( you could constrain it in code, eg. to Locations each User Is Interested In.

19.4 Confirmed: There are no BulletinRatings without a Bulletin and a rating User.

19.5 Confirmed: There are no Responses without a Bulletin.

19.4 Confirmed: There are no ResponseRatings without a Response and a rating User.

19.7. But, there can be Users, Locations, andCategories`, independently.

.

20. If you do not mind, I will provide naming conventions, etc. They should be self explanatory, and the value will show up only when you start coding SQL. Please ask, if anything isn't. For starters, all names are singular. Mixed Case is easier to read (you are supposed to use capitals for SQL language).

20.1. My experience is table_name as opposed to tableName are really technie forms, and users do not like them; Consistent mixed case is liked by everyone. It is one of those things that is impossible to change, so choose carefully.

.

21. For your need to group tables together, which is good, keep in mind that that is a Physical issue. At the Logical Data Model level, the tables have normal names, uncluttereded by physical issues. Imagine that the physical tables are prefixed with something like (and please use capitals for this):

- REF_ for reference (such as User) and lookup tables

- BUL_ for Bulletin system

.

I am not able to name tables with uppercase letters? Im not sure why. I don't know why I can't have uppercase table names. Is it to do with using MyIsam database tables?

The universal convention is that SQL Language is expressed in upper case; every report and admin tool I have ever used generates such SQL code. So we can't use upper case. Lower case or mixed case only. So the choices boil down to table_name or TableName; we need a separator of some kind. For reasons already provided, I strongly recommend mixed case, capiatlised, and not the OO style with the leading letter uncapitalised.

.

22. rank (all) can be derived directly from the database (remember, do not worry about the code during Data Modelling). If you store it, it is a Normalisation error; a duplicated column; which has to be kept up-to-date; which can get out of synch with the derived value; which is called an Update Anomaly. Fifth Normal Form eliminates Update Anomalies. That is my minimum level of Normalisation, so that is what you will get from me.

22.1. I am not interfering with the sort order or popularity issue at all; in fact, by the sounds of it, you haven't closed that functionality. I am only taking redundant data, the rank column, out, as part of the Normalisation process.

22.2. Here's a ▶Quick Tutorial◀ on the RANK() operator (as it is commonly known). It is not ANSI SQL; it is an Oracle and MS extension. However it is not required if you understand Subqueries, which is why Sybase does not have it. I doubt MySQL has it, so you need to get your head around it. Understanding Scalar Subqueries is a pre-requisite. Sybase syntax, so whack your semi-colons in, etc. Feel free to ask specific questions.

.

I have never seen that approach of writing Rank = (SELECT.... Is that the same as (SELECT ...) as Rank?

I have posted a separate Answer for that.

.

22.3. Needing to understand why, is no problem at all. Only children blindly follow simple rules, and you are certainly not one of them.

.

23. Confirmed: users.total_bulletins is redundant; it can be derived. Removed.

.

24. All your PKs are Ids. Haven't you gotten tired of getting lost in the code yet ? Forget about sticking Idiot PKs on everything that moves, let's find out How your users Identify their Entities; what Entities are truly Independent, and the other which depend on Independent Entities.

24.1. Never use Id or any such form. Where it is a PK, use the full form.

24.2. Call location_id, location_id, wherever it is, including the PK table. The exception is when you need to show the role. This will become clear in the Data Model.

.

25. You have no Declarative Referential Integrity, no Defined Foreign keys. That is bad news for many different reasons. Once these questions are clairified, please add them in. DRI means that as much as possible, if not all, Integrity is Declared in SQL. ISO/IEC/ANSI SQL standard allows for this, but the freeware end of the market does not provide the standard, and is slowly catching up. It means the server will not allow a row in the FK table to be added unless the PK exists in the parent table. MySQL recently provided DRI for Foreign Keys. For FKs, refer to ▶this article◀.

25.1. For CHECK constraints and RULES, you will have to implement those in code.

my foreign keys are like, users-id(fk) = users.id(pk) Im not sure how to add them other that what I have done but will certainly do so once I know how to.

That's not adding them into your db; that's merely referencing columns in a WHERE clause in Data Manipulation Language, not Data Definition Languge. Adding them, so that they function at the db/server level, means declaring them in DDL, as per the linked article. Then MySQL will stop a row from being inserted to a child table (FK) where the parent PK does not exist. That is Referential Integrity. If it is declared in DDL, it is Declarative Referential Integrity.

In addition to enforcement of RI, everyone can see the definition: report tools can be used by the users to access and report from the db, without having to get someone to code a report.

Yes, as far as I know. Confirmed at ▶this site◀. The code I have provided for the subquery uses DRI, so we can test that and get it out of the road early. You have to check for your specific version of MySQL.

Twenty-Five. Comments Noted. I ama not a MySQL specialist. Yes, those are the issues you have to figure out for yourself. In general, from my perusing, MySQL is legless; for anything SQL-ish, you need InnoDB.

But do not let that hold you back. Use Engine=MySQL for now, without the Declarative SQL, and keep going with both the Data Model and the Subquery. Work on InnoDB in the background.

To be clear, the DDL I have provided should work for MyISAM (and "do nothing" in the DRI department, until you get InnoDB).

.

27. Given: I have rethought the sorting requirements for bulletin. Users could sort chronologically- easy,makes sense. Users could sort bulletins by the date of the latest reply to the bulletin. Then we can forget about rank and it should be really easy to sort bulletins chronologially by the time of their last response? What are your thoughts.

Yes. that is sensible and quite common, most people understand chronological order. You will have to mess with the filters they choose in the search window (choose: Location or list; choose: Category or list; choose: My Bulletins or all).

Open Issues

(Nil)

Data Model

Ok, assuming you do not have issues with the ERD, and implementing all Closed Issues, I have modelled the data, and prepared a Fifth Data Model 09 Dec 10 for your review. I definitely need much more feedback, questions, etc, on this. I am experiencing difficulty accepting that it is done. Probably best to start writing real code for your problem areas.

Links

▶Link to IDEF1X Notation◀ You really need to read and understand this, before you read the Data Model.

▶Link to Fifth Bulletin Data Model◀ The Entity Relation Diagram is on the first page, followed by the Data Model.

The Keys are pretty much straight IDEF1X (except for UserId which I provided as a counterpoint); which means purse Relational Keys. Un-enhanced and not optimised for Physical considerations. Before you baulk at them, first notice them, register them, and evaluate them. Of course we can add Idiot keys, but before we do that, let's make sure we understand what we are going to lose.
Notice the Identifiers (solid lines) as per the Notation document. The spine, the vertebrae of the system is Location ... Bulletin ... Response.
Notice that Keys actually implement many Business Rules.
Notice the Natural Hierarchy that I have rendered. See if there is any meaning in it for you.
The VerbPhrases are really important; see if they mean anything.

Comments re First Data Model and Responses

One question I have is that the primary key of the location will be used to form the child primary key?(they are joined by a solid line) I don't really understand that concept

Yes. the PK for Location (above the line) is (StateCode, Town). That PK the two columns together, a compound key, is migrated from Location to Bulletin anyway, as an FK (bold). We are additionally using it to form the Bulletin PK (above the line).

If and when we need a Surrogate key, we will add it. For now, we are working out the Identifiers. So the question to contemplate is:

What is a good Identifier for Bulletin ?, what do your users naturally use to Identify a Bulletin ...
"have you seen the bulletin from Virginia FO yesterday ?",
"Sally from Washington sure writes good bulletins", etc.

or why that relationship does not exist between the user and the bulletin?

Well, that relation cannot exist between User and Bulletin, but a relation exists, the dotted line, meaning UserId is an FK in Bulletin(bold), but not used it to form its PK (below the line).

Or do you mean: the User is a strong Identifier for Bulletin (and therefore should be used to form the BulletinPK, therefore the line should be solid) ?

Fine. Excellent. That is what modelling re Identifiers is all about. That clears up an area that I did not like, in that we had non-unique indices. That resolves my issue as well.

As per intention stated further above, since I have now shown Rating as a table and what the rendering would be, once, I shall remove it
I think Permission should be an Entity.
Bulletin PK is now (StateCode, Town, UserId, SequenceNo). To be clear, SequenceNo is within StateCode, Town, UserId: it will be 5 for Sally's 5th bulletin re MO/Billngs FO.
Note that user Settings BulletinsPerPage,etc, are 1::1 with User, so they are in User; child table would be incorrect.
Typographical errors corrected.

Comments re Second Data Model and Responses

The PKs for both Bulletin and Response have been changed to reflect (7). BulletinNo and ResponseNo have been replaced with BulletinDate and ResponseDate (which used to be CreatedDate), in order to allow multiple replies per User per Bulletin.

Comments re Third Data Model and Responses

Trust you had a good break.

At least 30 years ago (that I am aware of), the giants in the industry had this debate. Names are always singular. Tables are nouns. VerbPhrases are verbs. This is not limited to db naming conventions, it applies to documents, theses, dissertations, etc. You may have 5 conclusions at the end of the the doc, but the section or chapter title, in both the ToC and the top of the page is "Conclusion".
After fighting them all the way through Uni, as soon as I started my first paid programming job, and saw the importance of the rules in the real world, as opposed to the theoretical arguments we had in college, I gave it up as a waste of time. All that time and energy I wasted was released to do productive work. Since then, I don't question the giants; I just accept. That their minds are greater than mine. It is like accepting Standards, or behaving within the law, or God. I have no really, really good reasons for doing anything illegal.
Anyway, the ease of languaging (discussion, SQL, documentation) that is supported by such rules cannot be adequately explained; as you write more and more SQL code, it will become clear.
You are always free to use whatever you want. I deliver singular only.
Fine with me.
But you need to keep in mind, those two elements, in the identified sequence (ala non-PK Unique Index, or Alternate Key) are universally required to establish Uniqueness for a Person. Removing them will result in two things. First, you will no longer be able to identify uniqueness across Users (and thus you may have duplicate rows). Second, the AK becomes non-unique, an Inversion Entry.
The point is (contrary to one of the posts), any column that is 1::1 with theUserPK, should reside inUser. All preference settings. Since we cleaned up theInterestedLocationsandInterestedCategories, I know only of onlyBulletinsPerPageremaining; but I am sure there are others. IsPreference2is an eg. of a boolean;NumPreference3is an eg. of an Integer. Etc. You can tell me what the real Preferences are.
(Let's try that in plural: ... any column that is 1::1 with theUsersPK, should reside inUsers. Just doesn't do it for me, I get hung up on the broken English, and I am a bit precious about my mother tongue.)
Data Model Updated.
Excellent. Let me know when you are comfortable with that, and I will give you the Physical Model.
How about the VerbPhrases ?

Comments re 06 Dec 10 20:38 EST (Small Updates)

.

28. Where there is only one occurrence of PK as an FK, of course, the FK column name is the same as the PK column name. However, when there is more than one occ of the FK (take a look at ResponseRating), there are three UserIds), we need to differentiate them. In IDEF1X terminology this is called Roles. The Role of the User who issued the Bulletin is Issuer, and so on. Obviously it is better to use that name, and keep it consistent throughout the hierarchy (not UserId in Bulletin and then when we get to Response, where there are two, and a differentiation is demanded, change it to IssuerId. I thought you might have a problem with that; in the early stages, the usage is Issuer.UserId so that it is absolutely clear the it is UserId as an FK, and the Role is Issuer; when we get to the physical model, it gets simplified to IssuerId.

Likewise, we have many DateTime columns (Date for short if you like; otherwise Dtm), that need to be differentiated.

.

29. Did the IDEF1X Notation doc not make sense ?

The PK for each table is above the line, in the specified order.
Remember we are carrying the PKs of the parent tables anyway, and if there is meaning, using those FKs to form the child PK.
For Bulletin:
- The Location FK (StateCode, Town) for which it is Issued
- The UserId of the Issuer
- and DateTime it was Issued, to make it unique.
- therefore (StateCode, Town, IssuerId, BulletinDate)`
To delete all ResponseRatings for this Bulletin, use WHERE = on those four Bulletin columns.

.

30. Because (State, Town) is the PK of Location, carrying wherever. And it forms part of the Bulletin PK, so any dependent tables carry those columns because they are carrying the Bulletin PK.

Look for the coloured Tabs (This version only)
.

32. Those are Verb Phrases. The way to read them is detailed in the Notation doc. It appears you have a good handle on it. It is really important to get the table names (and the Verb Phrases) right, because change is difficult after implementation. If you tell me Office is better than Location, that's fine with me.

Read: Office Is Activated By Bulletin

Feel free to supply another Verb Phrase.

AFAIC, the Office is dead to the rest of the org, and only comes alive on their radar (is activated by) the issue of a Bulletin.

I realise it sounds silly here, but ignore that for a moment, something along the lines of "Office expresses its aliveness; advertises its activity, by issuing a Bulletin".

Have a quiz at Mark's Sensor Data Model, for some nice Verb Phrases.

We had previously identified that (State, Town) is the PK, ~~I will leave that as is~~ Refer to (38) for change.

.

33. Worth discussion. Yes, if you are going to display it when (eg) displaying Responses, and the users understand UserName. No, if it is 30 bytes, and there is also an unique 4 byte UserId. The idea is to make these choices consciously, aware of what you are giving up, when you eventually decide that some 6 column 30-byte key is too cumbersome to migrate to the children.

I did state at the outset, I would use UserId as a typical Id Pk, because it is carried/migrated to several child tables.
We can leave how that is created for later. But it is a pure Surrogate PK.

.

34. No problem. Category already has it. I'll change Order to ListOrder.

.

35. Sure. Based on what I have read and heard, I am quite happy with it. But I would like more back-and-forth to achieve some confidence, before you write code. Alternately, view it as a learning experience, and accept that the model and code may change later. Would you like me to produce the Physical now ? If you give me any and all corrections, I will publish the next version. I am expecting preferences in User. Also, quickly run through the functions and check that you have all the columns you need.

Do look at some of the other answers, for the purpose of learning, and interest.

.

36. Joins. You just join on ~~four~~ three columns as opposed to one. SQL is cumbersome with joins, and the new syntax which was supposed to make it easier, is actually more cumbersome. My coders never write joins: we save time and typos. I have a proc that given two or more tables, will generate the code with all the columns and joins. I don't know enough of MySQL to convert that for you.

Data Model Updated.

.

Comments re 08 Dec 10 20:49, Fourth Data Model and Responses

.

Check the previous section immediately above, there are small updates.

IDEF1X: Your speed is fine.

Note the child always "inherits" the Parent PK, as an FK (either solid or broken line), otherwise there is no Relation between them. By using these columns that exist in the child anyway, to form the child PK, we carry the meaning (and that is the difference between solid and broken). And thus we do not need to look for an independent Identifier for the child. The Relational power in this method will become clear later, when you are coding.

The section we are dealing with is about Identifiers: natural vs unatural; meaningful vs meaningless. Later you will see how we can use the Relational capability of the engine, when the child PK is formed from the parent PK. (Isn't your surname the same as your father's ?)

It is also important to understand Relational databases and their capability. That is lost when we approach the database (eg) from an OO perspective, and treat it as a location to make our classes "persistent". Therefore, we will try to learn and use Relational terms. It gets difficult when you go to France and expect that they speak American, and use the same currency; learn to speak 10 words of French, and they welcome you with open arms, and you'll have quite a different experience with the locals.

Anyway, go ahead with implementing the model. Just realise we will probably make a change at some point. Save all your DDL. Save all your test data as insert statements or as a table backup or character format export (no idea what MySQL can/cannot do in this area).
.

37.1. Handled, the n::n Relation with Office & Category. You will only "see" that when we get to the Physical Model.

37.2. Done.

37.3 Done.

.

38. Excellent. Shorter as well. Note they will never be able to have two Offices in the same Zip Code. NUMERIC(5,0) is good, but I thought the US was moving towards 7 digits. Doesn't matter, you can figure it out; it is an excellent PK for Office. Now this column, which was part of Address, probably ZipCode, has been elevated to a higher purpose, without duplication; since we are carrying it in 5 child tables, and we want the PK name to be clear, as per previously explained conventions, we will call it OfficeCode; OfficeZipCode might be silly.

How can I further optimize a derived table query which performs better than the JOINed equivalent?

Well, I found a solution. It took a lot of experimentation, and I think a good bit of blind luck, but here it is:

CREATE TABLE magic ENGINE=MEMORY
SELECT
  s.shop_id AS shop_id,
  s.id AS shift_id,
  st.dow AS dow,
  st.start AS start,
  st.end AS end,
  su.user_id AS manager_id
FROM shifts s
JOIN shift_times st ON s.id = st.shift_id
JOIN shifts_users su ON s.id = su.shift_id
JOIN shift_positions sp ON su.shift_position_id = sp.id AND sp.level = 1

ALTER TABLE magic ADD INDEX (shop_id, dow);

CREATE TABLE tickets_extra ENGINE=MyISAM
SELECT 
  t.id AS ticket_id,
  (
    SELECT m.manager_id
    FROM magic m
    WHERE DAYOFWEEK(t.created) = m.dow
    AND TIME(t.created) BETWEEN m.start AND m.end
    AND m.shop_id = t.shop_id
  ) AS manager_created,
  (
    SELECT m.manager_id
    FROM magic m
    WHERE DAYOFWEEK(t.resolved) = m.dow
    AND TIME(t.resolved) BETWEEN m.start AND m.end
    AND m.shop_id = t.shop_id
  ) AS manager_resolved
FROM tickets t;
DROP TABLE magic;

Bulletin Board - Database Optimisation