How to Design a Database Schema to Support Tagging with Categories

How to design a database schema to support tagging with categories?

This is yet another variation on the Entity-Attribute-Value design.

A more recognizable EAV table looks like the following:

CREATE TABLE vehicleEAV (
  vid        INTEGER,
  attr_name  VARCHAR(20),
  attr_value VARCHAR(100),
  PRIMARY KEY (vid, attr_name),
  FOREIGN KEY (vid) REFERENCES vehicles (vid)
);

Some people force attr_name to reference a lookup table of predefined attribute names, to limit the chaos.

What you've done is simply spread an EAV table over three tables, but without improving the order of your metadata:

CREATE TABLE vehicleTag (
  vid         INTEGER,
  cid         INTEGER,
  tid         INTEGER,
  PRIMARY KEY (vid, cid),
  FOREIGN KEY (vid) REFERENCES vehicles(vid),
  FOREIGN KEY (cid) REFERENCES categories(cid),
  FOREIGN KEY (tid) REFERENCES tags(tid)
);

CREATE TABLE categories (
  cid        INTEGER PRIMARY KEY,
  category   VARCHAR(20) -- "attr_name"
);

CREATE TABLE tags (
  tid        INTEGER PRIMARY KEY,
  tag        VARCHAR(100) -- "attr_value"
);

If you're going to use the EAV design, you only need the vehicleTags and categories tables.

CREATE TABLE vehicleTag (
  vid         INTEGER,
  cid         INTEGER,     -- reference to "attr_name" lookup table
  tag         VARCHAR(100, -- "attr_value"
  PRIMARY KEY (vid, cid),
  FOREIGN KEY (vid) REFERENCES vehicles(vid),
  FOREIGN KEY (cid) REFERENCES categories(cid)
);

But keep in mind that you're mixing data with metadata. You lose the ability to apply certain constraints to your data model.

How can you make one of the categories mandatory (a conventional column uses a NOT NULL constraint)?
How can you use SQL data types to validate some of your tag values? You can't, because you're using a long string for every tag value. Is this string long enough for every tag you'll need in the future? You can't tell.
How can you constrain some of your tags to a set of permitted values (a conventional table uses a foreign key to a lookup table)? This is your "softtop" vs. "soft top" example. But you can't make a constraint on the tag column because that constraint would apply to all other tag values for other categories. You'd effectively restrict engine size and paint color to "soft top" as well.

SQL databases don't work well with this model. It's extremely difficult to get right, and querying it becomes very complex. If you do continue to use SQL, you will be better off modeling the tables conventionally, with one column per attribute. If you have need to have "subtypes" then define a subordinate table per subtype (Class-Table Inheritance), or else use Single-Table Inheritance. If you have an unlimited variation in the attributes per entity, then use Serialized LOB.

Another technology that is designed for these kinds of fluid, non-relational data models is a Semantic Database, storing data in RDF and queried with SPARQL. One free solution is RDF4J (formerly Sesame).

Database Design for Tagging

About ANDing: It sounds like you are looking for the "relational division" operation. This article covers relational division in concise and yet comprehendible way.

About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", like digiguru suggests: It sounds like a complicated situation whenever new tags are added(?) But some DBMSes (including Oracle) offer bitmap indexes which may somehow be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them in a proper when when performing the query plan.

Recommended SQL database design for tags or tagging

Three tables (one for storing all items, one for all tags, and one for the relation between the two), properly indexed, with foreign keys set running on a proper database, should work well and scale properly.

Table: Item
Columns: ItemID, Title, Content

Table: Tag
Columns: TagID, Title

Table: ItemTag
Columns: ItemID, TagID

Is there an agreed ideal schema for tagging

There are various schemas which are effective, each with their own performance implications for the common queries you'll need as the number of tagged items grows:

http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/
http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/

Personally, I like having a tag table and a link table which associates tags with items, as it's denormalized (no duplication of tag names) and I can store additional information in the link table (such as when the item was tagged) when necessary.

You can also add some denormalised data if you're feeling frisky and want simple selects at the cost of the additional data maintenance required by storing usage counts in the tag table, or storing tag names which were used in the item table itself to avoid hitting the link table and tag table for each item, which is useful for displaying multiple items with all their tags and for simple tag versioning... if you're into that sort of thing ;)

SQL database schema design tags

Given your constraints, yes, this design is correct.

This design also requires that each tag can belong to only a single category. No many-to-many between category and tag.

You cannot add tags to a beer that does not belong to the category those tags are associated with.

This rule must be enforced through app logic. Nothing in the database design prevents a beer being assigned to tags of categories not assigned that beer. Not a flaw in your design, just the way it is. A relational database design cannot itself enforce every kind of rule or constraint.

Given how vague the idea of user, tag, and category is in your brief description, there is no further advice to be given or further thoughts to consider as we cannot understand the business problem/context.

Database design for tags or tagging

I don't understand why this is not an efficient solution.

It is inefficient because you have to retrieve and break/search that string for every query.

When you do something like (as mentioned in your links) Three tables (one for storing all items, one for all tags, and one for the relation between the two) then you can use the real power of a relational database, the index.

Instead of breaking each string into a tag or set of tags... that's already done; you just get the ones you want. So, if you're searching for "shoes" then it goes straight there (using the index probably log n or faster) and returns both Nike and GAP. It will do this no matter how many tags you have, no matter how many companies you have.

With the 3-table system you do all of the hard work up front and then just do lookups.

If you intend to run this locally or with a limited number of users your solution may be fine. It is also easier to code.

Once your queries start taking more than a few seconds you'll probably want to update your tagging system. If you do it this way, write the search code separately in case you need to rip it out.

Question from comment:

Can you give an example of a 3 table system that is normalized with
atomicity

Sure.

You've basically asked for Third Normal Form which is my usual goal.
(I admit I often don't make 3NF because I optimize; e.g. storing a postal code with the addres - if you're out of school, that's a better choice)

--Sample SQL stackoverflow.com/questions/50793168/database-design-for-tags-or-tagging/50818392
IF NOT EXISTS (SELECT * FROM sys.schemas WHERE name = N'ChrisC') 
BEGIN
 EXEC sys.sp_executesql N'CREATE SCHEMA [ChrisC] AUTHORIZATION [dbo]'
 IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[ChrisC].[Brands]') AND type in (N'U'))
     AND NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[ChrisC].[BrandTags]') AND type in (N'U'))
     AND NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[ChrisC].[Tags]') AND type in (N'U'))
 BEGIN
  CREATE TABLE [ChrisC].[Brands]([pkBrand] [int] IDENTITY(101,1) NOT NULL,[Name] [varchar](40) NULL) ON [PRIMARY]
  INSERT INTO [ChrisC].[Brands]([Name])VALUES('Nike'),('GAP')
  CREATE TABLE [ChrisC].[BrandTags]([pk] [int] IDENTITY(1,1) NOT NULL,[Brand] [int] NULL,[Tag] [int] NULL) ON [PRIMARY]
  INSERT INTO [ChrisC].[BrandTags]([Brand],[Tag])VALUES
    (101,201),(101,202),(101,203),(101,204),(101,205),(101,206),(101,207),
    (102,208),(102,209),(102,203),(102,207),(102,210)
  CREATE TABLE [ChrisC].[Tags]([pkTag] [int] IDENTITY(201,1) NOT NULL,[Tag] [varchar](40) NULL) ON [PRIMARY]
  INSERT INTO [ChrisC].[Tags]([Tag])VALUES
    ('bags'),('football'),('shoes'),('soccer'),('sports'),('track-pants'),('t-shirts'),('jeans'),('perfumes'),('wallets')
  SELECT b.[Name], t.Tag 
  FROM chrisc.Brands b 
  LEFT JOIN chrisc.BrandTags bt ON pkBrand = Brand
  LEFT JOIN chrisc.Tags t ON bt.Tag = t.pkTag
  WHERE b.[Name] = 'Nike'
  -- Stop execution here to see the tables with data
  DROP TABLE [ChrisC].[Brands]
  DROP TABLE [ChrisC].[BrandTags]
  DROP TABLE [ChrisC].[Tags]
 END
 IF  EXISTS (SELECT * FROM sys.schemas WHERE name = N'ChrisC') DROP SCHEMA [ChrisC]
END

create tagging schema

There is a data mining tool called Weka, You can try downloading it. There are many weka videos in youtube. You can also google it there is lot of information available.

Database Schema for categorized tags

What you describe is a many-to-many (M:M) between tables Companies and Tags. The solution is to create another table Company-Tags containing FK to Compaines and Tags and any additional columns that are specific to that combination, if any.

create table company_tags(
             cmpy_id    integer   
                        references companies(cmpy_id) 
           , tag_id     integer 
                        references tags(tag_id) 
           , constraint company_tag_id 
                        primary key(cmpy_id, tag_id)
           );

That gives you the specific tags a company has selected and what companies have selected specific tags.

How to Design a Database Schema to Support Tagging with Categories