Is There an Agreed Ideal Schema for Tagging

Is there an agreed ideal schema for tagging

There are various schemas which are effective, each with their own performance implications for the common queries you'll need as the number of tagged items grows:

  • http://howto.philippkeller.com/2005/04/24/Tags-Database-schemas/
  • http://howto.philippkeller.com/2005/06/19/Tagsystems-performance-tests/

Personally, I like having a tag table and a link table which associates tags with items, as it's denormalized (no duplication of tag names) and I can store additional information in the link table (such as when the item was tagged) when necessary.

You can also add some denormalised data if you're feeling frisky and want simple selects at the cost of the additional data maintenance required by storing usage counts in the tag table, or storing tag names which were used in the item table itself to avoid hitting the link table and tag table for each item, which is useful for displaying multiple items with all their tags and for simple tag versioning... if you're into that sort of thing ;)

Recommended SQL database design for tags or tagging

Three tables (one for storing all items, one for all tags, and one for the relation between the two), properly indexed, with foreign keys set running on a proper database, should work well and scale properly.

Table: Item
Columns: ItemID, Title, Content

Table: Tag
Columns: TagID, Title

Table: ItemTag
Columns: ItemID, TagID

How to design a database schema to support tagging with categories?

This is yet another variation on the Entity-Attribute-Value design.

A more recognizable EAV table looks like the following:

CREATE TABLE vehicleEAV (
vid INTEGER,
attr_name VARCHAR(20),
attr_value VARCHAR(100),
PRIMARY KEY (vid, attr_name),
FOREIGN KEY (vid) REFERENCES vehicles (vid)
);

Some people force attr_name to reference a lookup table of predefined attribute names, to limit the chaos.

What you've done is simply spread an EAV table over three tables, but without improving the order of your metadata:

CREATE TABLE vehicleTag (
vid INTEGER,
cid INTEGER,
tid INTEGER,
PRIMARY KEY (vid, cid),
FOREIGN KEY (vid) REFERENCES vehicles(vid),
FOREIGN KEY (cid) REFERENCES categories(cid),
FOREIGN KEY (tid) REFERENCES tags(tid)
);

CREATE TABLE categories (
cid INTEGER PRIMARY KEY,
category VARCHAR(20) -- "attr_name"
);

CREATE TABLE tags (
tid INTEGER PRIMARY KEY,
tag VARCHAR(100) -- "attr_value"
);

If you're going to use the EAV design, you only need the vehicleTags and categories tables.

CREATE TABLE vehicleTag (
vid INTEGER,
cid INTEGER, -- reference to "attr_name" lookup table
tag VARCHAR(100, -- "attr_value"
PRIMARY KEY (vid, cid),
FOREIGN KEY (vid) REFERENCES vehicles(vid),
FOREIGN KEY (cid) REFERENCES categories(cid)
);

But keep in mind that you're mixing data with metadata. You lose the ability to apply certain constraints to your data model.

  • How can you make one of the categories mandatory (a conventional column uses a NOT NULL constraint)?
  • How can you use SQL data types to validate some of your tag values? You can't, because you're using a long string for every tag value. Is this string long enough for every tag you'll need in the future? You can't tell.
  • How can you constrain some of your tags to a set of permitted values (a conventional table uses a foreign key to a lookup table)? This is your "softtop" vs. "soft top" example. But you can't make a constraint on the tag column because that constraint would apply to all other tag values for other categories. You'd effectively restrict engine size and paint color to "soft top" as well.

SQL databases don't work well with this model. It's extremely difficult to get right, and querying it becomes very complex. If you do continue to use SQL, you will be better off modeling the tables conventionally, with one column per attribute. If you have need to have "subtypes" then define a subordinate table per subtype (Class-Table Inheritance), or else use Single-Table Inheritance. If you have an unlimited variation in the attributes per entity, then use Serialized LOB.

Another technology that is designed for these kinds of fluid, non-relational data models is a Semantic Database, storing data in RDF and queried with SPARQL. One free solution is RDF4J (formerly Sesame).

What is the most efficient way to store tags in a database?

One item is going to have many tags. And one tag will belong to many items. This implies to me that you'll quite possibly need an intermediary table to overcome the many-to-many obstacle.

Something like:

Table: Items

Columns: Item_ID, Item_Title, Content

Table: Tags

Columns: Tag_ID, Tag_Title

Table: Items_Tags

Columns: Item_ID, Tag_ID

It might be that your web app is very very popular and need de-normalizing down the road, but it's pointless muddying the waters too early.

Database Design for Tagging

About ANDing: It sounds like you are looking for the "relational division" operation. This article covers relational division in concise and yet comprehendible way.

About performance: A bitmap-based approach intuitively sounds like it will suit the situation well. However, I'm not convinced it's a good idea to implement bitmap indexing "manually", like digiguru suggests: It sounds like a complicated situation whenever new tags are added(?) But some DBMSes (including Oracle) offer bitmap indexes which may somehow be of use, because a built-in indexing system does away with the potential complexity of index maintenance; additionally, a DBMS offering bitmap indexes should be able to consider them in a proper when when performing the query plan.

Database design for tags or tagging

I don't understand why this is not an efficient solution.

It is inefficient because you have to retrieve and break/search that string for every query.

When you do something like (as mentioned in your links) Three tables (one for storing all items, one for all tags, and one for the relation between the two) then you can use the real power of a relational database, the index.

Instead of breaking each string into a tag or set of tags... that's already done; you just get the ones you want. So, if you're searching for "shoes" then it goes straight there (using the index probably log n or faster) and returns both Nike and GAP. It will do this no matter how many tags you have, no matter how many companies you have.

With the 3-table system you do all of the hard work up front and then just do lookups.

If you intend to run this locally or with a limited number of users your solution may be fine. It is also easier to code.

Once your queries start taking more than a few seconds you'll probably want to update your tagging system. If you do it this way, write the search code separately in case you need to rip it out.


Question from comment:

Can you give an example of a 3 table system that is normalized with
atomicity

Sure.

You've basically asked for Third Normal Form which is my usual goal.
(I admit I often don't make 3NF because I optimize; e.g. storing a postal code with the addres - if you're out of school, that's a better choice)

--Sample SQL stackoverflow.com/questions/50793168/database-design-for-tags-or-tagging/50818392
IF NOT EXISTS (SELECT * FROM sys.schemas WHERE name = N'ChrisC')
BEGIN
EXEC sys.sp_executesql N'CREATE SCHEMA [ChrisC] AUTHORIZATION [dbo]'
IF NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[ChrisC].[Brands]') AND type in (N'U'))
AND NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[ChrisC].[BrandTags]') AND type in (N'U'))
AND NOT EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[ChrisC].[Tags]') AND type in (N'U'))
BEGIN
CREATE TABLE [ChrisC].[Brands]([pkBrand] [int] IDENTITY(101,1) NOT NULL,[Name] [varchar](40) NULL) ON [PRIMARY]
INSERT INTO [ChrisC].[Brands]([Name])VALUES('Nike'),('GAP')
CREATE TABLE [ChrisC].[BrandTags]([pk] [int] IDENTITY(1,1) NOT NULL,[Brand] [int] NULL,[Tag] [int] NULL) ON [PRIMARY]
INSERT INTO [ChrisC].[BrandTags]([Brand],[Tag])VALUES
(101,201),(101,202),(101,203),(101,204),(101,205),(101,206),(101,207),
(102,208),(102,209),(102,203),(102,207),(102,210)
CREATE TABLE [ChrisC].[Tags]([pkTag] [int] IDENTITY(201,1) NOT NULL,[Tag] [varchar](40) NULL) ON [PRIMARY]
INSERT INTO [ChrisC].[Tags]([Tag])VALUES
('bags'),('football'),('shoes'),('soccer'),('sports'),('track-pants'),('t-shirts'),('jeans'),('perfumes'),('wallets')
SELECT b.[Name], t.Tag
FROM chrisc.Brands b
LEFT JOIN chrisc.BrandTags bt ON pkBrand = Brand
LEFT JOIN chrisc.Tags t ON bt.Tag = t.pkTag
WHERE b.[Name] = 'Nike'
-- Stop execution here to see the tables with data
DROP TABLE [ChrisC].[Brands]
DROP TABLE [ChrisC].[BrandTags]
DROP TABLE [ChrisC].[Tags]
END
IF EXISTS (SELECT * FROM sys.schemas WHERE name = N'ChrisC') DROP SCHEMA [ChrisC]
END

Modifying tags in a web app. Best practice

Thinking it through again, I suppose I will have a function which creates the tag string for display on the update form.

If, on update, I were to call that function again, and compare the string, at least I'd be able to filter out where the tags have not been changed at all with a single db call.

Then, only if the tags are different, I could do the update.

Could be quite efficient?

Best database schema for tagging system with a low and constant number tags

Before you choose any denormalized design, you must have a solid idea of the queries you're going to run against the data. Denormalized designs optimize for a certain subset of queries, at the expense of other queries.

For example, do you need to search for specific tags? If so, you need some way to find tags via an index. Do not use LIKE '%word%' queries to search for substrings; it will run hundreds or thousands of times slower than a query that uses an index.

Do you need to count occurrences of tags? If so, you need a solution that restricts duplicates. The only design you listed that prevents duplicates is Toxi.

I wouldn't use any solution that requires the use of MyISAM FULLTEXT indexes. MyISAM is too fragile and susceptible to data corruption. Don't store any data in MyISAM that you're not prepared to regenerate or restore from backup regularly.

Scalable Database Tagging Schema

Bill I think I kind of threw you off, the notes are just in another table and there is a separate table with notes posted by different people. Posts have notes and tags, but tags also have notes, which is why tags are UNIQUE.

Jonathan is right about linked lists, I wont use them at all. I decided to implement the tags in the simplest normalized way that meats my needs:

DROP TABLE IF EXISTS `tags`;
CREATE TABLE IF NOT EXISTS `tags` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `name` (`name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

DROP TABLE IF EXISTS `posts`;
CREATE TABLE IF NOT EXISTS `posts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`name` varchar(255) NOT NULL,
`content` TEXT NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

DROP TABLE IF EXISTS `posts_notes`;
CREATE TABLE IF NOT EXISTS `posts_notes` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`date` int(10) unsigned NOT NULL,
`postId` int(10) unsigned NOT NULL,
`note` TEXT NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

DROP TABLE IF EXISTS `posts_tags`;
CREATE TABLE IF NOT EXISTS `posts_tags` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`owner` int(10) unsigned NOT NULL,
`tagId` int(10) unsigned NOT NULL,
`postId` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
FOREIGN KEY (`postId`) REFERENCES posts(`id`) ON DELETE CASCADE,
FOREIGN KEY (`tagId`) REFERENCES tags(`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=1 ;

I'm not sure how fast this will be in the future, but it should be fine for a while as only a couple people use the database.

How to efficiently store tags for items in relational database?

Generally speaking, is making all attributes in an entity a concat PK
possible, or is it a bad practice? A product may have many tags and
how do I store these tags for each product?

In the most typical use case for tags, a tag may apply to more than one record. For the most normalized design, you've nearly arrived at the best design:

  • A tag may be used on (may refer to) several products
  • A product may have several tags

This suggests a many-to-many relationship between Tags and Products. This can be readily implemented with a bridge table. An advantage of this design is that the design of each Tag (and each Product) record does not incorporate a foreign key at all and is therefore simpler. The records of the bridge table describe the relationships between Tags and Products.



Related Topics



Leave a reply



Submit