Bug in Implemented Tagging System

How to implement tag system

I believe you'll find interesting this blog post: Tags: Database schemas

The Problem: You want to have a database schema where you can tag a
bookmark (or a blog post or whatever) with as many tags as you want.
Later then, you want to run queries to constrain the bookmarks to a
union or intersection of tags. You also want to exclude (say: minus)
some tags from the search result.

“MySQLicious” solution

In this solution, the schema has got just one table, it is denormalized. This type is called “MySQLicious solution” because MySQLicious imports del.icio.us data into a table with this structure.

Sample ImageSample Image

Intersection (AND)
Query for “search+webservice+semweb”:

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags LIKE "%semweb%"

Union (OR)
Query for “search|webservice|semweb”:

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
OR tags LIKE "%webservice%"
OR tags LIKE "%semweb%"

Minus
Query for “search+webservice-semweb”

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags NOT LIKE "%semweb%"

“Scuttle” solution

Scuttle organizes its data in two tables. That table “scCategories” is the “tag”-table and has got a foreign key to the “bookmark”-table.

Sample Image

Intersection (AND)
Query for “bookmark+webservice+semweb”:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3

First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).

Union (OR)
Query for “bookmark|webservice|semweb”:

Just leave out the HAVING clause and you have union:

SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

Minus (Exclusion)
Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.


“Toxi” solution

Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB-schema is also used by wordpress.
The queries are quite the same as in the “scuttle” solution.

Sample Image

Intersection (AND)
Query for “bookmark+webservice+semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3

Union (OR)
Query for “bookmark|webservice|semweb”

SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id

Minus (Exclusion)
Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.

SELECT b. *
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id
AND (t.name IN ('Programming', 'Algorithms'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'Python')
GROUP BY b.id
HAVING COUNT( b.id ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.

How to implement tagging system similar to SO in php/mysql?

Before we go into premature optimization mode, it may be useful to look into the following query template. If nothing else this could be used as a baseline against which the effectiveness of possible optimizations can be measured.

SELECT T.Tagid, TagInfo.TagName,  COUNT(*)
FROM Items I
JOIN Tags TagInfo ON TagInfo.TagId = T.TagId
JOIN ItemTagMap T ON I.ItemId = T.ItemId
--JOIN ItemTagMap T1 ON I.ItemId = T1.ItemId
WHERE I.ItemId IN
(
SELECT ItemId
FROM Items
WHERE -- Some typical initial search criteria
Title LIKE 'Bug Report%' -- Or some fulltext filter instead...
AND ItemDate > '02/22/2008'
AND Status = 'C'
)
--AND T1.TagId = 'MySql'
GROUP BY T.TagId, TagInfo.TagName
ORDER BY COUNT(*) DESC

The subquery is the "driving query", i.e. the one corresponding to the end-user's initial criteria. (see below for details on how this query, required multiple times may fit in an overall optimized flow)
Commented is the JOIN on T1 (and possibly T2, T3, when several tags are selected), and, with the WHERE clause, the associated criteria. These are needed when the user selects a particular tag, whether as part of the initial search or by refinement. (It may be more efficient to place these joins and where clauses within the sub-query; more on these below)

Discussion...
The "driving query", or a variation thereof is needed for two distinct purposes:

  • 1 to provide the complete list of ItemId which is needed to enumerate all associated tags.

  • 2 to provide the first N ItemId values (N being the display page size), for the purpose of looking up Item detail info in the Item table.

Note that the complete list doesn't need to be sorted (or it may benefit from sorting in a different order), whereby the second list needs to be sorted based on the user's choice (say by Date, descending or by Title, alphabetically ascending). Also note that if there is any sort order required, the cost of the query will imply dealing with the complete list (shy of odd optimization by SQL itself, and/or some denormalization, SQL needs to "see" the last records on that list, in case they belong to the top, sort-wise).

This latter fact, is in favor of having the very same query for both purposes, the corresponding list can be stored in a temporary table. The general flow would be to quickly lookup the top N Item records with their details and returns this to the application at once. The application can then obtain ajax-fashion the list of Tags for refinements. This list would be produce with a query akin the one above, where the subquery is replaced by a "select * from temporaryTable." The odds are good that the SQL optimizer will decide to sort this list (in some cases), let's let it do that, rather than second guessing it and sorting it explicitly.

One other point to consider is to maybe bring the join(s) on ItemTagMap table inside the "driving query" rather that as shown above. It is probably best to do so, both for performance, and because it will produce the right list for the #2 purpose (display of a page of items).

The query/flow described above will likely scale rather well, even on relatively modest hardware; tentatively into the 1/2 Million+ Items, with sustained user searches maybe up to 10 per second. One of the key factor would be the selectivity of the initial search criteria.

Optimization ideas

  • [Depending on the typical search cases and on the data stats] it may make sense to denormalize by bringing (indeed duplicating) some of Items' fields to the ItemTagMap table. Short fields in particular may be 'welcome' there.
  • As the data grows in the million+ Items, we could exploit the typically strong correlation of some tags (ex: in SO, PHP often comes with MySql, btw often for no good reason...), with various tricks. For example the introduction of "multi-Tag" TagIds could render the input logic a bit more complicated, but could also reduce the Map size significantly.


-- 'nough said! --

Appropriate architecture and optimizations should be selected in light of the actual requirements and of the effective data statistical profile...

Correct formulation of the A* algorithm

The first approach is optimal only if the optimal path to any repeated state is always the first to be followed. This property holds if the heuristic function has the property of consistency (also called monoticity). A heuristic function is consistent if, for every node n and every successor n' of n, the estimated cost of reaching the goal from n is no greater than the step cost of getting to n' from n plus the estimated cost of reaching the goal from n.

The second approach is optimal if the heuristic function is merely admissible, that is, it never overestimates the cost to reach the goal.

Every consistent heuristic function is also admissible. Although consistency is a stricter requirement than admissibility, one has to work quite hard to concoct heuristic functions that are admissible but not consistent.

Thus, even though the second approach is more general as it works with a strictly larger subset of heuristic functions, the first approach is usually sufficient in practice.

Reference: the subsection A* search: Minimizing the total estimated solution cost in section 4.1 Informed (Heuristic) Search Strategies of the book Artificial Intelligence: A Modern Approach.

How would you reproduce a tagging system like the one StackOverflow uses?

You're missing a key ingredient of how StackOverflow does its search. SO requires that the user delineate the tags in the search string by explicitly putting brackets around the tags. The (probably simplified) logic would then be.

  1. Extract marked tags using regex to detect contents inside brackets
  2. Using list of most common tags, scan string for unmarked tags and extract them.
  3. Remove tag meta characters
  4. Perform full-text search, filtered by tags

Proper usage of Tags in SCM

From an SCM agnostic point of view, a tag is very different from a revision.

Both may be implemented in the same way, both represents a "time line", but their goal is different:

  • a tag represent an immutable state where all files are referenced by a unique id. It is a name representing many things but mainly a stable state, ...)
  • a revision represent a commit transaction (not all SCM have those, especially the old ones with a 'file-by-file approach'). All commits do not represent a "stable" state (as in "compile" or "execute" successfully). They are just a new element of the global history.

The problem with SVN is that revision, tag and branches are all implemented the same.

But I would still prefer the option where a tag is used as a "read-only" branch.



Related Topics



Leave a reply



Submit