When Should You Use Full-Text Indexing

When should you use full-text indexing?

It depends on your DBMS. I believe most systems will not take advantage of a full-text index unless you use the full-text functions (e.g. MATCH/AGAINST in MySQL, or FREETEXT/CONTAINS in MS SQL).

Here are two good articles on when, why, and how to use full-text indexing in SQL Server:

  1. How To Use SQL Server Full-Text Searching
  2. Solving Complex SQL Problems with Full-Text Indexing

What is a fulltext index and when should I use it?

In databases, indices are usually used to enhance performance when looking for something defined in your WHERE clause. However, when it comes to filtering text, e.g. using something like WHERE TextColumn LIKE '%searchstring%', searches are slow, because regular database indices are optimized for matches against the 'whole content' of a column and not just a part of it. Specifically, a LIKE search with a leading wildcard generally cannot make use of a regular index.

As mentioned in the comment below, MySQL needs the MATCH () ... AGAINST syntax to search within a fulltext index. This varies by database vendor: in MS SQL you can use CONTAINS, so keep this in mind if you plan to support other databases too.

Fulltext indices work better for regular text, because they are optimized for this type of column. Very simplified: they split the text into words and build an index over the words rather than the whole text. This works a lot faster for text searches when looking for specific words.

SQL full text search vs LIKE

Full text search is likely to be quicker, since it benefits from an index of words that it uses to look up the records, whereas LIKE is going to need a full table scan.

In some cases LIKE will be more accurate, since LIKE "%The%" AND LIKE "%Matrix" will pick out "The Matrix" but not "Matrix Reloaded", whereas full text search will ignore "The" and return both. That said, returning both would likely have been the better result.
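The difference described above can be imitated in a few lines of Python. This is only a toy sketch: the stopword list, the sample titles, and the helper names are assumptions made for illustration, and the substring check is case-sensitive, which may not match your database's collation.

```python
# Toy comparison of a LIKE-style substring filter and a word-based search.
# STOPWORDS and the titles are illustrative assumptions, not engine defaults.
STOPWORDS = {"the", "a", "an"}

titles = ["The Matrix", "Matrix Reloaded", "The Godfather"]

def like_filter(title):
    # Emulates: title LIKE '%The%' AND title LIKE '%Matrix'
    return "The" in title and title.endswith("Matrix")

def word_search(title, query):
    # Emulates a full-text lookup: lowercase, split into words, drop
    # stopwords, then require every remaining query word in the title.
    title_words = {w for w in title.lower().split() if w not in STOPWORDS}
    query_words = {w for w in query.lower().split() if w not in STOPWORDS}
    return query_words <= title_words

like_hits = [t for t in titles if like_filter(t)]
fts_hits = [t for t in titles if word_search(t, "The Matrix")]

print(like_hits)  # ['The Matrix']
print(fts_hits)   # ['The Matrix', 'Matrix Reloaded']
```

Because "the" is dropped as a stopword, the word-based search is really a search for "matrix" alone, which is why both titles come back.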

What is Full Text Search vs LIKE

In general, there is a tradeoff between "precision" and "recall". High precision means that fewer irrelevant results are presented (no false positives), while high recall means that fewer relevant results are missing (no false negatives). Using the LIKE operator gives you 100% precision with no concessions for recall. A full text search facility gives you a lot of flexibility to tune down the precision for better recall.
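To make the two terms concrete, here is a small sketch that computes both for a result set. The record IDs and the notion of which records are "relevant" are made up for the example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned record IDs."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    # Convention assumed here: an empty set counts as a perfect score.
    precision = true_positives / len(returned) if returned else 1.0
    recall = true_positives / len(relevant) if relevant else 1.0
    return precision, recall

# Hypothetical data: records 1-4 are what the user actually wanted.
relevant = {1, 2, 3, 4}

# A strict match returns only exact hits: nothing irrelevant, but half missing.
strict = precision_recall({1, 2}, relevant)

# A looser match finds everything relevant, plus one irrelevant record (9).
loose = precision_recall({1, 2, 3, 4, 9}, relevant)

print(strict)  # (1.0, 0.5)
print(loose)   # (0.8, 1.0)
```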

Most full text search implementations use an "inverted index". This is an index where the keys are individual terms, and the associated values are sets of records that contain the term. Full text search is optimized to compute the intersection, union, etc. of these record sets, and usually provides a ranking algorithm to quantify how strongly a given record matches search keywords.
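A minimal sketch of such an inverted index, using plain Python sets. The documents are invented, and tokenization here is just lowercasing and splitting on whitespace, which is far cruder than what a real engine does.

```python
from collections import defaultdict

# Hypothetical documents keyed by record ID.
docs = {
    1: "peach pie recipe",
    2: "apple pie with cinnamon",
    3: "peach cobbler",
}

def build_inverted_index(documents):
    """Map each term to the set of record IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index(docs)

# "peach AND pie" is a set intersection; "peach OR pie" is a union.
both = index["peach"] & index["pie"]
either = index["peach"] | index["pie"]

print(both)    # {1}
print(either)  # {1, 2, 3}
```

Each term lookup is a single dictionary access, so the cost of a query depends on the size of the matching sets rather than on the number of rows in the table.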

The SQL LIKE operator can be extremely inefficient. If you apply it to an un-indexed column, a full scan will be used to find matches (just like any query on an un-indexed field). If the column is indexed, matching can be performed against index keys, but with far less efficiency than most index lookups. In the worst case, the LIKE pattern will have leading wildcards that require every index key to be examined. In contrast, many information retrieval systems can enable support for leading wildcards by pre-compiling suffix trees in selected fields.
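The asymmetry between trailing and leading wildcards can be sketched with a sorted list standing in for an ordered index. This is a simplification made for illustration, not how any particular DBMS implements its indexes, and the keys and function names are invented.

```python
import bisect

# A sorted list of keys stands in for an ordered (B-tree style) index.
keys = sorted(["apple", "apricot", "banana", "barley", "cherry", "crabapple"])

def prefix_lookup(sorted_keys, prefix):
    """Emulates LIKE 'prefix%': binary-search to the first candidate,
    then walk forward only while keys still share the prefix."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches

def suffix_scan(sorted_keys, suffix):
    """Emulates LIKE '%suffix': the sort order gives no starting point,
    so every key has to be examined."""
    return [key for key in sorted_keys if key.endswith(suffix)]

print(prefix_lookup(keys, "ap"))   # ['apple', 'apricot']
print(suffix_scan(keys, "apple"))  # ['apple', 'crabapple']
```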

Other features typical of full-text search are:

  • lexical analysis or tokenization: breaking a block of unstructured text into individual words, phrases, and special tokens
  • morphological analysis, or stemming: collapsing variations of a given word into one index term; for example, treating "mice" and "mouse", or "electrification" and "electric", as the same word
  • ranking: measuring the similarity of a matching record to the query string
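The first two features in the list can be sketched as follows. The suffix rules and the irregular-form table are deliberately crude assumptions made for this example; real engines rely on per-language dictionaries and much more careful algorithms, and this toy stemmer will mangle plenty of ordinary words.

```python
import re

# Hypothetical irregular forms; real engines ship per-language dictionaries.
IRREGULAR = {"mice": "mouse", "ran": "run"}
SUFFIXES = ("ing", "ed", "s")

def tokenize(text):
    """Break text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(word):
    """Collapse a word to a crude stem (toy rules, English only)."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word

terms = [stem(t) for t in tokenize("The mice were running; a mouse runs.")]
print(terms)  # ['the', 'mouse', 'were', 'run', 'a', 'mouse', 'run']
```

After this step, "mice" and "mouse" share one index term, as do "running" and "runs", so a query for either form finds the same records.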

Fulltext search vs standard database search

There are a few advantages to full text searching.

Indexing:

A predicate like:

WHERE Foo LIKE '%Bar';

cannot take advantage of an index. It has to look at every single row, and see if it matches. A fulltext index, however, can. In fact, fulltext indexes can offer a lot more flexibility in terms of the order of matching words, how close those words are together, etc.

Stemming:

A fulltext search can stem words. If you search for run, you can get results for "ran" or "running". Most fulltext engines have stem dictionaries in a variety of languages.

Weighted Results:

A fulltext index can encompass multiple columns. For example, you can search for "peach pie", and the index can include a title, keywords, and a body. Results that match the title can be weighted higher, as more relevant, and can be sorted to show near the top.
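As a rough illustration of column weighting, the sketch below scores records by counting query-term hits per column and multiplying by a per-column weight. The weights, the records, and the scoring formula are all assumptions for the example; actual engines use their own ranking functions.

```python
# Hypothetical per-column weights; real engines expose their own ranking knobs.
WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}

records = [
    {"id": 1, "title": "Peach pie", "keywords": "dessert baking",
     "body": "A classic summer pie."},
    {"id": 2, "title": "Fruit desserts", "keywords": "peach pie",
     "body": "Cobblers and more."},
    {"id": 3, "title": "Summer baking", "keywords": "dessert",
     "body": "Try a peach pie this year."},
]

def score(record, query):
    """Sum weighted counts of query terms across the indexed columns."""
    terms = query.lower().split()
    total = 0.0
    for column, weight in WEIGHTS.items():
        words = record[column].lower().replace(".", "").split()
        total += weight * sum(words.count(term) for term in terms)
    return total

def search(query):
    """Return IDs of matching records, highest score first."""
    scored = [(score(r, query), r["id"]) for r in records]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc_id for s, doc_id in ranked if s > 0]

print(search("peach pie"))  # [1, 2, 3]
```

Record 1 ranks first because both terms appear in its title, even though all three records mention "peach pie" somewhere.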

Disadvantages:

A fulltext index can potentially be huge, many times larger than a standard B-TREE index. For this reason, many hosted providers who offer database instances disable this feature, or at least charge extra for it. For example, last I checked, Windows Azure did not support fulltext queries.

Fulltext indexes can also be slower to update. If the data changes a lot, there might be some lag updating indexes compared to standard indexes.

Why would I bother using full text search?

I think you have answered your own question, at least to your own satisfaction. If your prototyping produces results in an acceptable amount of time, and you are certain that caching does not explain the quick response (per Paul Sasik), by all means skip the overhead of full-text indexing and proceed with LIKE.

Why is SQL Server Full Text Search indexing SCR or SUR acronym followed by a number, together?

Finally, I was able to determine that the issue is related to currency symbols (apparently SUR and SCR are currency symbols): when one is followed or preceded by a number, the two are indexed together.

In my opinion, this might be desired behaviour only if the user expects past (SUR - Soviet Ruble, not in use since 1993) or current (SCR - Seychelles Rupee) currencies to be present in the text, and only if the currency symbol follows or precedes the number according to standards (for example, $ precedes the number; SCR or € follows it).

Moreover, currency symbols seem to partially affect the neutral-language word breaker. Past currencies like SUR are fine, but current currencies affecting language-neutral word breaking is entirely unexpected behaviour, considering that language-neutral text processing should not be affected by any dictionary words.

Microsoft's documentation of FTS text processing in SQL Server 2012 and up explains the relevant changes to the word breaker, showing that the new word breaker indexes neither the currency symbol nor the number separately, even with the language-neutral word breaker:

term            previous    new
100$            100$        100$
100$            nn100       nn100usd
$100 000 USD    $100        $100 000 usd
$100 000 USD    000
$100 000 USD    nn000
$100 000 USD    nn100$
$100 000 USD    usd

Full-text indexing sluggish. Looking for alternatives

Several suggestions, based around the fact that you have only 6000 rows, so the database should eat this alive.

A. Try using the LIKE operator, just in case it helps. I'm not expecting it to, but it's pretty trivial to try. Something else must be going on for this to feel slow at such small volumes.

B. Can you cache queries in advance? With 6000 rows, there are probably only 36*36 combinations of 2-character queries, which should take virtually no memory and save the database any work.

C. Moving the selection out to the client is a good idea, depending on how big the 6000 rows are overall versus the network latency of individual lookups.

D. Combining B and C will give you really good performance, I suspect, but with some coding effort required. If the server maintains a list of all single-character results in cache, and clients download the letter cache set after the first keystroke, then they potentially have a subset of all rows, but won't need to do more network IO for additional keystrokes.
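A rough sketch of the caching idea in B and D, assuming the lookups are prefix matches on a single text value per row (the original question may have needed something different). The sample rows, the cache layout, and the function names are all hypothetical.

```python
from collections import defaultdict

MAX_PREFIX = 2  # cache every 1- and 2-character prefix

# Stand-in for the ~6000 rows; the names are illustrative only.
rows = ["Alice", "Alan", "Albert", "Bob", "Bonnie", "Carol"]

def build_prefix_cache(values, max_len=MAX_PREFIX):
    """Precompute the matching rows for every short prefix."""
    cache = defaultdict(list)
    for value in values:
        key = value.lower()
        for n in range(1, min(max_len, len(key)) + 1):
            cache[key[:n]].append(value)
    return cache

cache = build_prefix_cache(rows)

def lookup(query):
    """Answer short queries from the cache; narrow longer ones locally."""
    q = query.lower()
    if len(q) <= MAX_PREFIX:
        return cache.get(q, [])
    # Longer queries: filter the cached subset instead of hitting the database.
    return [v for v in cache.get(q[:MAX_PREFIX], []) if v.lower().startswith(q)]

print(lookup("al"))   # ['Alice', 'Alan', 'Albert']
print(lookup("alb"))  # ['Albert']
```

In the client-side variant described in D, the client would download the list for the first character once and run the same filtering locally on each further keystroke.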


