SQL Full Text Search VS "Like"

What is Full Text Search vs LIKE

In general, there is a tradeoff between "precision" and "recall". High precision means that fewer irrelevant results are presented (no false positives), while high recall means that fewer relevant results are missing (no false negatives). Using the LIKE operator gives you 100% precision with no concessions for recall. A full text search facility gives you a lot of flexibility to tune down the precision for better recall.

Most full text search implementations use an "inverted index". This is an index where the keys are individual terms, and the associated values are sets of records that contain the term. Full text search is optimized to compute the intersection, union, etc. of these record sets, and usually provides a ranking algorithm to quantify how strongly a given record matches search keywords.
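To make the inverted index idea concrete, here is a minimal sketch in Python. It is not how any particular database implements it; the function names (`build_inverted_index`, `search_all`, `search_any`) and the sample records are made up for illustration, and tokenization is just whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, terms):
    """Records containing every term (intersection of the record sets)."""
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_any(index, terms):
    """Records containing at least one term (union of the record sets)."""
    return set().union(*(index.get(t.lower(), set()) for t in terms))

# Hypothetical sample records keyed by record id.
docs = {
    1: "full text search uses an inverted index",
    2: "the like operator scans text",
    3: "an index speeds up search",
}
index = build_inverted_index(docs)
print(search_all(index, ["text", "search"]))  # {1}
print(search_any(index, ["like", "index"]))   # {1, 2, 3}
```

Each lookup touches only the record sets for the query terms, which is why this structure avoids reading every row.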

The SQL LIKE operator can be extremely inefficient. If you apply it to an un-indexed column, a full scan will be used to find matches (just like any query on an un-indexed field). If the column is indexed, matching can be performed against index keys, but with far less efficiency than most index lookups. In the worst case, the LIKE pattern will have leading wildcards that require every index key to be examined. In contrast, many information retrieval systems can enable support for leading wildcards by pre-compiling suffix trees in selected fields.

Other features typical of full-text search are:

  • lexical analysis or tokenization: breaking a block of unstructured text into individual words, phrases, and special tokens
  • morphological analysis, or stemming: collapsing variations of a given word into one index term; for example, treating "mice" and "mouse", or "electrification" and "electric", as the same word
  • ranking: measuring the similarity of a matching record to the query string
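The three features above can be sketched together in a few lines of Python. This is a toy, assuming a regex tokenizer, a crude suffix-stripping stemmer with a tiny hand-written table of irregular forms, and a ranking score that simply counts matching terms. Real engines use per-language analyzers and more refined scoring; every name here is hypothetical.

```python
import re
from collections import Counter

# Toy table of irregular forms; real engines ship per-language dictionaries.
IRREGULAR = {"mice": "mouse", "ran": "run"}

def tokenize(text):
    """Lexical analysis: split unstructured text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def stem(word):
    """Crude stemming: irregular lookup, then strip one common suffix."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            base = word[: -len(suffix)]
            # Undo a doubled final consonant, e.g. "runn" -> "run".
            if len(base) > 2 and base[-1] == base[-2]:
                base = base[:-1]
            return base
    return word

def analyze(text):
    return [stem(tok) for tok in tokenize(text)]

def rank(query, docs):
    """Ranking: score each record by how often the query's stems occur in it."""
    q_terms = set(analyze(query))
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(analyze(text))
        score = sum(counts[t] for t in q_terms)
        if score:
            scores[doc_id] = score
    # Highest score first; ties broken by record id.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

docs = {
    1: "The mouse ran. Mice were running everywhere!",
    2: "A cat runs after one mouse.",
    3: "Nothing relevant here.",
}
print(rank("running mice", docs))  # [(1, 4), (2, 2)]
```

Because the same analysis runs on both the records and the query, "running mice" matches records that only contain "ran" or "mouse".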

SQL full text search vs LIKE

Full text search is likely to be quicker, since it looks records up through an index of words, whereas LIKE with a leading wildcard needs a full table scan.

In some cases LIKE will be more accurate, since LIKE '%The%' AND LIKE '%Matrix' will pick out "The Matrix" but not "Matrix Reloaded", whereas full text search will ignore "The" and return both. That said, returning both would likely have been the better result.
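The "Matrix" case can be reproduced with a small Python sketch. The LIKE half runs against an in-memory SQLite table. The keyword half is a hand-rolled stand-in for a full text engine, assuming a stop-word list that contains "the". The table name, the titles, and `keyword_search` are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?)",
    [("The Matrix",), ("Matrix Reloaded",), ("Toy Story",)],
)

# LIKE: the title must contain "The" and end with "Matrix".
like_hits = [row[0] for row in conn.execute(
    "SELECT title FROM movies "
    "WHERE title LIKE '%The%' AND title LIKE '%Matrix' ORDER BY title"
)]

# Assumed stop-word list; real engines ship their own per language.
STOP_WORDS = {"the", "a", "an", "of"}

def keyword_search(query, titles):
    """Match titles containing every non-stop-word term of the query."""
    terms = {w for w in query.lower().split() if w not in STOP_WORDS}
    return sorted(t for t in titles if terms <= set(t.lower().split()))

titles = [row[0] for row in conn.execute("SELECT title FROM movies")]
fts_hits = keyword_search("The Matrix", titles)

print(like_hits)  # ['The Matrix']
print(fts_hits)   # ['Matrix Reloaded', 'The Matrix']
```

LIKE returns the single exact match (higher precision), while the keyword search drops "The" and returns both Matrix titles (higher recall).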

MS SQL Full-text search vs. LIKE expression

I don't think you're going to get the performance you need out of MS SQL; you're going to need to construct very complex queries to cover all the data/tables that you're going to be searching, and you have the added encumbrance of writing data to the database at the same time as you are querying it.

I would suggest you look at either Apache Solr (http://lucene.apache.org/solr/) or Lucene (http://lucene.apache.org). Solr is built on top of Lucene, and both can be used to create an inverted file index, which is basically like the index in the back of a book (term 1 appears in documents 1, 3, 7, etc.). Solr is a search-engine-in-a-box, and has several mechanisms that will let you tell it how and where to index data. Lucene is lower-level, and will let you set up your indexing and searching architecture with more flexibility.

The good thing about Solr is that it's available as a web service, so if you're not familiar with Java, you can find a Solr client in the language of your choice, and write indexing and searching code in whatever language suits you. Here's a link to a list of client libraries for Solr, including some in C#: http://wiki.apache.org/solr/IntegratingSolr. That's where I'd start.

Performance of like '%Query%' vs full text search CONTAINS query

Full Text Searching (using CONTAINS) will be faster and more efficient than using LIKE with wildcarding. Full Text Searching (FTS) includes the ability to define Full Text Indexes, which FTS can use. I don't know why you wouldn't define an FTS index if you intended to use the functionality.

LIKE with wildcarding on the left side (e.g. LIKE '%Search') cannot use an index (assuming one exists for the column), guaranteeing a table scan. I haven't tested and compared, but regex has the same pitfall. To clarify: LIKE '%Search' and LIKE '%Search%' cannot use an index; LIKE 'Search%' can use an index.
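Why the position of the wildcard matters can be sketched in Python by treating a sorted list as a stand-in for an ordinary ordered index. This is a simplified model rather than any engine's actual B-tree code; `prefix_lookup`, `suffix_lookup`, and the sample keys are hypothetical.

```python
from bisect import bisect_left

def prefix_lookup(sorted_keys, prefix):
    """LIKE 'prefix%': binary-search to the first candidate, then read a contiguous run."""
    i = bisect_left(sorted_keys, prefix)
    matches, examined = [], 0
    while i < len(sorted_keys):
        examined += 1
        if not sorted_keys[i].startswith(prefix):
            break  # sorted order guarantees no later key can match
        matches.append(sorted_keys[i])
        i += 1
    return matches, examined

def suffix_lookup(sorted_keys, suffix):
    """LIKE '%suffix': sort order gives no starting point, so every key is examined."""
    matches = [k for k in sorted_keys if k.endswith(suffix)]
    return matches, len(sorted_keys)

keys = sorted(["research", "search", "searching", "seat",
               "select", "zebra", "apple", "banana"])

print(prefix_lookup(keys, "search"))  # (['search', 'searching'], 3)
print(suffix_lookup(keys, "search"))  # (['research', 'search'], 8)
```

The second element of each result is the number of keys read after the binary search: the prefix lookup stops at the first non-matching key, while the suffix lookup reads all eight.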

Fulltext search vs standard database search

There are a few advantages to full text searching.

Indexing:

Something like:

WHERE Foo LIKE '%Bar';

Cannot take advantage of an index. It has to look at every single row, and see if it matches. A fulltext index, however, can. In fact, fulltext indexes can offer a lot more flexibility in terms of the order of matching words, how close those words are together, etc.

Stemming:

A fulltext search can stem words. If you search for run, you can get results for "ran" or "running". Most fulltext engines have stem dictionaries in a variety of languages.

Weighted Results:

A fulltext index can encompass multiple columns. For example, you can search for "peach pie", and the index can include a title, keywords, and a body. Results that match the title can be weighted higher, as more relevant, and can be sorted to show near the top.
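A rough Python sketch of field weighting for the "peach pie" example follows. The weights, field names, and sample records are assumptions chosen for illustration; real engines expose their own weighting options and use more elaborate scoring than a weighted term count.

```python
import re

# Assumed weights: a title match counts three times as much as a body match.
FIELD_WEIGHTS = {"title": 3.0, "keywords": 2.0, "body": 1.0}

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def score(doc, query):
    """Sum term occurrences per field, scaled by that field's weight."""
    terms = words(query)
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        field_words = words(doc.get(field, ""))
        total += weight * sum(field_words.count(t) for t in terms)
    return total

def search(docs, query):
    """Return (id, score) pairs for matching records, best match first."""
    scored = [(d["id"], score(d, query)) for d in docs]
    hits = [(doc_id, s) for doc_id, s in scored if s > 0]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

docs = [
    {"id": 1, "title": "Peach pie", "keywords": "dessert baking",
     "body": "A classic summer dessert."},
    {"id": 2, "title": "Summer desserts", "keywords": "peach fruit",
     "body": "Try a peach cobbler or a pie."},
    {"id": 3, "title": "Apple tart", "keywords": "apple",
     "body": "No stone fruit here."},
]

print(search(docs, "peach pie"))  # [(1, 6.0), (2, 4.0)]
```

Record 1 matches both terms in the title and ranks first, while record 2 matches only in the keywords and body and scores lower.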

Disadvantages:

A fulltext index can potentially be huge, many times larger than a standard B-TREE index. For this reason, many hosted providers who offer database instances disable this feature, or at least charge extra for it. For example, last I checked, Windows Azure did not support fulltext queries.

Fulltext indexes can also be slower to update. If the data changes a lot, there might be some lag updating indexes compared to standard indexes.


