Fulltext Search With Innodb

Fulltext Search with InnoDB

I can vouch for MyISAM fulltext being a bad option - even leaving aside the various problems with MyISAM tables in general, I've seen the fulltext stuff go off the rails and start corrupting itself and crashing MySQL regularly.

A dedicated search engine is definitely going to be the most flexible option here - store the post data in MySQL/innodb, and then export the text to your search engine. You can set up a periodic full index build/publish pretty easily, and add real-time index updates if you feel the need and want to spend the time.

Lucene and Sphinx are good options, as is Xapian, which is nice and lightweight. If you go the Lucene route don't assume that Clucene will better, even if you'd prefer not to wrestle with Java, although I'm not really qualified to discuss the pros and cons of either.

Does MySQL's FULLTEXT search return the same results for MyISAM and InnoDb?

There are actually some notable differences in the implementation of the MyISAM and InnoDB fulltext searches:

  • The MyISAM natural language search (but not the boolean mode) has a 50% threshold, while InnoDB doesn't, so very (very) common words are excluded in the MyISAM results. There is a remark in the manual about it:

    The 50% threshold can surprise you when you first try full-text searching to see how it works, and makes InnoDB tables more suited to experimentation with full-text searches. If you create a MyISAM table and insert only one or two rows of text into it, every word in the text occurs in at least 50% of the rows. As a result, no search returns any results until the table contains more rows.

  • The MyISAM stopword list (a list of words that is not included in the fulltext index and thus cannot be found) is significantly longer than the (default) one used by InnoDB, so e.g. "everybody" or "unfortunately" can be found with InnoDB, but not with MyISAM. match against ('Mary Had a Little Lamb') will usually contain a lot more results, as "had" is a stopword in MyISAM, but not in InnoDB.

  • MyISAM and InnoDB use different weight algorithms. MyISAM considers e.g. the ratio of matching words to non-matching words in a row, so a long sentence that contains a word is less relevant than a short sentence with that word. Although this will only change the order in the otherwise identical resultset, this oftentimes has a significant impact on the user experience and if the user regards two results as "the same", which is what you are asking about. This might also be particular relevant, as searches usually include a limit, e.g. order by score desc limit 10, which thus can yield completely different results.

  • InnoDB supports "" to match exact phrases (words in given order), while MyISAM (at least in natural language mode) doesn't. So if you use match against ('"Mary Had a Little Lamb"'), InnoDB will only return a row if it contains this exact sentence, while MyISAM will find every row that contains any of these words (apart from "had" as mentioned above, and "a", which is in both stopword lists).

  • Since you are using the natural language mode, deviations in the boolean search are probably not relevant for you, but to list at least one: the two engines differ in how they treat stop (or short) words in the search query. If you use match against ('+about +Mary' in boolean mode) ("about" is a stopword in both engines), InnoDB will try to find that word in the index although it cannot be in there, and thus return no results, while MyISAM will ignore that word and can return results that may not contain "about", only "Mary".

Additionally, the default values for the minimum word length,
ft_min_word_len for MyISAM (default 4) and innodb_ft_min_token_size for InnoDB (default 3) are different, so if you do not adjust them, the InnoDB index will contain (and find) more words. You might also want to adept the stop word list to match each other.

If these differences are relevant in your case will depend on your data, your search patterns and if you consider a different order to be a different result. Searches in data that mainly consists of short terms or fixed formats, e.g. product codes or company names, or searches where you are mainly interested in finding specific words at all, or searches that usually only yield a handful of possible results, will usually vary less in the two engines than searches in actual english texts, where a different relevance score has a bigger effect.

FullText Search Innodb Fails, MyIsam Returns Results

There are several differences between MyISAM's FULLTEXT and InnoDB's. I think you were caught by the handling of 'short' words and/or stop words. MyISAM will show rows, but InnoDB will fail to.

What I have done when using FT (and after switching to InnoDB) is to filter the user's input to avoid short words. It takes extra effort but gets me the rows desired. My case is slightly different since the resulting query is something like this. Note that I have added + to require the words, but not on words shorter than 3 (my ft_min_token_size is 3). These searches were for build a table and build the table:

WHERE match(description) AGAINST('+build* a +table*' IN BOOLEAN MODE)
WHERE match(description) AGAINST('+build* +the* +table*' IN BOOLEAN MODE)

(The trailing * may be redundant; I have not investigated that.)

Another approach

Since FT is very efficient at non-short, non-stop words, do the search with two phases, each being optional: To search for "a long word", do

WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)
AND d REGEXP '[[:<:]]a[[:>:]]'

The first part whittles down the possible rows rapidly by looking for 'long' and 'word' (as words). The second part makes sure there is a word a in the string, too. The REGEXP is costly but will be applied only to those rows that pass the first test.

To search just for "long word":

WHERE MATCH(d) AGAINST ('+long +word' IN BOOLEAN MODE)

To search just for the word "a":

WHERE d REGEXP '[[:<:]]a[[:>:]]'

Caveat: This case will be slow.

Note: My examples allow for the words to be in any order, and in any location in the string. That is, this string will match in all my examples: "She was longing for a word from him."

MySQL fulltext searches in InnoDB error

When you use MATCH() the columns you name must together be indexed as one fulltext index, not each column individually in a separate fulltext index.

Given the query you show, you need an index defined this way:

ALTER TABLE `products` ADD FULLTEXT (`pname`,`description`);

MySQL InnoDB FULLTEXT search rankings for phrase searches

I suspect this is a quirk of phrases. The following documentation ranks among the most inconsistent language that I've seen:

A phrase that is enclosed within double quote (") characters matches
only rows that contain the phrase literally, as it was typed. The
full-text engine splits the phrase into words and performs a search in
the FULLTEXT index for the words.
Nonword characters need not be
matched exactly: Phrase searching requires only that matches contain
exactly the same words as the phrase and in the same order.

The first sentence directly conflicts with the rest of the explanation. I've highlighted the part that I think is important.

So, I'll speculate. MySQL does the phrase matching at the word level. So, more "foo"s and "bar"s in the result -- without "foo bar" add to the score. There is some extra mechanism that makes sure that the pair is actually in the result set, but this does not affect the score.

One thing you could do is your own ordering:

order by length(content) - length(replace(content, 'foo bar', '')) desc

This explicitly looks for "foo bar" in the content, ordering by that value.

INNODB FULLTEXT search with JOIN: search term on different tables

As described by @Akina this doesn't seem to be possible because INNODB doesn't allow FULLTEXT search on two tables.

I'll create a third table only for the FULLTEXT search and fill that with the data from both tables.



Related Topics



Leave a reply



Submit