Optimizing MySQL Fulltext Search

Optimizing MySQL fulltext search

Based on the question's follow-up comments, you have a B-tree index on your columns rather than a full-text index.

For a MATCH (title, content) AGAINST (...) search, you would need:

CREATE FULLTEXT INDEX index_name ON tbl_name (title,content);

A FULLTEXT index accepts only CHAR, VARCHAR, and TEXT columns, so the date field cannot go in there (it is probably not relevant to the search anyway).

Optimising the MySQL full text search Query?

Replace all tests REGEXP '.*' with IS NOT NULL.

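Leverage the fulltext indexes: replace tests like fabric REGEXP 'Cotton|Nylon' with MATCH(fabric) AGAINST ("+Cotton" IN BOOLEAN MODE) OR MATCH(fabric) AGAINST ("+Nylon" IN BOOLEAN MODE). Keep in mind that full-text matching works on whole words, whereas REGEXP also matches substrings.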

Normalize your database. A column should not contain non-scalar values (such as CSV data). Instead, a one-to-many relationship should be established with a new table.
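As a rough illustration of the normalization step, the sketch below splits a comma-separated column into a child table with one row per value. It uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere; the table and column names (products, product_fabrics, fabric_csv) are made up for the example and are not from the original question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized source table: fabric_csv holds several values per row.
    CREATE TABLE products (id INTEGER PRIMARY KEY, fabric_csv TEXT);
    -- Normalized child table: one row per (product, fabric) pair.
    CREATE TABLE product_fabrics (
        product_id INTEGER NOT NULL REFERENCES products(id),
        fabric     TEXT    NOT NULL,
        PRIMARY KEY (product_id, fabric)
    );
    CREATE INDEX idx_product_fabrics_fabric ON product_fabrics (fabric);
""")
conn.executemany(
    "INSERT INTO products (id, fabric_csv) VALUES (?, ?)",
    [(1, "Cotton,Nylon"), (2, "Wool"), (3, "Nylon, Silk")],
)

# Split each CSV value and store the parts as separate rows.
source_rows = conn.execute("SELECT id, fabric_csv FROM products").fetchall()
for product_id, fabric_csv in source_rows:
    fabrics = {part.strip() for part in fabric_csv.split(",") if part.strip()}
    conn.executemany(
        "INSERT INTO product_fabrics (product_id, fabric) VALUES (?, ?)",
        [(product_id, fabric) for fabric in sorted(fabrics)],
    )


def products_with_fabric(*fabrics):
    """Return ids of products that have at least one of the given fabrics."""
    placeholders = ", ".join("?" for _ in fabrics)
    rows = conn.execute(
        f"SELECT DISTINCT product_id FROM product_fabrics "
        f"WHERE fabric IN ({placeholders}) ORDER BY product_id",
        fabrics,
    )
    return [row[0] for row in rows]


print(products_with_fabric("Cotton", "Nylon"))
```

With the values in their own indexed column, the 'Cotton|Nylon' test becomes a plain indexed IN lookup and needs neither REGEXP nor a full-text index.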

Optimize mysql query with full text index and other column(s) w/ indexes

There is a serious datatype inconsistency here:

`item_id` int(10) NOT NULL AUTO_INCREMENT,
phppos_items.item_id = 'Search'

This is especially messy to try to optimize:

WHERE (MATCH (phppos_items.name, phppos_items.item_number, product_id, description)
AGAINST ('"Search* "' IN BOOLEAN MODE)
or phppos_items.item_id = 'Search')
and phppos_items.deleted=0
and system_item = 0;

Essentially, the only way to perform this query is to check every row of the table, because the OR prevents any single index from answering it. Furthermore, FT tests like to be "in the driver's seat" (to run first and drive the query), but the OR does not allow that.

Step one is to get rid of the OR:

( SELECT ...
WHERE MATCH(..) AGAINST(..)
AND phppos_items.deleted=0
AND system_item = 0 )
UNION DISTINCT
( SELECT ...
WHERE phppos_items.item_id = 'Search'
AND phppos_items.deleted=0
AND system_item = 0 )

The first SELECT will do the FT test (very fast), then filter out any rows based on the 0-tests.

The second SELECT will simply use the PRIMARY KEY(item_id) (unless you have a typo!). Because item_id is numeric, the string 'Search' is converted to the number 0, so the query checks item_id against zero and presumably fails to find any rows.

Then the UNION will collect the two resultsets, dedup them and deliver the results very fast.

(Turning OR into UNION is a general optimization technique; it seems especially useful for your query.)
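To convince yourself that the rewrite returns the same rows, here is a small sketch that runs both shapes of the query side by side. It is only a stand-in: it uses Python's sqlite3 module instead of MySQL, a LIKE prefix test in place of MATCH ... AGAINST (SQLite has no MySQL-style FULLTEXT index), and a made-up items table. It says nothing about speed, only about equivalence of the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (
        item_id     INTEGER PRIMARY KEY,
        name        TEXT    NOT NULL,
        deleted     INTEGER NOT NULL DEFAULT 0,
        system_item INTEGER NOT NULL DEFAULT 0
    );
    CREATE INDEX idx_items_name ON items (name);
""")
conn.executemany(
    "INSERT INTO items (item_id, name, deleted, system_item) VALUES (?, ?, ?, ?)",
    [
        (1, "Search light", 0, 0),
        (2, "Hammer", 0, 0),
        (3, "Searchable map", 1, 0),  # deleted, so it must be filtered out
        (7, "Search engine book", 0, 0),
        (9, "Wrench", 0, 0),
    ],
)

# Original shape: two unrelated conditions joined by OR.
or_query = """
    SELECT item_id FROM items
    WHERE (name LIKE :prefix OR item_id = :id)
      AND deleted = 0 AND system_item = 0
"""

# Rewritten shape: one SELECT per condition, combined with UNION
# (plain UNION removes duplicates, like UNION DISTINCT in MySQL).
union_query = """
    SELECT item_id FROM items
    WHERE name LIKE :prefix AND deleted = 0 AND system_item = 0
    UNION
    SELECT item_id FROM items
    WHERE item_id = :id AND deleted = 0 AND system_item = 0
"""

params = {"prefix": "Search%", "id": 9}
or_ids = sorted(row[0] for row in conn.execute(or_query, params))
union_ids = sorted(row[0] for row in conn.execute(union_query, params))
print(or_ids, union_ids)
```

Each half of the UNION has a single condition that an index can serve on its own, which is the point of the rewrite.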

I find it rarely wise to have more than 2 UNIQUE keys on a table. Are you sure you have 3?

Optimize MySQL FULL TEXT search

Did you try to UNION two sets of results instead of using the OR operator in the WHERE clause? I'm afraid the indexes cannot be used with the OR operator.

The query will be something like this:

(SELECT td__user.*,
    td__user_oauth.facebook_id,
    td__user_oauth.google_id
FROM td__user
LEFT JOIN td__user_oauth ON td__user.id = td__user_oauth.user_id
WHERE td__user.id LIKE :id)

UNION

(SELECT td__user.*,
    td__user_oauth.facebook_id,
    td__user_oauth.google_id
FROM td__user
LEFT JOIN td__user_oauth ON td__user.id = td__user_oauth.user_id
WHERE MATCH (email, firstname, lastname) AGAINST (:match IN BOOLEAN MODE))

ORDER BY date_accountcreated DESC LIMIT 20 OFFSET 0

Hope this can help!

How to optimize MySQL Boolean Full-Text Search? (Or what to replace it with?) - C#

First, you should realize that RDBMS support for full text indexing is a hack to force a technology designed to allow efficient access to structured data to deal with unstructured text. (Yes, that's just my opinion. If required, I can defend it as I understand both technologies extremely well. ;)

So, what can be done to improve search performance?

Option One - "The Best Tool For The Task"

The best way to handle full-text search within a corpus of documents is to use technology specifically designed to do so, such as Solr (Lucene) from Apache or Sphinx from, err, Sphinx.

For reasons that will become clear below, I strongly recommend this approach.

Option Two - Preload Your Results

When constructing text-based search solutions, the usual approach is to index all documents into a single searchable index. While this might be the most expedient, it is not the only approach.

Assuming what you're searching for can be easily quantified into a set of known rules, you could offer more of a "guided" style of search than simply unqualified full-text. What I mean by this is, if your application might benefit from guiding users to results, you can preload various sets of results based on a known set of rules into their own tables, and thus reduce the bulk of data to be searched.

If you expect a majority of your users will benefit from a known set of search terms in a known order, you can construct your search UI to favor those terms.

So assuming a majority of users are looking for a variety of automobile, you might offer predefined searches based on model, year, condition, etc. Your search UI would be crafted as a series of dropdown menus to "guide" users to specific results.

Or if a majority of searches will be for a specific main topic (say 'automobiles') you could predefine a table of only those records you've previously identified as being related to automobiles.

Both of these approaches would reduce the number of records to be searched and so improve response times.

Option Three - "Roll Your Own"

If you cannot integrate an external search technology into your project and preloading isn't an option, there are still ways to vastly improve search query response times, but they differ based on what you need to accomplish and how you expect searches to be carried out.

If you expect users to search using single keywords or phrases and boolean relationships between them, you might consider constructing your own 'inverted index' of your corpus. (This is what MySQL's Boolean Full-Text Search already does, but doing it yourself allows greater control over both the speed and accuracy of search.)

To build an inverted index from your existing data:

Step 1. Create three tables


-- dict: a dictionary containing one row per unique word in the corpus
CREATE TABLE dict (
    id   INT PRIMARY KEY,
    word VARCHAR(255)  -- length is illustrative
);

-- invert: an inverted index mapping words to records in the corpus
CREATE TABLE invert (
    id      INT PRIMARY KEY,
    rec_id  INT,
    word_id INT
);

-- stopwords: words to ignore when indexing (like a, an, the, etc.)
CREATE TABLE stopwords (
    id   INT PRIMARY KEY,
    word VARCHAR(255)  -- length is illustrative
);

Note: This is just a sketch. You'll want to add indexes and constraints, etc. when you actually create these tables.

The stopwords table is used to reduce the size of your index to only those words that matter to users' expected queries. For example, it's rarely useful to index English articles, like 'a', 'an', 'the', since they do not contribute useful meaning to keyword searches.

Typically, you'll require a stopword list specifically crafted to the needs of your application. If you never expect users to include the terms 'red', 'white' or 'blue' in their queries or if these terms appear in every searchable record, you would want to add them to your stopword list.

See the note at the end of this message for instructions on using your own stopwords list in MySQL.

See also:

  • The current list of stopwords supported in MySQL

  • A good starting stopword list in English

Step 2. Build the Inverted Index

To build an inverted index from your existing records, you'll need to (pseudo-code):


foreach( word(w) in record(r) ) {
    if( w is not in stopwords ) {
        if( w does not exist in dictionary ) {
            insert w to dictionary at w.id
        }
        insert (r.id, w.id) into inverted_index
    }
}

More on stopwords:

Instead of using a specific stopword list, the 'if(w is not in stopwords)' test could make other decisions either instead of or as an adjunct to your list of unacceptable words.

Your application might wish to filter out all words less than 4 characters long or to only include words from a predefined set.

By creating your own inverted index, you gain far greater and finer-grained control over search.
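The indexing loop from Step 2 could be sketched roughly as follows. This is only one possible shape, not a definitive implementation: it uses Python's sqlite3 module as a stand-in for MySQL, a very naive tokenizer, and helper names (tokenize, should_index, word_id_for, index_record) that are invented for the example. INSERT OR IGNORE is SQLite syntax; in MySQL the equivalent is INSERT IGNORE.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict      (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
    CREATE TABLE invert    (id INTEGER PRIMARY KEY, rec_id INTEGER NOT NULL,
                            word_id INTEGER NOT NULL REFERENCES dict(id));
    CREATE TABLE stopwords (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
    CREATE INDEX idx_invert_word ON invert (word_id, rec_id);
""")
conn.executemany("INSERT INTO stopwords (word) VALUES (?)",
                 [("a",), ("an",), ("the",)])


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words (naive)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def should_index(word, min_length=1):
    """The 'w is not in stopwords' test, plus an optional length filter."""
    if len(word) < min_length:
        return False
    row = conn.execute("SELECT 1 FROM stopwords WHERE word = ?", (word,)).fetchone()
    return row is None


def word_id_for(word):
    """Return the dictionary id for word, inserting the word if it is new."""
    conn.execute("INSERT OR IGNORE INTO dict (word) VALUES (?)", (word,))
    return conn.execute("SELECT id FROM dict WHERE word = ?", (word,)).fetchone()[0]


def index_record(rec_id, text, min_length=1):
    """Add one invert row per indexable word occurrence in the record."""
    for word in tokenize(text):
        if should_index(word, min_length):
            conn.execute("INSERT INTO invert (rec_id, word_id) VALUES (?, ?)",
                         (rec_id, word_id_for(word)))


def records_containing(word):
    """Return the sorted ids of records whose text contains the word."""
    rows = conn.execute(
        "SELECT DISTINCT I.rec_id FROM invert AS I "
        "JOIN dict AS D ON I.word_id = D.id WHERE D.word = ? ORDER BY I.rec_id",
        (word,))
    return [row[0] for row in rows]


records = {1: "The red car", 2: "A blue car and a red bike"}
for rec_id, text in records.items():
    index_record(rec_id, text)

print(records_containing("car"))
```

Passing min_length=4 to index_record would implement the "filter out all words less than 4 characters long" variant mentioned above.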

Step 3. Query the Inverted Index Using SQL

This step really depends on how you expect queries to be submitted to your index.

If queries are to be 'hard-coded', you can simply write the SELECT statement yourself. If you need to support user-entered queries, you'll need to convert whatever query language you choose into an SQL statement (typically done using a simple parser).
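A "simple parser" of that kind might look like the sketch below. The query language here is an assumption made up for the example (words, AND, OR, and parentheses, with AND binding tighter than OR), and to_sql is a hypothetical helper. It emits a parameterized statement against the dict and invert tables from Step 1, and the demo at the end runs it on SQLite purely to show that the generated SQL executes.

```python
import re
import sqlite3

# SQL fragment that is true for rows of records containing one given word.
TERM_SQL = ("rec_id IN (SELECT I.rec_id FROM invert AS I "
            "JOIN dict AS D ON I.word_id = D.id WHERE D.word = ?)")


def to_sql(query):
    """Translate e.g. '(a AND b) OR c' into (sql, params)."""
    tokens = re.findall(r"\(|\)|[^\s()]+", query)
    params = []
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        if token is None:
            raise ValueError("unexpected end of query")
        pos += 1
        return token

    def parse_or():
        parts = [parse_and()]
        while peek() == "OR":
            take()
            parts.append(parse_and())
        return parts[0] if len(parts) == 1 else "(" + " OR ".join(parts) + ")"

    def parse_and():
        parts = [parse_factor()]
        while peek() == "AND":
            take()
            parts.append(parse_factor())
        return parts[0] if len(parts) == 1 else "(" + " AND ".join(parts) + ")"

    def parse_factor():
        token = take()
        if token == "(":
            inner = parse_or()
            if take() != ")":
                raise ValueError("expected ')'")
            return inner
        if token in (")", "AND", "OR"):
            raise ValueError(f"unexpected token {token!r}")
        params.append(token.lower())  # the word travels as a bound parameter
        return TERM_SQL

    where = parse_or()
    if peek() is not None:
        raise ValueError(f"unexpected token {peek()!r}")
    return f"SELECT DISTINCT rec_id FROM invert WHERE {where}", params


# Tiny demo index: record 10 has word1 and word2, 20 has word1, 30 has word3.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict   (id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER, word_id INTEGER);
    INSERT INTO dict (id, word) VALUES (1, 'word1'), (2, 'word2'), (3, 'word3');
    INSERT INTO invert (rec_id, word_id) VALUES (10, 1), (10, 2), (20, 1), (30, 3);
""")
sql, params = to_sql("(word1 AND word2) OR word3")
matches = sorted(row[0] for row in conn.execute(sql, params))
print(matches)
```

Because user input only ever reaches the statement through ? placeholders, the generated SQL stays safe from injection regardless of what words are typed.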

Assuming you wish to retrieve all documents matching the logical query '(word1 AND word2) OR word3', a possible approach might be:

-- COUNT(DISTINCT ...) so that a record containing word1 twice
-- does not pass as "word1 AND word2"
CREATE TEMPORARY TABLE temp_results ( rec_id int, word_count int ) AS
( SELECT I.rec_id, COUNT(DISTINCT I.word_id) AS word_count
FROM invert AS I, dict AS D
WHERE I.word_id=D.id AND (D.word='word1' OR D.word='word2')
GROUP BY I.rec_id
HAVING word_count=2
)
UNION (
SELECT I.rec_id, 1 AS word_count
FROM invert AS I, dict AS D
WHERE I.word_id=D.id AND D.word='word3'
);

SELECT DISTINCT rec_id FROM temp_results;

DROP TABLE temp_results;

NOTE: This is just a first pass off the top of my head. I am confident there are more efficient ways of converting a boolean query expression into an efficient SQL statement and welcome any and all suggestions for improvement.
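One way to check the '(word1 AND word2) OR word3' logic on a small data set is sketched below. It is a variation on the query above, not a replacement for it: it skips the temporary table, runs on SQLite through Python's sqlite3 module rather than on MySQL, and uses made-up record ids. It counts distinct word ids so that a record containing word1 twice is not mistaken for one containing both words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict   (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER NOT NULL,
                         word_id INTEGER NOT NULL);
    INSERT INTO dict (id, word) VALUES (1, 'word1'), (2, 'word2'), (3, 'word3');
    -- rec 10: word1 + word2; rec 20: word1 twice; rec 30: word3; rec 40: word2
    INSERT INTO invert (rec_id, word_id) VALUES
        (10, 1), (10, 2), (20, 1), (20, 1), (30, 3), (40, 2);
""")

QUERY = """
    SELECT I.rec_id
    FROM invert AS I JOIN dict AS D ON I.word_id = D.id
    WHERE D.word IN (?, ?)
    GROUP BY I.rec_id
    HAVING COUNT(DISTINCT I.word_id) = 2
    UNION
    SELECT I.rec_id
    FROM invert AS I JOIN dict AS D ON I.word_id = D.id
    WHERE D.word = ?
"""


def search(word_a, word_b, word_c):
    """Return sorted rec_ids matching (word_a AND word_b) OR word_c."""
    rows = conn.execute(QUERY, (word_a, word_b, word_c))
    return sorted(row[0] for row in rows)


result = search("word1", "word2", "word3")
print(result)
```

Record 20 is the interesting case: with a plain COUNT(rec_id) it would have a count of 2 and slip through the HAVING test.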

To search for phrases, you'll need to add a field to the inverted index to represent the position the word appeared within its record and factor that into your SELECT.
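A positional index along those lines could be sketched like this. The position column, the helper names, and the sample records are assumptions for illustration, and SQLite (via Python's sqlite3) again stands in for MySQL. The idea is one self-join of invert per additional word in the phrase, requiring word k to sit exactly k positions after the first word.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict   (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER NOT NULL,
                         word_id INTEGER NOT NULL,
                         position INTEGER NOT NULL);  -- word offset in the record
    CREATE INDEX idx_invert_lookup ON invert (word_id, rec_id, position);
""")


def words_of(text):
    """Lowercase the text and split it into alphanumeric words (naive)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_record(rec_id, text):
    """Store every word occurrence together with its position."""
    for position, word in enumerate(words_of(text)):
        conn.execute("INSERT OR IGNORE INTO dict (word) VALUES (?)", (word,))
        conn.execute(
            "INSERT INTO invert (rec_id, word_id, position) "
            "SELECT ?, id, ? FROM dict WHERE word = ?",
            (rec_id, position, word))


def phrase_search(phrase):
    """Return rec_ids where the phrase's words occur at consecutive positions."""
    words = words_of(phrase)
    if not words:
        return []
    word_id = "(SELECT id FROM dict WHERE word = ?)"
    joins = [
        f"JOIN invert AS i{k} ON i{k}.rec_id = i0.rec_id "
        f"AND i{k}.position = i0.position + {k} AND i{k}.word_id = {word_id}"
        for k in range(1, len(words))
    ]
    sql = (f"SELECT DISTINCT i0.rec_id FROM invert AS i0 {' '.join(joins)} "
           f"WHERE i0.word_id = {word_id} ORDER BY i0.rec_id")
    # Placeholders for the joined words come first in the SQL text,
    # followed by the placeholder for the first word in the WHERE clause.
    rows = conn.execute(sql, words[1:] + words[:1])
    return [row[0] for row in rows]


index_record(1, "full text search in MySQL")
index_record(2, "search the full text")
index_record(3, "text search, full stop")

print(phrase_search("full text"))
```

Note that if stopwords are dropped before positions are assigned, phrases that span a stopword will match differently, so decide up front whether positions count stopwords or not.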

And finally, you'll need to update your inverted index as you add new records or delete old ones.
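Keeping the index in step with the data can be as simple as "delete the record's old entries, then index it again". The sketch below shows that idea under the same assumptions as before (sqlite3 as a stand-in for MySQL, invented helper names, naive tokenizer); in a real application these calls would sit next to whatever code inserts, updates, or deletes the underlying records.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dict   (id INTEGER PRIMARY KEY, word TEXT NOT NULL UNIQUE);
    CREATE TABLE invert (id INTEGER PRIMARY KEY, rec_id INTEGER NOT NULL,
                         word_id INTEGER NOT NULL);
    CREATE INDEX idx_invert_rec ON invert (rec_id);
""")


def add_record(rec_id, text):
    """Index every word of a new record."""
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        conn.execute("INSERT OR IGNORE INTO dict (word) VALUES (?)", (word,))
        conn.execute(
            "INSERT INTO invert (rec_id, word_id) "
            "SELECT ?, id FROM dict WHERE word = ?",
            (rec_id, word))


def remove_record(rec_id):
    """Drop a record's entries from the inverted index."""
    conn.execute("DELETE FROM invert WHERE rec_id = ?", (rec_id,))
    # Optional housekeeping: drop dictionary words no record uses any more.
    conn.execute("DELETE FROM dict WHERE id NOT IN (SELECT word_id FROM invert)")


def update_record(rec_id, text):
    """Re-index a changed record: remove the old entries, add the new ones."""
    with conn:  # commit both steps together, or roll back on error
        remove_record(rec_id)
        add_record(rec_id, text)


def words_in_dict():
    return [row[0] for row in conn.execute("SELECT word FROM dict ORDER BY word")]


add_record(1, "red car")
add_record(2, "blue car")
update_record(1, "green bike")
remove_record(2)
print(words_in_dict())
```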

Final Word

"Full text search" falls under a very large area of research known as "Information Retrieval" or IR and there are many books on the subject, including

  • Information Retrieval: Implementing and Evaluating Search Engines by Stefan Büttcher, Charles L. A. Clarke and Gordon V. Cormack (Jul 23, 2010)

  • Search Engines: Information Retrieval in Practice by Bruce Croft, Donald Metzler and Trevor Strohman (Feb 16, 2009)

  • Building Search Applications: Lucene, LingPipe, and Gate by Manu Konchady (Jun 2008)

Check Amazon for more.

Notes

How To Use Your Own List of Stopwords in MySQL

To use your own stopword list in MySQL:

  1. Create your own list of stopwords, one word per line, and save it to a known location on your server, say: /usr/local/lib/IR/stopwords.txt


  2. Edit my.cnf to add or update the following lines:

    [mysqld]
    ft_min_word_len=1
    ft_max_word_len=40
    ft_stopword_file=/usr/local/lib/IR/stopwords.txt

    which will set the minimum and maximum length of legal words to 1 and 40,
    respectively, and tell mysqld where to find your custom list of stopwords.

    (Note: the default ft_max_word_len is 84, which I believe is pretty excessive
    and can cause runs of strings that are not real words to be indexed.)

  3. Restart mysqld


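  4. Drop and recreate all full-text indexes. (The ft_* settings above apply to MyISAM full-text indexes; InnoDB uses innodb_ft_min_token_size, innodb_ft_max_token_size, and a stopword table instead.)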

