Efficient Implementation of Faceted Search in Relational Databases

I can only confirm what Nils says. RDBMSs are not good at multi-dimensional searching. I have worked with some clever solutions: cached counters, triggers, and so on. But in the end, an external dedicated indexer always wins.

MAYBE, if you transform your data into a dimensional model and feed it to an OLAP engine (I mean an MDX engine), it will perform well. But that seems like too heavy a solution, and it will definitely NOT be real-time.

By contrast, a solution with a dedicated indexing engine (think Lucene, think Sphinx) can be made near real-time with incremental index updates.

How does Lucene/Solr achieve high performance in multi-field / faceted search?

Faceting

There are two answers for faceting, because there are two types of faceting. I'm not certain that either of these is faster than an RDBMS.

  1. Enum faceting. The results of a query are a bit vector where the ith bit is 1 if the ith document was a match. The facet is also a bit vector, so intersection is just a bitwise AND. I don't think this is a novel approach, and most RDBMSs probably support it. (A sketch in SQL follows this list.)
  2. Field Cache. This is just a normal (non-inverted) index. The SQL-style query that is run here is like:

    select facet, count(*) from field_cache
    where docId in query_results
    group by facet

Again, I don't think this is anything that a normal RDBMS couldn't do. The index is a skip list, with the docId as the key.
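
To make both concrete, here is a minimal sketch in plain SQL; it is not Lucene's implementation. The bit-string example assumes PostgreSQL, whose bit type supports bitwise AND; the field_cache table and its docId/facet columns come from the query above, and the doc IDs in the IN list are hypothetical.

    -- Enum faceting: document sets as fixed-width bit strings (PostgreSQL).
    -- Bit i is 1 when document i matched; intersecting the query's matches
    -- with a facet's documents is a single bitwise AND.
    SELECT B'10110' & B'01100' AS docs_matching_query_and_facet;  -- 00100

    -- Field cache: a plain forward index from docId to facet value.
    CREATE TABLE field_cache (
        docId INTEGER NOT NULL,
        facet TEXT    NOT NULL
    );
    CREATE INDEX field_cache_docid ON field_cache (docId);

    -- The query from above, with a hypothetical result set plugged in:
    SELECT facet, COUNT(*)
    FROM field_cache
    WHERE docId IN (1, 5, 42)
    GROUP BY facet;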

Multi-term search

This is where Lucene shines. Explaining why Lucene's approach is so good would take too long to post here, but I can recommend this post on Lucene Performance, or the papers linked therein.

Dedicated faceted search engine for dealing with dynamic taxonomies - does it help just with performance, or also with flexibility?

I don't claim to have a definitive answer to all of this (it's a rather open-ended question that you should try to break into smaller parts, and it depends on your actual requirements; in fact, I'm tempted to vote to close it), but I will comment on a few things:

  1. I would forget about modelling this on an RDBMS. Faceted search just doesn't work in a relational schema.
  2. IMO this is not the right place for code generation. You should design your code so it doesn't change with data changes (I'm not talking about schema changes).
  3. Storing metadata / attributes in an Excel spreadsheet seems like a very bad idea. I'd build a UI to edit them, which would be stored in Solr / MongoDB / CouchDB / whatever you choose to manage this.
  4. Solr does not "just mirror a relational DB". In fact, Solr is completely independent of relational databases. One of the most common setups is dumping data from an RDBMS into Solr (denormalizing it in the process), but Solr is flexible enough to work without any relational data source.
  5. Hierarchical faceting in Solr is still an open research issue. Currently there are two separate approaches being explored (SOLR-64, SOLR-792).

How to implement faceted search with Cypher?

Yes, you can combine a MATCH pattern with a pattern predicate in WHERE. Example:

// Tagged A and B
MATCH (img)<-[*]-(:TAG {name:'Tag Name A'})
WHERE (img)<-[*]-(:TAG {name:'Tag Name B'})
RETURN img

// Tagged A but not B
MATCH (img)<-[*]-(:TAG {name:'Tag Name A'})
WHERE NOT (img)<-[*]-(:TAG {name:'Tag Name B'})
RETURN img

See Using Patterns in Where (https://neo4j.com/docs/cypher-manual/current/clauses/where/#query-where-patterns)
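
For comparison, the same two filters can be written against a relational schema. This assumes a hypothetical image_tag(image_id, tag_name) join table, not anything from the question, and it only covers direct tagging, ignoring the arbitrary-length paths (e.g. tag hierarchies) that [*] allows:

    -- Tagged A and B: self-join on the hypothetical tag table
    SELECT a.image_id
    FROM image_tag a
    JOIN image_tag b
      ON b.image_id = a.image_id AND b.tag_name = 'Tag Name B'
    WHERE a.tag_name = 'Tag Name A';

    -- Tagged A but not B: anti-join via NOT EXISTS
    SELECT a.image_id
    FROM image_tag a
    WHERE a.tag_name = 'Tag Name A'
      AND NOT EXISTS (
          SELECT 1
          FROM image_tag b
          WHERE b.image_id = a.image_id
            AND b.tag_name = 'Tag Name B'
      );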

Lucene.NET Faceted Search

OK, so I finished my implementation. I did a lot of digging in the Lucene and Solr source code in the process, and I'd recommend not using the implementation described in the linked question, for several reasons. Not the least of these is that it relies on a deprecated method. It is also needlessly clever; just writing your own collector will get you faster code that uses less RAM.

MySQL and faceted navigation (filter by attributes)

"I then use PHP code to remove duplicates"

It will not scale then.

After I read The Data Warehouse Toolkit (http://www.amazon.com/Data-Warehouse-Toolkit-Techniques-Dimensional/dp/0471153370), I was rolling out facets & filtering mechanisms non-stop.

The basic idea is that you use a star schema.

You create a fact table that stores facts:

customerid | dateregisteredid | datelastloginid
1          | 1                | 1
2          | 1                | 2

You use foreign keys into dimension tables that store attributes:

date_registered
Id | weekday | weeknumber | year | month | month_year | daymonth | daymonthyear
1  | Wed     | 2          | 2009 | 2     | 2-2009     | 4        | 4-2-2009

Then, whichever date "paradigm" you are using, grab all the ids from that dimension table and run:

    select * from fact
    where fact.dateregisteredid IN ( ...the ids from the date dimension table that represent your time period )
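
Fleshed out, a minimal sketch of that schema and query; the column types are assumptions and the February-2009 filter is just an example, but the table and column names follow the ones above:

    CREATE TABLE date_registered (
        id           INTEGER PRIMARY KEY,
        weekday      VARCHAR(3),
        weeknumber   INTEGER,
        year         INTEGER,
        month        INTEGER,
        month_year   VARCHAR(7),
        daymonth     INTEGER,
        daymonthyear VARCHAR(10)
    );

    CREATE TABLE fact (
        customerid       INTEGER PRIMARY KEY,
        dateregisteredid INTEGER REFERENCES date_registered (id),
        datelastloginid  INTEGER REFERENCES date_registered (id)
    );

    -- "Everyone who registered in February 2009", resolved entirely
    -- through the dimension table:
    SELECT *
    FROM fact
    WHERE dateregisteredid IN (
        SELECT id FROM date_registered WHERE month_year = '2-2009'
    );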

These "indexed views" of your data should reside in a seperate database, and a change to an object in production should queue that record for re-indexing in the analytics system. Large sites might batch their records at non-peak times to the stats reporting application always lags behind a few hours or days. I always try to keep it up to the second, if the architecture supports it.

If you are displaying row-count previews, you may have quite a bit of optimization or caching to implement as well.
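
The usual shape for such a preview is one GROUP BY per facet over the current result set; here is a sketch against the hypothetical tables above:

    -- Row count per month_year facet value, for preview badges like "2-2009 (n)":
    SELECT d.month_year, COUNT(*) AS num_customers
    FROM fact f
    JOIN date_registered d ON d.id = f.dateregisteredid
    GROUP BY d.month_year
    ORDER BY d.month_year;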

Basically, to sum it up: you copy data and denormalize. The technique goes by the name "data warehousing", or OLAP (online analytical processing).

There are better ways, using commercial databases like Oracle, but the star schema makes the technique available to anyone with an open-source relational database and some time.

You should definitely read the Toolkit; it discusses a lot of things that can save you considerable time, like strategies for dealing with updated data and for retaining audit history in the reporting application. For every problem, he outlines multiple solutions, each of which is applicable in a different context.

It can scale up to millions of rows if you don't take the easy way out and use a ton of needless joins.


