Solr or Sphinx? Which Is Better?

Solr or Sphinx? Which is better?

I have no experience with Solr, but Sphinx is easy to install, fast and works great with Thinking Sphinx: http://freelancing-god.github.com/ts/en/indexing.html

There is also a good railscast:
http://railscasts.com/episodes/120-thinking-sphinx

This guy gives you some arguments why to go with Sphinx:
http://jamesgolick.com/tags/ultrasphinx.html
(He uses the Ultrasphinx plugin to connect Rails and Sphinx. I tried both and ended up using Thinking Sphinx)

You can find a comparison of both plugins here:
http://reinh.com/blog/2008/07/14/a-thinking-mans-sphinx.html

Choosing a stand-alone full-text search server: Sphinx or SOLR?

I've been using Solr successfully for almost 2 years now, and have never used Sphinx, so I'm obviously biased.
However, I'll try to keep it objective by quoting the docs or other people. I'll also take patches to my answer :-)

Similarities:

  • Both Solr and Sphinx satisfy all of your requirements. They're fast and designed to index and search large bodies of data efficiently.
  • Both have a long list of high-traffic sites using them (Solr, Sphinx)
  • Both offer commercial support. (Solr, Sphinx)
  • Both offer client API bindings for several platforms/languages (Sphinx, Solr)
  • Both can be distributed to increase speed and capacity (Sphinx, Solr)

Here are some differences:

  • Solr, being an Apache project, is obviously Apache2-licensed. Sphinx is GPLv2. This means that if you ever need to embed or extend (not just "use") Sphinx in a commercial application, you'll have to buy a commercial license (rationale)
  • Solr is easily embeddable in Java applications.
  • Solr is built on top of Lucene, which is a proven technology over 8 years old with a huge user base (this is only a small part). Whenever Lucene gets a new feature or speedup, Solr gets it too. Many of the devs committing to Solr are also Lucene committers.
  • Sphinx integrates more tightly with RDBMSs, especially MySQL.
  • Solr can be integrated with Hadoop to build distributed applications
  • Solr can be integrated with Nutch to quickly build a fully-fledged web search engine with crawler.
  • Solr can index proprietary formats like Microsoft Word, PDF, etc. Sphinx can't.
  • Solr comes with a spell-checker out of the box.
  • Solr comes with facet support out of the box. Faceting in Sphinx takes more work.
  • Sphinx doesn't allow partial index updates for field data.
  • In Sphinx, all document ids must be unique unsigned non-zero integer numbers. Solr doesn't even require a unique key for many operations, and unique keys can be either integers or strings.
  • Solr supports field collapsing (currently as an additional patch only) to avoid duplicating similar results. Sphinx doesn't seem to provide any feature like this.
  • While Sphinx is designed to only retrieve document ids, in Solr you can directly get whole documents with pretty much any kind of data. This makes Solr more independent of any external data store and saves the extra roundtrip.
  • Solr, except when used embedded, runs in a Java web container such as Tomcat or Jetty, which requires additional specific configuration and tuning (or you can use the included Jetty and just launch it with java -jar start.jar). Sphinx has no additional configuration.
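
To make the faceting point above a bit more concrete, here is a minimal sketch of how a faceted request to Solr's standard /select handler could be assembled. The host, core name ("products") and field names ("title", "brand") are assumptions made up for illustration; only the parameter names (q, rows, wt, facet, facet.field) are Solr's own.

```python
from urllib.parse import urlencode

def solr_facet_url(base_url, query, facet_field, rows=10):
    """Build a Solr /select URL that also asks for facet counts on one field."""
    params = [
        ("q", query),                  # the search query itself
        ("rows", rows),                # how many documents to return
        ("wt", "json"),                # response format
        ("facet", "true"),             # switch faceting on
        ("facet.field", facet_field),  # field to count distinct values for
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)

# Hypothetical core and fields, for illustration only.
url = solr_facet_url("http://localhost:8983/solr/products", "title:camera", "brand")
print(url)
```

The response would carry the matching documents plus a count per brand, which is roughly what "facet support out of the box" means in practice.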

Related questions:

  • Full Text Searching with Rails
  • Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

ElasticSearch, Sphinx, Lucene, Solr, Xapian. Which fits for which usage?

As the creator of ElasticSearch, maybe I can give you some reasoning on why I went ahead and created it in the first place :).

Using pure Lucene is challenging. There are many things that you need to take care of if you want it to really perform well. Also, it's a library, so there is no distributed support; it's just an embedded Java library that you need to maintain.

In terms of Lucene usability, way back when (almost 6 years ago now), I created Compass. Its aim was to simplify everyday use of Lucene. What I came across time and time again was the requirement to be able to have Compass distributed. I started to work on it from within Compass, by integrating with data grid solutions like GigaSpaces, Coherence, and Terracotta, but it was not enough.

At its core, a distributed Lucene solution needs to be sharded. Also, with the advancement of HTTP and JSON as ubiquitous APIs, the solution should be one that many different systems, written in different languages, can easily use.

This is why I went ahead and created ElasticSearch. It has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all seamlessly expressed through JSON DSL.
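
As a rough illustration of what "expressed through JSON DSL" looks like from the client side, here is a minimal sketch that builds a search request body. The field name ("title") and the helper name are assumptions for the example, and the exact query types available have changed between ElasticSearch versions, so treat the "match" query here as one example rather than the definitive form.

```python
import json

def match_query(field, text, size=10):
    """Build a full-text search body: match `text` against one field."""
    return {
        "size": size,                        # number of hits to return
        "query": {"match": {field: text}},   # analyzed full-text match
    }

# Hypothetical field name; the body would be POSTed to /<index>/_search.
body = json.dumps(match_query("title", "distributed search"))
print(body)
```

Because the request is plain JSON over HTTP, any language with an HTTP client can send it, which is the point being made about ubiquitous APIs.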

Solr is also a solution for exposing an indexing/search server over HTTP, but I would argue that ElasticSearch provides a much superior distributed model and ease of use (though currently lacking on some of the search features, but not for long, and in any case, the plan is to get all Compass features into ElasticSearch). Of course, I am biased, since I created ElasticSearch, so you might need to check for yourself.

As for Sphinx, I have not used it, so I can't comment. What I can do is refer you to this thread on the Sphinx forum, which I think proves the superior distributed model of ElasticSearch.

Of course, ElasticSearch has many more features than just being distributed. It is actually built with the cloud in mind. You can check the feature list on the site.

Among Lucene/Solr, Whoosh, Sphinx, Xapian which integrates best with python?

Speaking for Apache Solr, Python has several Solr clients, which I've collected based on feedback from our customers at Websolr:

  1. Haystack is very popular, and designed for seamless integration within Django apps. If you're developing a Django app, Haystack is for you.
  2. Sunburnt looks to be more generic than Haystack, and is also very well documented. If you're doing plain ol' Python, Sunburnt is worth a look.

Other Python Solr clients that I've found, which seem a bit lower level...

  • solrpy
  • pysolr (I know, right?)
  • Insol

Some more details about how your app is built (in particular, is it a Django app?) would help narrow things down from here. Good luck finding the best fit for your app!

Does Solr have an equivalent of strict order operator that Sphinx has?

  • strict order operator: you would need to use SpanQueries for this (look up SpanQuery in the Lucene documentation for an explanation). In order to use them from Solr, you could try SurroundQParser, or else see this other question
  • NEAR, generalized proximity operator: yes, this is supported, see Proximity search
  • SENTENCE/PARAGRAPH: not directly. You could try several approaches:

    • Somehow map those to documents (and maybe use the Join functionality in 4.0 to link paragraph documents to parent documents, etc.)
    • Try to insert information about paragraphs with special tokens/gaps, see this
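
For the first two bullets, a small sketch of what the query strings might look like. This assumes the standard Lucene phrase-slop syntax ("a b"~N) for proximity and the surround parser's Nw(...) operator for ordered matches, selected in Solr via the {!surround} local-params prefix; the helper names are made up. Note that, as far as I know, the surround parser does not run terms through analysis, so lowercasing is done by hand here.

```python
def proximity_query(terms, slop):
    """Proximity search with the standard parser: "foo bar"~slop."""
    return '"{}"~{}'.format(" ".join(terms), slop)

def ordered_within(terms, distance):
    """Ordered proximity with the surround parser: terms must appear in the
    given order, each within `distance` positions of the previous one."""
    lowered = [t.lower() for t in terms]  # surround does not analyze terms
    return "{{!surround}}{}w({})".format(distance, ", ".join(lowered))

print(proximity_query(["solr", "sphinx"], 5))
print(ordered_within(["Solr", "Sphinx"], 3))
```

The resulting strings would be passed as the q parameter. Check the surround parser documentation for your Solr version before relying on this, since availability and limits differ between releases.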

Full-text search - should I pick dedicated search engine (SOLR, Elastic) or RDBMS one?

Specialized indexing solutions like Apache Solr, ElasticSearch, and Sphinx Search are usually faster than the built-in fulltext indexing of MySQL, the GiST indexes of PostgreSQL, etc. The specialized solutions often have more features, such as stemming, more sophisticated searching including faceting, and storing extra data in a "document" associated with the indexed text.

On the other hand, using one of those complementary solutions means extra complexity to copy data into the indexing solution. How frequently do you need to update the index? Is it efficient to update the index incrementally, or do you basically need to clobber the index and create a fresh index from your whole dataset?

By contrast, using the built-in indexing features of your RDBMS has the advantage that the index is probably kept in sync with the most recent data updates automatically. And the search capabilities may be good enough for your needs. Keeping the index maintenance simple and automated has a lot of positive value.

Besides, any of the solutions, even a sub-optimal one, is orders of magnitude better than the naïve approach many developers use: textcolumn LIKE '%keyword%'
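
To show why the LIKE approach scales so badly, here is a toy sketch contrasting a full scan with an inverted index lookup. It is purely illustrative: the sample documents are made up, and real engines (Lucene, Sphinx, MySQL fulltext) add tokenization rules, stemming, ranking and on-disk structures that this leaves out.

```python
import re
from collections import defaultdict

# Made-up sample rows: id -> text column.
docs = {
    1: "Sphinx is fast to index",
    2: "Solr is built on Lucene",
    3: "Indexing with Sphinx and MySQL",
}

def like_scan(keyword):
    """In the spirit of LIKE '%keyword%': inspect every row on every query."""
    return sorted(i for i, text in docs.items() if keyword.lower() in text.lower())

def build_index(documents):
    """Tokenize once, up front: word -> set of ids containing that word."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)
    return index

index = build_index(docs)

def indexed_search(keyword):
    """One dictionary lookup, independent of how many rows exist."""
    return sorted(index.get(keyword.lower(), set()))

print(like_scan("sphinx"), indexed_search("sphinx"))
print(like_scan("index"), indexed_search("index"))
```

The two approaches agree on "sphinx", but differ on "index": the scan also matches the substring inside "Indexing", while the toy index only matches whole words. Real engines close that gap with stemming rather than substring matching.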

What would make you take Apache SOLR or Elastic nowadays, instead of MySQL or other relational database with their increased Full-Text search capabilities?

Better performance and more sophisticated search support. It also helps to move those expensive search queries to a dedicated search engine and lighten the load on your RDBMS.

Mysql Search vs. Search Tools (CloudSearch,Sphinx,Solr..) For NonText Searches

Most search servers will not only search but also filter large amounts of data faster than a database. So, if you need better performance, go with a search server.

Another thing to consider is development cost: any search server requires some effort to configure and to integrate with a system.

I have some experience with Sphinx and I like it. Now I'm trying to integrate its realtime indexes with ORM and avoid any database filtering. Sphinx will search and filter data, return found IDs and InnoDB will just select data by IDs (MySQL and especially InnoDB tables do it really fast).
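
The two-step flow described above could be sketched like this. It is a stand-in, not my actual setup: sqlite3 plays the role of MySQL/InnoDB so the example runs anywhere, and search_ids is a hypothetical placeholder for the call to the search server, hard-coded to return IDs in relevance order.

```python
import sqlite3

# In-memory database standing in for the real InnoDB tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO articles (id, title) VALUES (?, ?)",
    [(1, "Intro to Sphinx"), (2, "Solr faceting"), (3, "Sphinx realtime indexes")],
)

def search_ids(query):
    """Hypothetical search-server call: returns matching IDs, best match first."""
    return [3, 1] if query == "sphinx" else []

def fetch_by_ids(conn, ids):
    """Load rows by primary key and restore the search engine's ordering."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        "SELECT id, title FROM articles WHERE id IN ({})".format(placeholders),
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not preserve order, so re-sort by the ID list.
    return [by_id[i] for i in ids if i in by_id]

results = fetch_by_ids(conn, search_ids("sphinx"))
print(results)
```

The database only ever does primary-key lookups here; all searching and filtering happens on the search server side.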

Just ask yourself "Will DB performance be enough for us?" and make a decision.

Comparison of full text search engine - Lucene, Sphinx, Postgresql, MySQL?

Good to see someone's chimed in about Lucene - because I've no idea about that.

Sphinx, on the other hand, I know quite well, so let's see if I can be of some help.

  • Result relevance ranking is the default. You can set up your own sorting should you wish, and give specific fields higher weightings.
  • Indexing speed is super-fast, because it talks directly to the database. Any slowness will come from complex SQL queries and un-indexed foreign keys and other such problems. I've never noticed any slowness in searching either.
  • I'm a Rails guy, so I've no idea how easy it is to implement with Django. There is a Python API that comes with the Sphinx source though.
  • The search service daemon (searchd) is pretty low on memory usage - and you can set limits on how much memory the indexer process uses too.
  • Scalability is where my knowledge is more sketchy - but it's easy enough to copy index files to multiple machines and run several searchd daemons. The general impression I get from others though is that it's pretty damn good under high load, so scaling it out across multiple machines isn't something that needs to be dealt with.
  • There's no support for 'did-you-mean', etc - although these can be done with other tools easily enough. Sphinx does stem words though using dictionaries, so 'driving' and 'drive' (for example) would be considered the same in searches.
  • Sphinx doesn't allow partial index updates for field data though. The common approach to this is to maintain a delta index with all the recent changes, and re-index this after every change (and those new results appear within a second or two). Because of the small amount of data, this can take a matter of seconds. You will still need to re-index the main dataset regularly though (although how regularly depends on the volatility of your data - every day? every hour?). The fast indexing speeds keep this all pretty painless though.
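
The main-plus-delta idea from the last bullet can be sketched with a toy in-memory model. This is only an illustration of the pattern, assuming plain dicts in place of real index files; the function names are made up and have nothing to do with Sphinx's actual configuration or API.

```python
# Toy stand-ins for the two indexes: id -> indexed text.
main_index = {1: "old title", 2: "unchanged doc"}
delta_index = {}

def record_change(doc_id, text):
    """Cheap step run after every change: only the small delta is touched."""
    delta_index[doc_id] = text

def search(term):
    """Query both indexes; a delta entry shadows the main entry with the same id."""
    merged = dict(main_index)
    merged.update(delta_index)
    return sorted(i for i, text in merged.items() if term in text)

def rebuild_main():
    """Expensive step run on a schedule: fold the delta into main and reset it."""
    main_index.update(delta_index)
    delta_index.clear()

record_change(1, "new title")   # an edit to an existing document
record_change(3, "new doc")     # a brand-new document
print(search("new"))            # visible immediately, via the delta
rebuild_main()
print(search("new"))            # same results, now served from main
```

How often rebuild_main would run corresponds to the "every day? every hour?" question: the longer you wait, the bigger the delta gets and the slower the per-change step becomes.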

I've no idea how applicable to your situation this is, but Evan Weaver compared a few of the common Rails search options (Sphinx, Ferret (a port of Lucene for Ruby) and Solr), running some benchmarks. Could be useful, I guess.

I've not plumbed the depths of MySQL's full-text search, but I know it doesn't compete speed-wise nor feature-wise with Sphinx, Lucene or Solr.


