Redis or Mongo for Determining If a Number Falls Within Ranges

Redis or Mongo for determining if a number falls within ranges?

Contrary to the previous poster, I don't think you can get O(log n) complexity with naive indexing. Consider mongodb, for instance. You could define two indexes (on the start and end properties of the ranges), but mongodb will only use one of them to solve a given query, so that does not help. With a single compound index covering both start and end properties of the ranges, finding the first candidate range is logarithmic, but scanning up to the last range matching the query is linear. The worst-case complexity is therefore O(n), and you hit it when all the stored ranges overlap your input.
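For concreteness, the query under discussion would look roughly like this with pymongo (the collection and field names below are just an illustration, not taken from the question):

from pymongo import MongoClient, ASCENDING

coll = MongoClient().test.ranges
coll.create_index([('start', ASCENDING), ('end', ASCENDING)])

x = 5
# even with the compound index, MongoDB can only seek to the index keys with
# start <= x; it then has to scan forward and filter on end >= x, which is
# linear when many of the stored ranges overlap x
matching = list(coll.find({'start': {'$lte': x}, 'end': {'$gte': x}}))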

On a side note, using a Redis sorted set you can emulate a sorted index (with O(log n) complexity) if you know what you are doing. Redis is a bit more than a simple key-value store: sorted sets are implemented with a skip list, and both the score and the value are used to compare items.

To solve this kind of problem, a dedicated indexing structure is needed. You may want to have a look at:

http://en.wikipedia.org/wiki/Segment_tree
or
http://en.wikipedia.org/wiki/Interval_tree

If the concern is speed over space, then it may be interesting to flatten the index.
For instance, let's consider the following ranges (using only integers to simplify the discussion):

A 2-8
B 4-6
C 2-9
D 7-10

A sparse structure indexing non overlapping segments can be built.

0  []
2 [A C]
4 [A C B]
7 [A C D]
9 [C D]
10 [D]
11 []

Each entry has the lower bound of a non-overlapping segment as its key, and the list or set of matching ranges as its value. Entries should be indexed with a sorted container (tree, skip list, B-tree, etc.).

To find the ranges matching 5, we look up the greatest key that is lower than or equal to 5 (4 in this example), and its list of ranges is the answer ([A C B]).

With this data structure, the complexity of the queries really is O(log n). However, it is not trivial (and it is expensive) to build and maintain the index. It can be implemented with both mongodb and Redis.
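To make the structure itself concrete (independently of any database), here is a minimal Python sketch using only the standard bisect module; the helper names are made up, and the naive build step rescans all ranges at every boundary, which is where the maintenance cost shows up:

import bisect

def build_flat_index(ranges):
    # ranges: dict mapping a name to an inclusive (start, end) pair
    # every start, and every end+1, is a boundary where the covering set changes
    bounds = {0}
    for s, e in ranges.values():
        bounds.add(s)
        bounds.add(e + 1)
    keys, values = [], []
    for p in sorted(bounds):
        keys.append(p)
        values.append([n for n, (s, e) in ranges.items() if s <= p <= e])
    return keys, values

def query(keys, values, x):
    # binary search for the greatest key <= x, i.e. O(log n)
    i = bisect.bisect_right(keys, x) - 1
    return values[i] if i >= 0 else []

ranges = {'A': (2, 8), 'B': (4, 6), 'C': (2, 9), 'D': (7, 10)}
keys, values = build_flat_index(ranges)
print(query(keys, values, 5))   # ['A', 'B', 'C']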

Here is an example with Redis:

> rpush range:2 2-8 2-9
(integer) 2
> rpush range:4 2-8 2-9 4-6
(integer) 3
> rpush range:7 2-8 2-9 7-10
(integer) 3
> rpush range:9 2-9 7-10
(integer) 2
> rpush range:10 7-10
(integer) 1
> zadd range_index 0 range:0 2 range:2 4 range:4 7 range:7 9 range:9 10 range:10
(integer) 6
> zrevrangebyscore range_index 5 0 LIMIT 0 1
1) "range:4"
> lrange range:4 0 -1
1) "2-8"
2) "2-9"
3) "4-6"

In redis, how to query against a large table efficiently?

Make each column a sorted set, then use ZRANGEBYSCORE on each key and do the intersection and count in the application. I use phpredis, and I do that a lot in memory using array_intersect.

The performance problem is in ZADD, which you will use to create the sorted sets.

Once you have all the sorted sets created in Redis' memory, the rest is really fast.


Creating sorted sets (Redis sample)

ZADD col1 12.3 line1
ZADD col1 79.3 line2
ZADD col1 55.6 line3

ZADD col2 37.4 line1
ZADD col2 -6.3 line2
ZADD col2 -7.3 line3

PHP, finding ranges, intersection and count

$COL1 = $redis->zrangebyscore('col1', -10, 10);
$COL2 = $redis->zrangebyscore('col2', 2010, 2012);
$count = count(array_intersect($COL1, $COL2));

Hope that helps.

Best solution for finding 1 x 1 million set intersection? Redis, Mongo, other

This is an interesting problem, and I think Redis can help here.

Redis can store sets of integers using an optimized "intset" format. See http://redis.io/topics/memory-optimization for more information.

I believe the correct data structure here is a collection of targeted tag sets, plus a reverse index to map tags to their targeted tag sets.

To store two targeted tag sets:

0 -> [ 1 2 3 4 5 6 7 8 ]
1 -> [ 2 6 7 8 9 10 ]

I would use:

# Targeted tag sets
sadd tgt:0 1 2 3 4 5 6 7 8
sadd tgt:1 2 6 7 8 9 10
# Reverse index
sadd tag:0 0
sadd tag:1 0
sadd tag:2 0 1
sadd tag:3 0
sadd tag:4 0
sadd tag:5 0
sadd tag:6 0 1
sadd tag:7 0 1
sadd tag:8 0 1
sadd tag:9 1
sadd tag:10 1

This reverse index is quite easy to maintain when targeted tag sets are added/removed from the system.

The global memory consumption depends on the number of tags which are common to multiple targeted tag sets. It is quite easy to store pseudo-data in Redis and simulate the memory consumption. I have done it using a simple node.js script.

For 1 million targeted tag sets (tags being 8-digit numbers, 40 tags per set), memory consumption is close to 4 GB when very few tags are shared between the targeted tag sets (more than 32M entries in the reverse index), and about 500 MB when the tags are shared a lot (only 100K entries in the reverse index).

With this data structure, finding the targeted tag sets containing all the tags of a given customer is extremely efficient.

1- Get customer tag set (suppose it is 1 2 3 4)
2- SINTER tag:1 tag:2 tag:3 tag:4
=> result is a list of targeted tag sets having all the tags of the customer

The intersection operation is efficient because Redis is smart enough to order the sets by cardinality and start with the one having the lowest cardinality.

Now I understand you need to implement the converse operation (i.e. finding the targeted tag sets having all their tags in the customer tag set). The reverse index can still help.

Here is an example in ugly pseudo-code:

1- Get customer tag set (suppose it is 1 2 3 4)
2- SUNIONSTORE tmp tag:1 tag:2 tag:3 tag:4
=> result is a list of targeted tag sets having at least one tag in common with the customer
3- For t in tmp (iterating on the selected targeted tag sets)
       n = SCARD tgt:t (cardinality of the targeted tag set)
       intersect = SINTER customer tgt:t
       if n == len(intersect), this targeted tag set matches

So you never have to test the customer tag set against 1M targeted tag sets. You can rely on the reverse index to restrict the scope of the search to an acceptable level.
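A rough translation of both queries with the redis-py client; the tmp and customer:* key names and the scan loop are illustrative assumptions, not a drop-in implementation:

import redis

r = redis.Redis()

def sets_containing_all(customer_tags):
    # forward query: targeted tag sets that contain every one of the customer's tags
    return r.sinter(['tag:%d' % t for t in customer_tags])

def sets_covered_by(customer_key, customer_tags):
    # converse query: targeted tag sets whose tags are ALL present in the customer set
    r.sunionstore('tmp', ['tag:%d' % t for t in customer_tags])
    matches = []
    for t in r.smembers('tmp'):
        tgt = b'tgt:' + t
        # the targeted set matches when intersecting it with the customer set
        # removes nothing, i.e. the cardinalities are equal
        if r.scard(tgt) == len(r.sinter([customer_key, tgt])):
            matches.append(tgt)
    return matches

r.sadd('customer:42', 1, 2, 3, 4)
print(sets_containing_all([1, 2, 3, 4]))             # e.g. {b'0'}
print(sets_covered_by('customer:42', [1, 2, 3, 4]))  # [] with the sample data above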

Best NoSql for querying date ranges?

I've decided to go with Mongo for the time being.

I found that setup/deployment was relatively easy, and the C# wrapper was adequate for what we're trying to do (and in the cases where it's not, we can resort to JavaScript queries easily).

How much faster is Redis than mongoDB?

Rough results from the benchmark below: Redis is roughly 2x faster for writes and 3x faster for reads.

Here's a simple benchmark in Python that you can adapt to your purposes; I was looking at how well each performs at simply setting and retrieving values:

#!/usr/bin/env python2.7
import sys, time
from pymongo import Connection
import redis

# connect to redis & mongodb
redis = redis.Redis()
mongo = Connection().test
collection = mongo['test']
collection.ensure_index('key', unique=True)

def mongo_set(data):
    for k, v in data.iteritems():
        collection.insert({'key': k, 'value': v})

def mongo_get(data):
    for k in data.iterkeys():
        val = collection.find_one({'key': k}, fields=('value',)).get('value')

def redis_set(data):
    for k, v in data.iteritems():
        redis.set(k, v)

def redis_get(data):
    for k in data.iterkeys():
        val = redis.get(k)

def do_tests(num, tests):
    # setup dict with key/values to retrieve
    data = {'key' + str(i): 'val' + str(i)*100 for i in range(num)}
    # run tests
    for test in tests:
        start = time.time()
        test(data)
        elapsed = time.time() - start
        print "Completed %s: %d ops in %.2f seconds : %.1f ops/sec" % (test.__name__, num, elapsed, num / elapsed)

if __name__ == '__main__':
    num = 1000 if len(sys.argv) == 1 else int(sys.argv[1])
    tests = [mongo_set, mongo_get, redis_set, redis_get] # order of tests is significant here!
    do_tests(num, tests)

Results with mongodb 1.8.1, redis 2.2.5, and the latest pymongo/redis-py:

$ ./cache_benchmark.py 10000
Completed mongo_set: 10000 ops in 1.40 seconds : 7167.6 ops/sec
Completed mongo_get: 10000 ops in 2.38 seconds : 4206.2 ops/sec
Completed redis_set: 10000 ops in 0.78 seconds : 12752.6 ops/sec
Completed redis_get: 10000 ops in 0.89 seconds : 11277.0 ops/sec

Take the results with a grain of salt, of course! If you are programming in another language, using other clients or different implementations, etc., your results will vary wildly. Not to mention your usage will be completely different! Your best bet is to benchmark them yourself, in precisely the manner you intend to use them. As a corollary, you'll probably figure out the best way to make use of each. Always benchmark for yourself!

Get a range of keys with redis?

Option 2: put the IDs into a sorted set, then use MGET to get the values out. If your values are hashes, you need to issue multiple HGET calls instead, but the advantage is that you can pull out just the specific parts of the object you actually need instead of everything. It is very fast in practice; see the sketch below.
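Roughly, the pattern looks like this with redis-py (3.x style zadd); the key names item_index and item:<id> are made up for the illustration:

import redis

r = redis.Redis()

# index: sorted set mapping each item id to a numeric score (e.g. a timestamp)
r.zadd('item_index', {'1': 100, '2': 150, '3': 300})
r.mset({'item:1': 'alice', 'item:2': 'bob', 'item:3': 'carol'})

# 1) get the ids whose score falls in the requested range
ids = r.zrangebyscore('item_index', 100, 200)
# 2) fetch the corresponding values in a single round trip
values = r.mget(['item:%s' % i.decode() for i in ids])

# if the values were hashes instead of plain strings, step 2 would be a
# pipeline of HGET/HMGET calls pulling only the fields you need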

More than 4 billion key value pairs in Redis?

You can store 4B items in Redis with no specific degradation of performance, but you need the memory for this (i.e. everything must fit in memory).

The optimal way to implement this kind of query with Redis has been described here:

store ip ranges in Redis

and here:

Redis or Mongo for determining if a number falls within ranges?

So the complexity of the optimal solution depends on whether the IP ranges are allowed to overlap or not.


