Data Structure for Fast Lookup with Multiple Criteria

Need a proper data structure or index for fast user lookup based on 3D points and importance factors

Since you haven't gotten any answers yet, I thought I would at least contribute some thoughts. I have used a Python k-d tree module for quickly searching nearest-neighbor points:

http://code.google.com/p/python-kdtree/downloads/detail?name=kdtree.py

It takes points of arbitrary dimension, as long as all the points have the same number of dimensions.

I'm not sure how you would want to apply the weighting of the "importance", but here is just a brainstorm on how to use the kdtree module to at least get the closest "people" to each point of a given person's set:

import numpy
from kdtree import KDTree
from itertools import chain

class PersonPoint(object):

    def __init__(self, person, point, factor):
        self.person = person
        self.point = point
        self.factor = factor

    def __repr__(self):
        return '<%s: %s, %0.2f>' % (self.person,
            ['%0.2f' % p for p in self.point], self.factor)

    def __iter__(self):
        # Must return an iterator; returning self.point directly
        # (a plain list) would break iteration.
        return iter(self.point)

    def __len__(self):
        return len(self.point)

    def __getitem__(self, i):
        return self.point[i]

people = {}
for name in ('bill', 'john', 'mary', 'jenny', 'phil', 'george'):
    factors = numpy.random.rand(6)
    points = numpy.random.rand(6, 3).tolist()
    people[name] = [PersonPoint(name, p, f) for p, f in zip(points, factors)]

bill_points = people['bill']
others = list(chain(*[people[name] for name in people if name != 'bill']))

tree = KDTree.construct_from_data(others)

for point in bill_points:
    # t=1 means only return the 1 closest.
    # You could set it higher to return more.
    print(point, "=>", tree.query(point, t=1)[0])

Results:

<bill: ['0.22', '0.64', '0.14'], 0.07> => <phil: ['0.23', '0.54', '0.11'], 0.90>
<bill: ['0.31', '0.87', '0.16'], 0.88> => <phil: ['0.36', '0.80', '0.14'], 0.40>
<bill: ['0.34', '0.64', '0.25'], 0.65> => <jenny: ['0.29', '0.77', '0.28'], 0.40>
<bill: ['0.24', '0.90', '0.23'], 0.53> => <jenny: ['0.29', '0.77', '0.28'], 0.40>
<bill: ['0.50', '0.69', '0.06'], 0.68> => <phil: ['0.36', '0.80', '0.14'], 0.40>
<bill: ['0.13', '0.67', '0.93'], 0.54> => <jenny: ['0.05', '0.62', '0.94'], 0.84>

I figured that with these results you could look at the most frequently matched person, or weigh the matches by their importance factors: for example, total up the importance factors across the matches and take the highest-rated person, as sketched below. That way, if mary matched only once but with a factor of 10, while phil matched three times but only totaled 5, mary might be the more relevant one.
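
To make the "total up the importance factors" idea concrete, here is a minimal sketch; it assumes the tree, bill_points and PersonPoint objects from the code above:

from collections import defaultdict

def rank_matches(matches):
    # Sum the importance factor of each matched PersonPoint per person,
    # then sort people by that total, highest first.
    totals = defaultdict(float)
    for m in matches:
        totals[m.person] += m.factor
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

matches = [tree.query(point, t=1)[0] for point in bill_points]
print(rank_matches(matches))
# e.g. [('phil', 1.70), ('jenny', 1.64)] for the run shown above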

I know you have a more robust function for creating an index, but it would require going through every point in your collection.

Which data structures to use when storing multiple entities with multiple query criteria?

For handing out the storage space number I used a min-heap (a PriorityQueue), which supports both insertion and removal in O(log n) time.

I used two BiMaps, self-built data structures, to store the mappings between UIN, color and storage space number. Internally, each BiMap uses a HashMap and an array of size N.

In the first BiMap (BiMap1), a HashMap<Color, Set<StorageSpace>> stores the mapping from a color to its set of storage spaces, and a String[] colorSpace array stores the color at each storage space index.

In the second BiMap (BiMap2), a HashMap<UIN, StorageSpace> stores the mapping between UIN and storage space, and a String[] uinSpace array stores the UIN at each storage space index.

Querying is straightforward with this approach:

  1. When the input is a color, show all the UINs associated with it.
    Get the set of storage spaces from BiMap1; for those spaces, use the array in BiMap2 to get the corresponding UINs.
  2. When the input is a color, show all the numbers where those packages are placed (the storage space numbers). Use BiMap1's HashMap to get the set.
  3. Show where an item with a given UIN is placed, i.e. its storage space number. Use BiMap2 to get the value from its HashMap.

Now when we are given a storage space to remove, both BiMaps have to be updated. In BiMap1, get the color from the array, look up the corresponding Set, and remove the space number from that set. In BiMap2, get the UIN from the array, then remove that UIN's entry from the HashMap.

For both BiMaps the removal and insert operations are O(1), and the min-heap works in O(log n), hence the total time complexity is O(log n).
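
Here is a rough Python sketch of the whole scheme; the names are illustrative rather than taken from any particular library. A min-heap hands out the lowest free space number, and each "BiMap" is a dict paired with an array indexed by space number:

import heapq

class Storage:
    def __init__(self, n):
        self.free = list(range(n))          # min-heap of free space numbers
        heapq.heapify(self.free)
        self.color_to_spaces = {}           # color -> set of space numbers
        self.space_to_color = [None] * n    # space number -> color
        self.uin_to_space = {}              # UIN -> space number
        self.space_to_uin = [None] * n      # space number -> UIN

    def store(self, uin, color):
        space = heapq.heappop(self.free)                           # O(log n)
        self.color_to_spaces.setdefault(color, set()).add(space)   # O(1)
        self.space_to_color[space] = color
        self.uin_to_space[uin] = space
        self.space_to_uin[space] = uin
        return space

    def remove(self, space):
        color = self.space_to_color[space]
        self.color_to_spaces[color].discard(space)                 # O(1)
        self.space_to_color[space] = None
        uin = self.space_to_uin[space]
        del self.uin_to_space[uin]                                 # O(1)
        self.space_to_uin[space] = None
        heapq.heappush(self.free, space)                           # O(log n)

    def uins_for_color(self, color):
        return [self.space_to_uin[s]
                for s in self.color_to_spaces.get(color, ())]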

Data structure for making decisions on multiple conditions

In fact, what you describe is not an example of a decision tree. Still, there are some ways you can optimize your process. I would advise you to compute some kind of hash value for the attribute set in each mapping. For unset attributes, add a "fake" sentinel value that means "unset".
After that, iterate over the file and compare the hash computed for the query's attribute set to the hash of each mapping's attribute set. Only compare the attributes themselves if the hashes are the same (to avoid collision problems). This approach should speed up comparisons significantly.

You can further improve the above approach: create a hash map from hash codes to the mappings that have that hash code. Don't forget to keep the mappings for a hash code in the same order as they appear in the file! After that you will only iterate over the mappings that have the same hash code, and in the perfect case where no collisions happen this is as good as you can get.
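
A minimal Python sketch of this scheme, assuming each mapping is a dict from attribute name to value (all names here are illustrative):

UNSET = "<unset>"   # sentinel meaning "this attribute is not set"

def attr_hash(mapping, attributes):
    # Hash the full attribute tuple, substituting the sentinel for
    # attributes the mapping does not constrain.
    return hash(tuple(mapping.get(a, UNSET) for a in attributes))

def build_index(mappings, attributes):
    # hash code -> mappings with that hash code, kept in file order.
    index = {}
    for m in mappings:
        index.setdefault(attr_hash(m, attributes), []).append(m)
    return index

def lookup(index, query, attributes):
    # Only the bucket with the matching hash is scanned; the
    # attribute-by-attribute comparison guards against collisions.
    candidates = index.get(attr_hash(query, attributes), [])
    want = {a: query.get(a, UNSET) for a in attributes}
    return [m for m in candidates
            if all(m.get(a, UNSET) == want[a] for a in attributes)]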

Data structure with fast sorted insertion, sorted deletion and lookup

I believe such a data structure does not exist. Constant-time lookup of any element (i.e. indexing) requires contiguous memory, which makes insertion impossible to do in less than O(n) if you want to keep the range sorted.

There are arguments for hash tables and wrappers around hash tables, but there are two things to keep in mind when mentioning them:

  • Hash tables have average access (insertion, deletion, find) in O(1), but that assumes very few hash collisions. If you wish to meet the requirements in regard to pessimistic complexities, hash tables are out of the question since their pessimistic access time is O(n).

  • Hash tables are, by their nature, unordered. They, most of the time, do have internal arrays for storing (for example) buckets of data, but the elements themselves are neither all in contiguous memory nor ordered by some property (other than maybe modulo of their hash, which itself should produce very different values for similar objects).

To not leave you empty-handed - if you wish to get as close to your requirements as possible, you need to specify which complexities you would sacrifice to achieve the others. I'd like to propose either std::priority_queue or std::multiset.

  • std::priority_queue provides access only to the top element - it is guaranteed to be either the smallest or the greatest (depending on the comparator you use to specify the relation) element in the collection. Insertion and deletion are both achieved in O(log n) time.

  • std::multiset* provides access to every element inside it, but at a higher cost - O(log n). It also achieves O(log n) in insertion and deletion.


*Careful - do not confuse with std::set, which does not allow for duplicate elements.
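
For reference, here is a rough Python analogue of the priority-queue option, using the standard heapq module (Python has no standard multiset; sortedcontainers.SortedList is a common third-party substitute):

import heapq

pq = []
for x in (5, 1, 4, 1, 3):
    heapq.heappush(pq, x)      # O(log n) insertion
print(pq[0])                   # peek at the smallest element: 1
print(heapq.heappop(pq))       # O(log n) removal of the smallest: 1
print(heapq.heappop(pq))       # duplicates are allowed: 1 again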

Data structure for quick time interval look up

What you are looking for is an Interval Tree (which is a type of Range Tree).

These have logarithmic lookup time like other tree structures (e.g., RB trees), so you should see comparable performance to using something like a Java TreeMap or an STL map.

  • Code for Red-black trees and interval trees from MIT
  • There is a C++ implementation in the CGAL Library.
  • Here's a C# Implementation
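
If you want to experiment before committing to a library, here is a compact Python sketch of a centered interval tree: build it once from (start, end) pairs, then "stab" it with a point to get every interval containing that point. Balancing and deletion are omitted for brevity:

class IntervalTree:
    def __init__(self, intervals):
        self.center = None
        self.left = self.right = None
        self.by_start = []
        self.by_end = []
        if not intervals:
            return
        # Pick the median endpoint as the center.
        endpoints = sorted(p for iv in intervals for p in (iv[0], iv[1]))
        self.center = endpoints[len(endpoints) // 2]
        left, right, here = [], [], []
        for iv in intervals:
            if iv[1] < self.center:
                left.append(iv)
            elif iv[0] > self.center:
                right.append(iv)
            else:
                here.append(iv)       # overlaps the center
        self.by_start = sorted(here)                       # ascending start
        self.by_end = sorted(here, key=lambda iv: -iv[1])  # descending end
        if left:
            self.left = IntervalTree(left)
        if right:
            self.right = IntervalTree(right)

    def stab(self, x):
        # Return every interval that contains point x.
        if self.center is None:
            return []
        out = []
        if x < self.center:
            for iv in self.by_start:
                if iv[0] > x:
                    break
                out.append(iv)
            if self.left:
                out += self.left.stab(x)
        elif x > self.center:
            for iv in self.by_end:
                if iv[1] < x:
                    break
                out.append(iv)
            if self.right:
                out += self.right.stab(x)
        else:
            out = list(self.by_start)
        return out

tree = IntervalTree([(1, 5), (3, 9), (8, 12)])
print(tree.stab(4))   # [(3, 9), (1, 5)] - both contain 4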

How would you implement a fast lookup by multiple keys?

You could still use regular dictionaries where the key could be a custom type like this:

public class CompositeKey
{
    // Optional parameters let you build partial keys
    // (e.g. first name only), as in the examples below.
    public CompositeKey(string firstName = null, string lastName = null, string zipCode = null)
    {
        FirstName = firstName;
        LastName = lastName;
        ZipCode = zipCode;
    }

    public string FirstName { get; }
    public string LastName { get; }
    public string ZipCode { get; }

    // Two keys are equal when all three components match,
    // and the hash code combines all three components.
    public override bool Equals(object obj) =>
        obj is CompositeKey other &&
        FirstName == other.FirstName &&
        LastName == other.LastName &&
        ZipCode == other.ZipCode;

    public override int GetHashCode() =>
        (FirstName, LastName, ZipCode).GetHashCode();
}

Note the Equals and GetHashCode overrides on CompositeKey: they define what makes a composite key unique, so that Dictionary<TKey, TValue> is able to store and look up unique composite keys.

Finally, I would be able to query the dictionary this way:

var value = dict[new CompositeKey(firstName: "Matías", lastName: "Fidemraizer")];

OP asked this question in some comment:

I thought about this approach but how would you query the dictionary
for "FirstName = "Matias" only?

Since you're overriding both Equals and GetHashCode, you can add all combinations as keys in the same dictionary and they can all co-exist there:

Person person = new Person { /* Set members here */ };

// Note that I'll add many keys that map to the same value
dict.Add(new CompositeKey(firstName: "Matías"), person);
dict.Add(new CompositeKey(lastName: "Fidemraizer"), person);
dict.Add(new CompositeKey(firstName: "Matías", lastName: "Fidemraizer"), person);

Each key will result in a different hash code, so they can all co-exist in the same dictionary, providing a powerful tool to query by many criteria and criteria combinations.

Another approach

Another approach could be to use multiple dictionaries whose keys are concatenations of the component values, following some convention, and whose values are instances of the class:

Dictionary<string, Person> names = new Dictionary<string, Person>();
names.Add("matias", new Person { /* Set members here */ });

Dictionary<string, Person> fullNames = new Dictionary<string, Person>();
fullNames.Add("matias:fidemraizer", new Person { /* Set members here */ });

// And so on, for every criteria you want to search...

Later you would implement a proxy to determine which dictionary to query based on the given criteria, as sketched below.
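
For illustration, here is a tiny Python rendering of that proxy idea (all names are hypothetical):

from collections import namedtuple

Person = namedtuple("Person", "first last")

class PersonIndex:
    # One dictionary per supported criteria combination.
    def __init__(self):
        self.by_first = {}   # "matias" -> Person
        self.by_full = {}    # "matias:fidemraizer" -> Person

    def add(self, person):
        first = person.first.lower()
        self.by_first[first] = person
        self.by_full["%s:%s" % (first, person.last.lower())] = person

    def find(self, first=None, last=None):
        # The "proxy": route the query to the right dictionary.
        if first and last:
            return self.by_full.get("%s:%s" % (first.lower(), last.lower()))
        if first:
            return self.by_first.get(first.lower())
        return None

index = PersonIndex()
index.add(Person("Matías", "Fidemraizer"))
print(index.find(first="Matías"))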

What about Redis

Actually, you should take a look at Redis, a key-value store that supports complex data structures as values: hashes, sets, sorted sets and many more. That is, you could centralize your cache and distribute it using Redis, and your cache could be consumed by many applications.

It's extremely simple to use and install (it's an executable of less than 10MB...).
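
As a taste, here is a minimal sketch using the redis-py client; it assumes a Redis server on localhost, and the key names are purely illustrative:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# One hash per person, plus a set per first name as a secondary index.
r.hset("person:1", mapping={"first": "Matías", "last": "Fidemraizer"})
r.sadd("index:first:matias", "person:1")

# Query: every person whose first name is "matias".
for key in r.smembers("index:first:matias"):
    print(r.hgetall(key))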

@Paparazzi has raised an issue with the dictionary approach

He has said:

What about a second person with the same first name?

If OP needs to consider this case (and it's not an exceptional case, so it's worth the effort!), it seems OP would need to store the data in a dictionary where the keys are the composite keys and the values are List<Person>, HashSet<Person> or even LinkedList<Person>.

Furthermore, this would mean that one key (slot) could store many persons, and a query like get a person with first name "Matías" would always return an implementation of IEnumerable<Person> (a list, a set, a linked list...) containing all the found persons:

ISet<Person> result;

if (dictionary.TryGetValue(new CompositeKey(firstName: "Matías"), out result))
{
    // I've got either one or many results, and I'll decide what to do
    // in each case!
}

Also, this enhanced approach has another possible issue. When you query with a composite key like new CompositeKey(firstName: "Matías") and the dictionary has stored more than one person with the first name "Matías", you'll get back an ISet<Person>, IList<Person> or LinkedList<Person>.

The first search has O(1) (constant-time) complexity because the composite key is stored based on its hash code, but the result of that first search isn't a dictionary anymore, and any search within it is going to be O(n) (the more items you get, the more time it takes to find a result).

By the way, if you're trying to find a person by first name alone, it's because you know you may get more than one result; you can't expect a single match unless only one person with that first name has been stored in the dictionary.

So it seems that you'll need to disambiguate the results when their count is greater than 1. This can be done either by performing another O(1) search with a composite key that has more components than just the first name, or by having a human user (or some artificial intelligence...) manually choose one of the results.

In summary:

  • If you look for a person by providing one component, you're taking the risk of getting more than one result. In that case, if it's an application with a UI or some kind of artificial intelligence, there should be no further search at all, but a direct choice of an item from the result (which is an O(1) operation):


ISet<Person> result;

if (dictionary.TryGetValue(new CompositeKey(firstName: "Matías"), out result))
{
    if (result.Count > 1)
    {
        // Here you would show the user what you've found in the UI,
        // and the user would choose one of the results directly,
        // which is an operation with O(1) complexity.
    }
    else
    {
        // OK, I got 0 or 1 results; this is easier than I thought! ;)
    }
}
  • If you look for a person by providing one component, and your application can automatically supply another component once it realizes it got more than one result, you won't search within the result; instead, you'll query the main dictionary again with a new composite key that has more components and, with luck, get a single result:
public ISet<Person> SearchPerson(CompositeKey key)
{
    ISet<Person> result;

    if (dictionary.TryGetValue(key, out result) && result.Count > 1)
    {
        // Oops! More than one result..... BUT I already know another
        // component that will make the key absolutely unique, so I'll
        // call this method recursively to specialize the search even
        // more. Obviously, I've hardcoded the ZIP code as a sample; in
        // a real-world case, who knows where I would get this ZIP code
        // from... Maybe from some geolocation query based on the current
        // user's location?
        // Wait, it might happen that another person called Matías lives
        // nearby and has stored the same ZIP code... Well, this goes
        // outside the scope of this Q&A. It's just an example of what to
        // do; in an actual application there would be many other choices
        // to disambiguate persons automatically...
        return SearchPerson(new CompositeKey(firstName: key.FirstName, zipCode: "03984"));
    }

    // OK, I got 0 or 1 results; this is easier than I thought! ;)
    return result;
}

Data structure with efficient manipulation and retrieval by both key and index

I think you can do this with two red-black trees: a key-lookup tree that stores the keys ordered by a compare function, and an index-lookup tree with the keys in arbitrary order, as in a list. Each index-lookup node must have a 'size' field: a red-black tree can do lookup by index if a 'size' field is included in each node. See, for example, the RedBlackTreeSet implementation in the C5 Generic Collection Library.

Each entry in the key-lookup tree needs a pointer across to its corresponding entry in the index-lookup tree. As well as left- and right-node pointers, the index-lookup tree will require a parent pointer field to allow bottom-to-top navigation as well as top-to-bottom.

In all, six pointers are required for each key: the usual left and right pointers in both nodes, plus the pointer from the key-lookup-node to the index-lookup-node, plus the parent pointer in each of the index-lookup-nodes. You would also need a pointer in each node to point to the stored value.

Operations:

Append - An append operation would insert the key into both trees - once in the key-lookup tree, at a position determined by the compare function, and again in the rightmost position of the index-lookup tree. Insertion into a red-black tree is a logarithmic time operation.

Lookup by key - this is done on the key-lookup tree, using the compare function to find the correct position - O(log(n))

Lookup by index - this can be done on the index-lookup field, as mentioned above - O(log(n))

Get index from key - first lookup the key in the key-lookup tree O(log(n)). Follow the pointer across to the index-lookup tree. Follow the parent pointers up to the root node, (O(log(n)) for a balanced tree). Use the 'size' fields on the way up to determine the index of the key. - O(log(n)) overall.

Delete by index - lookup the item in the index-lookup tree. Delete from the index-lookup tree. Lookup the located key in the key-lookup tree. Delete from the key-lookup tree. All operations are O(log(n)), so delete is O(log(n)) overall.

Delete by key - use 'Get index from key' to get the index of the key. Delete by index from the index-lookup tree. Delete by key from the key-lookup tree. O(log(n)) overall.

This structure also supports O(log(n)) insertion at any arbitrary position, not just at the end.

Storage overhead is obviously considerable, but remains O(n). Time complexity meets all the requirements.

Unfortunately I am not aware of any implementation of this structure.
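
The size-field ("order statistics") trick at the heart of the index-lookup tree is easy to sketch, though. Here is a toy Python version that uses a plain unbalanced BST in place of a red-black tree; a real implementation would also rebalance on insert:

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1        # number of nodes in this subtree

def size(node):
    return node.size if node else 0

def insert(root, key):
    # Plain BST insert; a red-black tree would also rebalance here.
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    root.size = 1 + size(root.left) + size(root.right)
    return root

def select(root, i):
    # Lookup by index: return the i-th smallest key (0-based).
    k = size(root.left)
    if i < k:
        return select(root.left, i)
    if i > k:
        return select(root.right, i - k - 1)
    return root.key

def rank(root, key):
    # Get index from key: the key's position in sorted order.
    if root is None:
        raise KeyError(key)
    if key < root.key:
        return rank(root.left, key)
    if key > root.key:
        return size(root.left) + 1 + rank(root.right, key)
    return size(root.left)

root = None
for k in (5, 2, 8, 1, 9):
    root = insert(root, k)
print(select(root, 2))   # 5
print(rank(root, 8))     # 3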

Update: It occurs to me that you can combine a tree with a hash table to get O(1) by-key lookup. Instead of having two trees, as I suggest above, use a hash table for the by-key lookup and a balanced order-statistics tree for the by-position lookup, as above, but have the slots of the hash table contain pointers to the nodes of the balanced tree for doing the get-list-position-by-key lookup. By-key lookups are now O(1), with everything else staying O(log(n)) on average. Of course, you now get the occasional O(n) re-hash penalty, as with any hash table.

Optimal storage for string/integer pairs with fast lookup of strings?

The second link that you include in the question is not applicable. That is a question concerning sorting rather than efficient lookup. Although you discuss sorting a number of times in your question, you do not have a requirement to sort. Your requirement is simply a dictionary, also known as an associative array. Of course, you can implement that by sorting an array and using binary search for your lookup, but sorting is not a requirement. You simply need an efficient dictionary.

Out of the box, the most efficient and convenient data structure for your problem is TDictionary<string, Integer>. This has lookup complexity of O(1) and so scales well to large collections. For smaller collections, a binary-search-based lookup with complexity O(log n) can be competitive and can indeed outperform a dictionary.

Cosmin Prund wrote an excellent answer here on SO where he compared the performance of dictionary lookup against binary search based lookup. I recommend you have a read. I would say that for small containers, performance is probably not that big a problem for you. So even though binary search may be quicker, it probably does not matter because your performance is good either way. But performance probably becomes an issue for larger containers and that's where the dictionary is always stronger. For large enough containers, the performance of binary search may become unacceptable.
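
To illustrate the two strategies side by side (in Python for brevity; the same shapes apply to TDictionary versus a sorted array with binary search):

import bisect

# Hash-based dictionary lookup: O(1) on average.
counts = {"apple": 3, "pear": 7, "plum": 1}
print(counts["pear"])                      # 7

# Sorted array plus binary search: O(log n) per lookup.
pairs = sorted(counts.items())             # [(key, value), ...] sorted by key
keys = [k for k, _ in pairs]
i = bisect.bisect_left(keys, "pear")
if i < len(keys) and keys[i] == "pear":
    print(pairs[i][1])                     # 7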

I'm sure that it is possible to produce more efficient implementations of dictionaries than the Embarcadero one, but I'd also say that the Embarcadero implementation is perfectly solid. It uses a decent hash function and does not have any glaring weaknesses.

In terms of memory complexity, there's little to choose between a dictionary and a sorted array. It's not possible to improve on a sorted array for memory use.

I suggest that you start with TDictionary<string, Integer> and only look beyond that if your performance requirements are not met.


