Indexing and Querying High-Dimensional Data in PostgreSQL

Indexing and querying high-dimensional data in PostgreSQL

Perhaps a better choice would be the cube extension, since your area of interest is not an individual integer but the full vector.

Cube supports GiST indexing, and PostgreSQL 9.6 will also bring KNN index searches to cubes, supporting Euclidean, taxicab (aka Manhattan) and Chebyshev distances.
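
For reference, the three distances map to these operators in 9.6 (a small sketch, treating cubes as points):

select cube(ARRAY[0,0,0]) <-> cube(ARRAY[1,2,2]) as euclidean,  -- sqrt(1+4+4) = 3
       cube(ARRAY[0,0,0]) <#> cube(ARRAY[1,2,2]) as taxicab,    -- 1+2+2 = 5
       cube(ARRAY[0,0,0]) <=> cube(ARRAY[1,2,2]) as chebyshev;  -- max(1,2,2) = 2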

It is a bit annoying that 9.6 is still in development; however, there is no problem backporting the cube extension patch to 9.5, and I say that from experience.

Hopefully 128 dimensions will still be few enough to get meaningful results.

How to do this?

First have an example table:

create extension cube;
create table vectors (id serial, vector cube);

Populate table with example data:

insert into vectors
select id, cube(ARRAY[round(random()*1000), round(random()*1000),
                      round(random()*1000), round(random()*1000),
                      round(random()*1000), round(random()*1000),
                      round(random()*1000), round(random()*1000)])
from generate_series(1, 2000000) id;

Then try selecting:

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
                                                           QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=123352.07..123352.09 rows=10 width=76) (actual time=1705.499..1705.501 rows=10 loops=1)
   ->  Sort  (cost=123352.07..129852.07 rows=2600000 width=76) (actual time=1705.496..1705.497 rows=10 loops=1)
         Sort Key: (('(966, 82, 765, 343, 600, 718, 338, 505)'::cube <#> vector))
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Seq Scan on vectors  (cost=0.00..67167.00 rows=2600000 width=76) (actual time=0.038..998.864 rows=2600000 loops=1)
 Planning time: 0.172 ms
 Execution time: 1705.541 ms
(7 rows)

We should create an index:

create index vectors_vector_idx on vectors using gist (vector);

Does it help?

explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;

                                                                     QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.41..1.93 rows=10 width=76) (actual time=41.339..143.915 rows=10 loops=1)
   ->  Index Scan using vectors_vector_idx on vectors  (cost=0.41..393704.41 rows=2600000 width=76) (actual time=41.336..143.902 rows=10 loops=1)
         Order By: (vector <#> '(966, 82, 765, 343, 600, 718, 338, 505)'::cube)
 Planning time: 0.146 ms
 Execution time: 145.474 ms
(5 rows)

At 8 dimensions, it clearly does: execution time drops from about 1700 ms to about 145 ms.

PostgreSQL index for a 100-dimensional, 25-million-row table

The problem was caused by having too little RAM (64 GB) for this task. It looks like the whole table is loaded into RAM and only then searched; with indexes, the table weighs 100 GB.
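
To check how a table compares to available RAM, you can measure it directly; a quick sketch, using the vectors table from above as a stand-in:

select pg_size_pretty(pg_total_relation_size('vectors'));      -- table + indexes + TOAST
select pg_size_pretty(pg_relation_size('vectors'));            -- heap alone
select pg_size_pretty(pg_relation_size('vectors_vector_idx')); -- a single index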

Querying a PostgreSQL multi-dimensional array data type in Rails 4

"I am using a PostgreSQL multi-dimensional array to mimic an array of hashes."

Those two things aren't really all that similar, and I wouldn't recommend attempting to use multidimensional arrays to model nested hashes.

Pavel is quite right that hstore is probably a lot closer to what you want, and it's indexable too. However, the current version of hstore (in Pg 9.3 and older) supports only single-level keys; it's a dictionary/hash that can contain only scalar string values. A planned enhancement to hstore for PostgreSQL 9.4 will hopefully bring multi-level nesting and JSON syntax compatibility.
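
For single-level keys, plain hstore is already quite workable; a minimal sketch (table, column and key names are invented):

CREATE EXTENSION IF NOT EXISTS hstore;

CREATE TABLE settings(id serial PRIMARY KEY, attrs hstore);

-- A GIN index accelerates @> (contains), ? (key exists), ?| and ?&:
CREATE INDEX settings_attrs_idx ON settings USING gin (attrs);

INSERT INTO settings(attrs) VALUES ('"colour"=>"blue", "size"=>"L"');

SELECT attrs -> 'size' FROM settings WHERE attrs @> '"colour"=>"blue"';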

Ordinary tables

You can model arbitrary-depth key/value chains (and trees/graphs) using edge lists and recursive CTEs, but this is probably rather more complexity than you really want.
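
Still, a sketch of that edge-list approach, with an invented schema, in case it's useful:

-- Each row points at its parent; a NULL parent marks a root key.
CREATE TABLE kv_node(
    id        serial PRIMARY KEY,
    parent_id integer REFERENCES kv_node(id),
    key       text NOT NULL,
    value     text               -- NULL for interior (non-leaf) nodes
);

-- Walk down from one root key, accumulating the dotted key path:
WITH RECURSIVE walk AS (
    SELECT id, key AS path, value
    FROM kv_node
    WHERE parent_id IS NULL AND key = 'key1'
  UNION ALL
    SELECT c.id, walk.path || '.' || c.key, c.value
    FROM kv_node c
    JOIN walk ON c.parent_id = walk.id
)
SELECT path, value FROM walk WHERE value IS NOT NULL;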

If you only need a fixed two-level key/value list, simply use a table that lists both key levels:

CREATE TABLE twolevel(key1 text, key2 text, thevalue text not null, PRIMARY KEY(key1,key2));

This lets you constrain against duplicate key pairs, which is nice.
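
Usage is then plain SQL (values invented to match the later examples):

INSERT INTO twolevel VALUES ('key1', 'key1.1', 'value1.1');

-- A duplicate key pair is rejected by the primary key:
-- INSERT INTO twolevel VALUES ('key1', 'key1.1', 'other');  -- ERROR

SELECT thevalue FROM twolevel WHERE key1 = 'key1' AND key2 = 'key1.1';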

You can also use two tables with a foreign key relationship between them. This gives you cascade deletes if you want, so removing a top level key removes all sub-level keys and associated values. It's easy enough to do that with the single-table approach, though.
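
A sketch of that two-table variant, with invented table names, using ON DELETE CASCADE:

CREATE TABLE toplevel(
    key1 text PRIMARY KEY
);

CREATE TABLE sublevel(
    key1     text NOT NULL REFERENCES toplevel(key1) ON DELETE CASCADE,
    key2     text NOT NULL,
    thevalue text NOT NULL,
    PRIMARY KEY (key1, key2)
);

-- Removing a top-level key now removes all of its sub-level entries:
DELETE FROM toplevel WHERE key1 = 'key1';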

Use one of these two approaches unless you have good reasons to do otherwise.

Hstore as text

Until the extended hstore is available, one option would be to store text representations of nested hstore fields. This is not pretty or efficient, but it's probably better than trying to search a multidimensional array.

CREATE TABLE nested_hstore(id integer, blah hstore);

insert into nested_hstore(id, blah) values
(1, hstore(
        ARRAY['key1','key2'],
        ARRAY['"key1.1"=>"value1.1", "key1.2"=>"value1.2"',
              '"key2.1"=>"value2.1", "key2.2"=>"value2.2"']::hstore[]::text[]
));

Test:

regress=> select (blah->'key1')::hstore->'key1.1' from nested_hstore ;
?column?
----------
value1.1
(1 row)

Because the hstore must be parsed each time, it's not going to be super-fast, and you won't get the usual indexing benefits on the second level. Still, it's an option if you really genuinely need two-level hashes in fields.
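
If one second-level key is queried a lot, an expression index on that parsed lookup is a possible partial workaround; a sketch reusing the example's key names (it assumes the text-to-hstore cast is accepted in an index expression):

CREATE INDEX nested_hstore_key11_idx
    ON nested_hstore ((((blah -> 'key1')::hstore) -> 'key1.1'));

-- Queries spelling out the exact same expression can use the index:
SELECT * FROM nested_hstore
WHERE ((blah -> 'key1')::hstore) -> 'key1.1' = 'value1.1';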

Tables of hstore values

You can combine these two quite reasonably.

CREATE TABLE twolevel(key1 text, level2keyvalues hstore);
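
Lookups then mix a relational filter with an hstore fetch (values invented):

INSERT INTO twolevel VALUES
    ('key1', '"key1.1"=>"value1.1", "key1.2"=>"value1.2"');

SELECT level2keyvalues -> 'key1.1' FROM twolevel WHERE key1 = 'key1';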

It seems pretty ugly to me, though; I'd prefer to be consistent one way or the other.

SQL/XML

Another option is to use SQL/XML, which you can index on arbitrary XPath expressions. Again, this seems a bit too complicated.
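
A sketch of what that could look like, with an invented table and document shape:

CREATE TABLE xml_store(id serial PRIMARY KEY, doc xml);

-- xpath() returns xml[]; index the first match, cast to text:
CREATE INDEX xml_store_key11_idx
    ON xml_store ((((xpath('/root/key1/key11/text()', doc))[1])::text));

SELECT (xpath('/root/key1/key11/text()', doc))[1]::text
FROM xml_store;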

PostgreSQL multicolumn index for BETWEEN and ORDER BY

I do not think it is possible (at least in vanilla PostgreSQL; I do not know of an extension that could help with that). The sort step can be skipped only when an index already produces rows in the requested order.

However:

  1. As mentioned in the doc, only B-tree indexes can be used for sorting (which makes sense, as a B-tree is a search tree).
  2. Your WHERE and your ORDER BY are incompatible for a single B-tree index:

    • Because you have both clauses, you need to put two columns in the index: (A, B).
    • Data in the index is sorted by (A, B), and therefore also by A (which is why PostgreSQL can index-scan the table fast when the WHERE is on A only); but as a consequence it is not sorted by B across the whole table, only within each subset where A is constant.
    • As you probably already know, an index on B alone would be of little help because of the WHERE.

The example #2 you provided shows that PostgreSQL is well-optimized for the case where you filter on a single value of A.
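
A sketch of the contrast (table and column names assumed):

CREATE TABLE t(a integer, b integer);
CREATE INDEX t_a_b_idx ON t (a, b);

-- Equality on A: rows come out of the index already ordered by B,
-- so the plan needs no Sort node:
EXPLAIN SELECT * FROM t WHERE a = 5 ORDER BY b;

-- Range on A: the index yields rows ordered by (A, B), not by B alone,
-- so a Sort node remains:
EXPLAIN SELECT * FROM t WHERE a BETWEEN 1 AND 10 ORDER BY b;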

If sorting on the two columns (A, B) is unacceptable, then I'm afraid you shouldn't expect more than this.


