Indexing and querying high-dimensional data in PostgreSQL
Perhaps a better choice would be the cube extension, since your area of interest is not an individual integer but the full vector.
Cube supports GiST indexing, and PostgreSQL 9.6 will also bring KNN indexing to cubes, supporting Euclidean, taxicab (aka Manhattan) and Chebyshev distances.
It is a bit annoying that 9.6 is still in development; however, there is no problem backporting the patch for the cube extension to 9.5, and I say that from experience.
Hopefully 128 dimensions will still be enough to get meaningful results.
How to do this?
First, create an example table:
create extension cube;
create table vectors (id serial, vector cube);
Populate table with example data:
insert into vectors
select id, cube(ARRAY[round(random()*1000), round(random()*1000),
                      round(random()*1000), round(random()*1000),
                      round(random()*1000), round(random()*1000),
                      round(random()*1000), round(random()*1000)])
from generate_series(1, 2000000) id;
Then try selecting:
explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Limit (cost=123352.07..123352.09 rows=10 width=76) (actual time=1705.499..1705.501 rows=10 loops=1)
-> Sort (cost=123352.07..129852.07 rows=2600000 width=76) (actual time=1705.496..1705.497 rows=10 loops=1)
Sort Key: (('(966, 82, 765, 343, 600, 718, 338, 505)'::cube <#> vector))
Sort Method: top-N heapsort Memory: 26kB
-> Seq Scan on vectors (cost=0.00..67167.00 rows=2600000 width=76) (actual time=0.038..998.864 rows=2600000 loops=1)
Planning time: 0.172 ms
Execution time: 1705.541 ms
(7 rows)
We should create an index:
create index vectors_vector_idx on vectors using gist (vector);
Does it help?
explain analyze SELECT * from vectors
order by cube(ARRAY[966,82,765,343,600,718,338,505]) <#> vector asc limit 10;
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.41..1.93 rows=10 width=76) (actual time=41.339..143.915 rows=10 loops=1)
-> Index Scan using vectors_vector_idx on vectors (cost=0.41..393704.41 rows=2600000 width=76) (actual time=41.336..143.902 rows=10 loops=1)
Order By: (vector <#> '(966, 82, 765, 343, 600, 718, 338, 505)'::cube)
Planning time: 0.146 ms
Execution time: 145.474 ms
(5 rows)
At 8 dimensions, it does help.
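For reference, the cube KNN patch adds one distance operator per metric, and the metric used for the nearest-neighbour search is determined by which operator appears in the ORDER BY. A sketch, assuming the operator names that ship with 9.6:

```sql
-- Euclidean distance
SELECT * FROM vectors
ORDER BY vector <-> cube(ARRAY[966,82,765,343,600,718,338,505]) LIMIT 10;

-- Taxicab (Manhattan) distance
SELECT * FROM vectors
ORDER BY vector <#> cube(ARRAY[966,82,765,343,600,718,338,505]) LIMIT 10;

-- Chebyshev (maximum coordinate difference) distance
SELECT * FROM vectors
ORDER BY vector <=> cube(ARRAY[966,82,765,343,600,718,338,505]) LIMIT 10;
```

All three can be satisfied by the same GiST index; only the ordering operator changes.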
PostgreSQL index for a 100-dimensional, 25-million-row table
The problem was caused by the small amount of RAM (64 GB) available for this task. It looks like the table must be fully loaded into RAM before the search runs, and with its indexes the table weighs 100 GB.
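If you want to check whether your working set has any chance of fitting in RAM, you can measure the table and index sizes directly (a sketch using the standard size functions; the table name is from the earlier example):

```sql
-- Total on-disk size including indexes and TOAST
SELECT pg_size_pretty(pg_total_relation_size('vectors'));

-- Heap alone vs. the combined size of all its indexes
SELECT pg_size_pretty(pg_table_size('vectors'))   AS table_size,
       pg_size_pretty(pg_indexes_size('vectors')) AS index_size;
```

If the total comfortably exceeds available RAM, repeated scans will be dominated by disk I/O regardless of the index type.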
Querying a PostgreSQL multi-dimensional array data type in Rails 4
I am using a PostgreSQL multi-dimensional array to mimic an array of hashes
Those two things aren't really all that similar, and I wouldn't recommend attempting to use multidimensional arrays to model nested hashes.
Pavel is quite right that hstore is probably a lot closer to what you want, and it's indexable too. However, the current version of hstore (in Pg 9.3 and older) supports only single-level keys; it's a dictionary/hash that can contain only scalar string values. A planned enhancement to hstore for PostgreSQL 9.4 will hopefully bring multi-level nesting and JSON syntax compatibility.
Ordinary tables
You can model arbitrary-depth key/value chains (and trees/graphs) using edge lists and recursive CTEs, but this is probably rather more complexity than you really want.
If you only need a fixed two-level key/value list, simply use a table that lists both key levels:
CREATE TABLE twolevel(key1 text, key2 text, thevalue text not null, PRIMARY KEY(key1,key2));
This lets you constrain against duplicate key pairs, which is nice.
You can also use two tables with a foreign key relationship between them. This gives you cascade deletes if you want, so removing a top level key removes all sub-level keys and associated values. It's easy enough to do that with the single-table approach, though.
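A minimal sketch of that two-table variant with cascade deletes (table and column names are illustrative):

```sql
CREATE TABLE toplevel(
    key1 text PRIMARY KEY
);

CREATE TABLE sublevel(
    key1     text NOT NULL REFERENCES toplevel(key1) ON DELETE CASCADE,
    key2     text NOT NULL,
    thevalue text NOT NULL,
    PRIMARY KEY (key1, key2)
);

-- Removing a top-level key removes all its sub-level keys and values:
DELETE FROM toplevel WHERE key1 = 'somekey';
```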
Use one of these two approaches unless you have good reasons to do otherwise.
Hstore as text
Until the extended hstore is available, one option would be to store text representations of nested hstore fields. This is not pretty or efficient, but it's probably better than trying to search a multidimensional array.
CREATE TABLE nested_hstore(id integer, blah hstore);
insert into nested_hstore(id, blah) values
(1, hstore( ARRAY['key1','key2'], ARRAY['"key1.1"=>"value1.1", "key1.2"=>"value1.2"', '"key2.1"=>"value2.1", "key2.2"=>"value2.2"']::hstore[]::text[]));
Test:
regress=> select (blah->'key1')::hstore->'key1.1' from nested_hstore ;
?column?
----------
value1.1
(1 row)
Because the hstore must be parsed each time it's not going to be super-fast, and you won't get the usual indexing benefits on the second level. Still, it's an option if you really genuinely need two-level hashes in fields.
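You can still index the first-level keys normally; it is only the nested level that must be parsed row by row. A sketch, assuming the standard hstore GIN operator class:

```sql
-- GIN index on the outer hstore supports @>, ?, ?& and ?| on first-level keys
CREATE INDEX nested_hstore_blah_idx ON nested_hstore USING gin (blah);

-- This can use the index:
SELECT * FROM nested_hstore WHERE blah ? 'key1';

-- This cannot; the inner hstore text must be parsed for every candidate row:
SELECT * FROM nested_hstore
WHERE (blah->'key1')::hstore->'key1.1' = 'value1.1';
```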
Tables of hstore values
You can combine these two quite reasonably.
CREATE TABLE twolevel(key1 text, level2keyvalues hstore);
It seems pretty ugly to me, though; I'd prefer to be consistent one way or the other.
SQL/XML
Another option is to use SQL/XML, which you can index on arbitrary XPath expressions. Again, this seems a bit too complicated.
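A sketch of what that could look like, using an expression index over an XPath (table and path names are illustrative):

```sql
CREATE TABLE docs(id serial, doc xml);

-- xpath() returns xml[], so index the text form of the first match
CREATE INDEX docs_key1_idx
    ON docs (((xpath('/root/key1/text()', doc))[1]::text));

-- Queries must use the same expression to benefit from the index:
SELECT * FROM docs
WHERE (xpath('/root/key1/text()', doc))[1]::text = 'value1';
```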
PostgreSQL multicolumn index for BETWEEN and ORDER BY
I do not think it is possible (at least in vanilla PostgreSQL; I do not know of an extension that could help with that). The sorting step can be skipped only because an index already produces records in sorted order.
However:
- As mentioned in the doc, only B-tree indexes can be used for sorting (which makes sense, it is implemented using a search tree).
- Your WHERE and your ORDER BY are incompatible for a B-tree index:
  - Because you have both clauses, you need to put 2 columns in the index: (A, B).
  - Data in the index is sorted by (A, B), therefore it is also sorted by A (which is why PostgreSQL is able to index-scan the table fast when the WHERE is on A only). As a consequence, however, it is not sorted by B in the index (it is sorted by B only within each subset where A is constant, not across the entire table).
  - As you probably already know, having an index on B only will be of little help because of the WHERE.
The provided example #2 shows PostgreSQL is well-optimized for the case where you filter on a single value of A.
If it is unacceptable to sort on the 2 columns (A, B), then I'm afraid you shouldn't expect more than this.