How to Create an Index in Amazon Redshift

How to create an Index in Amazon Redshift

If you try and create an index (with a name) on a Redshift table:

create index IX1 on "SomeTable"("UserId");

You'll receive the error

An error occurred when executing the SQL command:
create index IX1 on "SomeTable"("UserId")
ERROR: SQL command "create index IX1 on "SomeTable"("UserId")" not supported on Redshift tables.

This is because, like other data warehouses, Redshift uses columnar storage, and as a result, many of the indexing techniques (like adding non-clustered indexes) used in other RDBMS aren't applicable.

You do however have the option of providing a single sort key per table, and you can also influence performance with a distribution key for sharding your data, and selecting appropriate compression encodings for each column to minimize storage and I/O overheads.

For example, in your case, you may elect to use UserId as a sort key:

create table if not exists "SomeTable"
(
"UserId" int,
"Name" text
)
sortkey("UserId");

You might want to read a few primers like these

Why does Redshift not need materialized views or indexes?

It's a bit disingenuous to be honest (in my opinion). Although RedShift has neither of these, I'm not sure that's the same as saying it wouldn't benefit from them.

Materialised Views

I have no real idea why they make this claim. Possibly because they consider the engine so performant that the gains from having them are minimal.

I would dispute this and the product I work on maintains its own materialised views and can show significant performance gains from doing so. Perhaps AWS believe I must be doing something wrong in the first place?

Indexes

RedShift does not have indexes.

It does have SORT ORDER which is exceptionally similar to a clustered index. It is simply a list of fields by which the data is ordered (like a composite clustered index).

It even has recently introduced INTERLEAVED SORT KEYS. This is a direct attempt to have multiple independent sort orders. Instead of ordering by a THEN b THEN c it effectively orders by each of them at the same time.

That becomes kind of possible because of how RedShift implements its column store.

- Each column is stored separately from each other column

- Each column is stored in 1MB blocks

- Each 1MB block has summary statistics

As well as being the storage pattern this effectively becomes a set of pseudo indexes.

- If the data is sorted by a then b then x
- But you want z = 1234
- RedShift looks at the block statistics (for column z) first

- Those stats will say the minimum and maximum values stored by that block

- This allows Redshift to skip many of those blocks in certain conditions

- This intern allows RedShift to identify which blocks to read from the other columns

List columns with indexes in Amazon Redshift

I figured out how to do it with the help of this https://bitbucket.org/zzzeek/sqlalchemy/pull-request/6/sqlalchemy-to-support-postgresql-80/diff

SELECT attname column_name, attnotnull, 
format_type(atttypid, atttypmod) as column_type, atttypmod,
i.indisprimary as primary_key,
col_description(attrelid, attnum) as description
FROM pg_attribute c
LEFT OUTER JOIN pg_index i
ON c.attrelid = i.indrelid AND i.indisprimary AND
c.attnum = ANY(string_to_array(textin(int2vectorout(i.indkey)), ' '))
where c.attnum > 0 AND NOT c.attisdropped AND c.attrelid = :tableOid
order by attnum

Oracle index to AWS Redshift Sortkey

Redshift is columnar database, and it doesn't have indexes in the same meaning as in Oracle at all.

You can think of Redshift's compound sort key (not interleaved) as IOT in Oracle (index organized table), with all the data sorted physically by this compound key.

If you create interleaved sort key on x columns, it will act as a separate index on each of x columns in some manner.

In any way, being columnar database, Redshift can outperform Oracle in many aggregation queries due to it's compression and data structure. The main factors that affect performance in Redshift are distribution style and key, sort key and columns encoding.

If you can't fit all your queries with one table structure, you can duplicate the table with different structure, but the same data. This approach is widely used in big data columnar databases (for example projections in Vertica) and helps to achieve performance with storage being the cost.

Please review this page with several useful tips about Redshift performance:
https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-techniques-for-amazon-redshift/

Convert indexes to sortkeys Redshift

You are asking the right questions so let's take these down one at a time. First, zonemaps are located on the leader node and stored on disk and the table data is stored on the compute nodes. They are located separate from each other. The zonemaps store the min and max values for every column for every 1MB block in a table. No matter if a column is in your sortkey list or not, there will be zonemap data for the block. When a column shows up in a WHERE clause Redshift will first compare to the zonemap data to decide if the block is needed for the query. If a block is not needed it won't be read from disk resulting in significant performance improvements for very large tables. I call this "block rejection". A few key points - This really only makes a difference on tables will 10s of millions of rows and when there are selective WHERE predicates.

So you have a number of reports each of which looks at the data by different aspects - common. You want all of these to work well, right? Now the first thing to note is that each table can have it's own sortkeys, they aren't linked. What is important is how does the choice of sortkeys affect the min and max values in the zonemaps for the columns you will use as WHERE clauses. With composite sortkeys you have to think about what impact later keys will have on the composition of the block - not much after the 3rd or 4th key. This is greatly impacted by the ordinality of the data but you get the idea. The good news is that sorting on one column will impact the zonemaps of all the columns so you don't always have to have a column in the sortkey list to get the benefit.

The question of compound vs interleaved sortkeys is a complicated one but remember you want to get high levels of block rejection as often as possible (and on the biggest tables). When different queries have different WHERE predicates it can be tricky to get a good mix of sortkeys to make this happen. In general compound sortkeys are easier to understand and have less table maintenance implications. You can inspect the zonemaps and see what impacts your sortkey choices are having and make informed decisions on how to adjust. If there are columns with low ordinality put those first so that the next sortkeys can have impact on the overall row order and therefore make block with different value ranges for these later keys. For these reasons I like compound keys over interleaved but there are cases where things will improve with interleaved keys. When you have high ordinality for all the columns and they are all equally important interleaved may be the right answer. I usually learn about the data trying to optimize compound keys that even if I end up with interleaved keys I can make smart choices about what columns I want in the sortkeys.

Just some metrics to help in you choice. Redshift can store 200,000 row elements in a single block and I've seen columns with over 2M elements per block. Blocks are distributed across the cluster so you need a lot of rows to fill up enough blocks that rejecting a high percentage of them is even possible. If you have a table of 5 million rows and you are sweating the sortkeys you are into the weeds. (Yes sorting can impact other aspects of the query like joining but these are sub-second improvements not make or break performance impacts.) Compression can have a huge impact on the number of row elements per block and therefore how many rows are represented in an entry in the zonemap. This can increase block rejection but will increase the read data needed to scan the entire table - a tradeoff you will want to make sure you are winning (1 query gets faster by 10 get slower is likely not a good tradeoff).

Your question about ordinality is a good one. If I sort my a high ordinality column first in a compound sortkey list this will set the overall order of the rows potentially making all other sortkeys impotent. However if I sort by a low ordinality column first then there is a lot of power left for other sortkeys to change the order of the rows and therefore the zonemap contents. For example if I have Col_A with only 100 unique values and Col_B which is a timestamp with 1microsecond resolution. If I sort by Col_B first all the rows are likely order just by sorting on this column. But if I sort by Col_A first there are lots of rows with the same value and the later sortkey (Col_B) can order these rows. Interleaved works the same way except which column is "first" changes by region of the table. If I interleave sort base on the same Col_A and Col_B above (just 2 sortkeys), then half the table will be sorted by Col_A first and half by Col_B first. For this example Col_A will be useless half of the time - not the best answer. Interleave sorting just modifies which column is use as the first sortkey throughout the table (and second and third if more keys are used). High ordinality in a sort key makes later sortkeys less powerful and this independent of sort style - it's just the interleave changes up which columns are early and which are late by region of the table.

Because ordinality of sortkeys can be such an important factor in gaining block rejection across many WHERE predicates that it is common to add derived columns to tables to hold lower ordinality versions of other columns. In the example above I might add Col_B2 to the table and have if just hold the year and month (month truncated date) of Col_B. I would use Col_B2 in my sortkey list but my queries would still be referencing Col_B. It "roughly" sorts based on Col_B so that Col_A can have some sorting power if it was to come later in the sortkey list. This is a common reason for making data model changes when moving Redshift.

It is also critical that "block rejecting" WHERE clauses on written against the fact table column, not applied to a dimension table column after the join. Zonemap information is read BEFORE the query starts to execute and is done on the leader node - it can't see through joins. Another data model change is to denormalize some key information into the fact tables so these common where predicates can be applied to the fact table and zonemaps will be back in play.

Sorry for the tome but this is a deep topic which I've spent year optimizing. I hope this is of use to you and reach out if anything isn't clear (and I hope you have the DISTKEYS sorted out already :) ).



Related Topics



Leave a reply



Submit